MystenLabs / sui

Sui, a next-generation smart contract platform with high throughput, low latency, and an asset-oriented programming model powered by the Move programming language
https://sui.io
Apache License 2.0
6.24k stars 11.2k forks

benchmark failed due to lost connection between primary and workers #5188

Open grandchildrice opened 2 years ago

grandchildrice commented 2 years ago

Description

Hello. I am writing a research paper on Narwhal at my college in Japan. Narwhal and other DAG-based mempool data structures are very interesting, and I hope there will be a revolution in blockchain once Sui launches on mainnet. I would very much like to contribute to Narwhal through this research.

I cloned the main branch of this repository only 3 hours ago, ran the benchmark, and got an error.

command lines

$ fab local
Starting local benchmark
Setting up testbed...
About to run ['cargo', 'build', '--quiet', '--release', '--features', 'benchmark']...
Running benchmark (20 sec)...
Parsing logs...

ERROR: Failed to run benchmark

Caused by:
  0: <class 'benchmark.logs.ParseError'>
  1: Primary(s) panicked
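For context, my understanding of the fab local flow (a simplified sketch, not the actual benchmark/fabfile.py; names are approximate): it builds the release binaries with the benchmark feature, runs the local testbed for the configured duration, and only afterwards parses the node logs, which is the step that raises the error above.

# Simplified sketch of the `fab local` flow (approximate, not the real fabfile code).
import subprocess

def run_local_benchmark(duration=20):
    print('Starting local benchmark')
    print('Setting up testbed...')
    cmd = ['cargo', 'build', '--quiet', '--release', '--features', 'benchmark']
    print(f'About to run {cmd}...')
    subprocess.run(cmd, check=True)   # build the node binaries

    print(f'Running benchmark ({duration} sec)...')
    # ... start 4 primaries, their workers and the clients, wait `duration` seconds, stop them ...

    print('Parsing logs...')
    # The ParseError reported above is raised here, while scanning the primary and
    # worker logs after the run, not while the nodes are running.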

logs

benchmark/logs/primary-3.log

2022-10-01T10:39:10.982702Z DEBUG process_vote{vote_digest=6criwpeaOMdj2jF6vkDnv81xg+isJT4Zd9oJy7/ygtk=}:process_certificate{certificate_digest=6criwpeaOMdj2jF6vkDnv81xg+isJT4Zd9oJy7/ygtk=}: narwhal_primary::core: Processing 6criwpeaOMdj2jF6: C7(pywHohDdqTG6KjdH7fTo5yHgnhyLIeUOC/bvz4BDqVVK9MSUqSDcUpAV0it0Z8IPA2wmHVIZ/Moful9DTjzAhffZaicsznP7kxx6kvbEp75PDfXMdyCl0j586c8LUdSX, r9pMDUppR7o9toHi, E0) round:7
2022-10-01T10:39:10.982709Z DEBUG request{route=/narwhal.PrimaryToPrimary/SendMessage version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.982718Z DEBUG request{route=/narwhal.PrimaryToPrimary/SendMessage version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.982746Z DEBUG request{route=/narwhal.PrimaryToPrimary/SendMessage version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.982915Z DEBUG request{route=/narwhal.PrimaryToPrimary/SendMessage version=V1}: anemo_tower::trace::on_response: finished processing request latency=0 ms status=200
2022-10-01T10:39:10.982923Z DEBUG process_vote{vote_digest=6criwpeaOMdj2jF6vkDnv81xg+isJT4Zd9oJy7/ygtk=}:process_certificate{certificate_digest=6criwpeaOMdj2jF6vkDnv81xg+isJT4Zd9oJy7/ygtk=}: narwhal_primary::core: Pruned 1 messages from obsolete rounds.
2022-10-01T10:39:10.982944Z DEBUG narwhal_consensus::bullshark: Processing 6criwpeaOMdj2jF6: C7(pywHohDdqTG6KjdH7fTo5yHgnhyLIeUOC/bvz4BDqVVK9MSUqSDcUpAV0it0Z8IPA2wmHVIZ/Moful9DTjzAhffZaicsznP7kxx6kvbEp75PDfXMdyCl0j586c8LUdSX, r9pMDUppR7o9toHi, E0)
2022-10-01T10:39:10.982958Z DEBUG narwhal_primary::proposer: Got enough support for leader pywHohDdqTG6KjdH7fTo5yHgnhyLIeUOC/bvz4BDqVVK9MSUqSDcUpAV0it0Z8IPA2wmHVIZ/Moful9DTjzAhffZaicsznP7kxx6kvbEp75PDfXMdyCl0j586c8LUdSX at round 7
2022-10-01T10:39:10.983001Z DEBUG request{route=/narwhal.PrimaryToPrimary/SendMessage version=V1}: anemo_tower::trace::on_response: finished processing request latency=0 ms status=200
2022-10-01T10:39:10.987726Z DEBUG request{route=/narwhal.PrimaryToPrimary/SendMessage version=V1}: anemo_tower::trace::on_response: finished processing request latency=5 ms status=200
2022-10-01T10:39:10.989026Z DEBUG connection-manager{peer=d84c6cdc}: anemo::network::connection_manager: new connection peer_id=5cc2fbbd5be3bdde64a7c5d1d7b4f7c3450169913a76db8f20e00d149ee45c6e
2022-10-01T10:39:10.989037Z DEBUG connection-manager{peer=d84c6cdc}: anemo::network::connection_manager: closing old connection with PeerId(5cc2fbbd5be3bdde64a7c5d1d7b4f7c3450169913a76db8f20e00d149ee45c6e) to mitigate simultaneous dial
2022-10-01T10:39:10.989216Z  INFO anemo::network::request_handler: InboundRequestHandler started peer=5cc2fbbd5be3bdde64a7c5d1d7b4f7c3450169913a76db8f20e00d149ee45c6e
2022-10-01T10:39:10.989232Z ERROR fetch_payload{digest=B29vPhYNLLldJ+Xi worker_id=0}:try_fetch_locally{digest=B29vPhYNLLldJ+Xi worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989234Z ERROR request{route=/narwhal.PrimaryToWorker/SendMessage version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=20 ms
2022-10-01T10:39:10.989382Z ERROR request{route=/narwhal.PrimaryToWorker/SendMessage version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=20 ms
2022-10-01T10:39:10.989413Z  INFO anemo::network::request_handler: InboundRequestHandler ended peer=5cc2fbbd5be3bdde64a7c5d1d7b4f7c3450169913a76db8f20e00d149ee45c6e
2022-10-01T10:39:10.98942Z ERROR fetch_payload{digest=B29vPhYNLLldJ+Xi worker_id=0}:try_fetch_locally{digest=B29vPhYNLLldJ+Xi worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989456Z ERROR fetch_payload{digest=AUNoqxv8aG8eqLxU worker_id=0}:try_fetch_locally{digest=AUNoqxv8aG8eqLxU worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989466Z ERROR fetch_payload{digest=AUNoqxv8aG8eqLxU worker_id=0}:try_fetch_locally{digest=AUNoqxv8aG8eqLxU worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989483Z ERROR fetch_payload{digest=w0f9gXoVQAknASdc worker_id=0}:try_fetch_locally{digest=w0f9gXoVQAknASdc worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989491Z ERROR fetch_payload{digest=w0f9gXoVQAknASdc worker_id=0}:try_fetch_locally{digest=w0f9gXoVQAknASdc worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989509Z ERROR fetch_payload{digest=BkUUjYj5yesQHeCM worker_id=0}:try_fetch_locally{digest=BkUUjYj5yesQHeCM worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989516Z ERROR fetch_payload{digest=BkUUjYj5yesQHeCM worker_id=0}:try_fetch_locally{digest=BkUUjYj5yesQHeCM worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989531Z ERROR fetch_payload{digest=p89tKWEvyzh6HlIB worker_id=0}:try_fetch_locally{digest=p89tKWEvyzh6HlIB worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989538Z ERROR fetch_payload{digest=p89tKWEvyzh6HlIB worker_id=0}:try_fetch_locally{digest=p89tKWEvyzh6HlIB worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989553Z ERROR fetch_payload{digest=czFghGtiytnhyKok worker_id=0}:try_fetch_locally{digest=czFghGtiytnhyKok worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989561Z ERROR fetch_payload{digest=czFghGtiytnhyKok worker_id=0}:try_fetch_locally{digest=czFghGtiytnhyKok worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989576Z ERROR fetch_payload{digest=RDWP/oijJxK6bNVp worker_id=0}:try_fetch_locally{digest=RDWP/oijJxK6bNVp worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989584Z ERROR fetch_payload{digest=RDWP/oijJxK6bNVp worker_id=0}:try_fetch_locally{digest=RDWP/oijJxK6bNVp worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989599Z ERROR fetch_payload{digest=4AaOJI8hZtw6TuaX worker_id=0}:try_fetch_locally{digest=4AaOJI8hZtw6TuaX worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989605Z ERROR fetch_payload{digest=4AaOJI8hZtw6TuaX worker_id=0}:try_fetch_locally{digest=4AaOJI8hZtw6TuaX worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.98962Z ERROR fetch_payload{digest=Yae5SkRrv0WCWkKV worker_id=0}:try_fetch_locally{digest=Yae5SkRrv0WCWkKV worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989627Z ERROR fetch_payload{digest=Yae5SkRrv0WCWkKV worker_id=0}:try_fetch_locally{digest=Yae5SkRrv0WCWkKV worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989672Z ERROR fetch_payload{digest=p2dwJS967bKH8bno worker_id=0}:try_fetch_locally{digest=p2dwJS967bKH8bno worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989685Z ERROR fetch_payload{digest=p2dwJS967bKH8bno worker_id=0}:try_fetch_locally{digest=p2dwJS967bKH8bno worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989701Z ERROR fetch_payload{digest=gRJrZrwOq+Pm6jR+ worker_id=0}:try_fetch_locally{digest=gRJrZrwOq+Pm6jR+ worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989709Z ERROR fetch_payload{digest=gRJrZrwOq+Pm6jR+ worker_id=0}:try_fetch_locally{digest=gRJrZrwOq+Pm6jR+ worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.989724Z ERROR fetch_payload{digest=cn6mbIPbTkqKgddk worker_id=0}:try_fetch_locally{digest=cn6mbIPbTkqKgddk worker_id=0}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_failure: response failed error=connection lost latency=23 ms
2022-10-01T10:39:10.989732Z ERROR fetch_payload{digest=cn6mbIPbTkqKgddk worker_id=0}:try_fetch_locally{digest=cn6mbIPbTkqKgddk worker_id=0}: narwhal_executor::subscriber: Error communicating with out own worker: Network error Status { status: Unknown, headers: {}, message: Some("unknown error: connection lost"), source: Some(connection lost

Caused by:
    closed) }
2022-10-01T10:39:10.990924Z DEBUG fetch_payload{digest=B29vPhYNLLldJ+Xi worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=ZzzoHVtjsbngK+RjMbxjL45tB5rax+V37DpTiX70deY= digest=B29vPhYNLLldJ+Xi}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.99095Z DEBUG fetch_payload{digest=AUNoqxv8aG8eqLxU worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=UsKfY0ci6HxxDYED90lyl1BAXwkyoYcF8zEbu6IYtLA= digest=AUNoqxv8aG8eqLxU}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.990972Z DEBUG fetch_payload{digest=w0f9gXoVQAknASdc worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=UsKfY0ci6HxxDYED90lyl1BAXwkyoYcF8zEbu6IYtLA= digest=w0f9gXoVQAknASdc}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991081Z DEBUG fetch_payload{digest=BkUUjYj5yesQHeCM worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=RS93l/HXOQz8x5mbalugJZFEsWn9yx20MKJ7H4cSD04= digest=BkUUjYj5yesQHeCM}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991148Z DEBUG fetch_payload{digest=p89tKWEvyzh6HlIB worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=ZzzoHVtjsbngK+RjMbxjL45tB5rax+V37DpTiX70deY= digest=p89tKWEvyzh6HlIB}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991202Z DEBUG fetch_payload{digest=czFghGtiytnhyKok worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=ZzzoHVtjsbngK+RjMbxjL45tB5rax+V37DpTiX70deY= digest=czFghGtiytnhyKok}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991242Z DEBUG fetch_payload{digest=RDWP/oijJxK6bNVp worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=XML7vVvjvd5kp8XR17T3w0UBaZE6dtuPIOANFJ7kXG4= digest=RDWP/oijJxK6bNVp}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991294Z DEBUG fetch_payload{digest=4AaOJI8hZtw6TuaX worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=UsKfY0ci6HxxDYED90lyl1BAXwkyoYcF8zEbu6IYtLA= digest=4AaOJI8hZtw6TuaX}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991367Z DEBUG fetch_payload{digest=Yae5SkRrv0WCWkKV worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=RS93l/HXOQz8x5mbalugJZFEsWn9yx20MKJ7H4cSD04= digest=Yae5SkRrv0WCWkKV}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991415Z DEBUG fetch_payload{digest=p2dwJS967bKH8bno worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=ZzzoHVtjsbngK+RjMbxjL45tB5rax+V37DpTiX70deY= digest=p2dwJS967bKH8bno}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991457Z DEBUG fetch_payload{digest=gRJrZrwOq+Pm6jR+ worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=XML7vVvjvd5kp8XR17T3w0UBaZE6dtuPIOANFJ7kXG4= digest=gRJrZrwOq+Pm6jR+}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request
2022-10-01T10:39:10.991504Z DEBUG fetch_payload{digest=cn6mbIPbTkqKgddk worker_id=0}:fetch_from_worker{stagger_delay=0ns worker=UsKfY0ci6HxxDYED90lyl1BAXwkyoYcF8zEbu6IYtLA= digest=cn6mbIPbTkqKgddk}:request{route=/narwhal.PrimaryToWorker/RequestBatch version=V1}: anemo_tower::trace::on_request: started processing request

I have tried several builds, and sometimes the same ERROR appears in the worker logs instead. From the logs, my reading is that try_fetch_locally() in executor/src/subscriber.rs fails when it calls network.request_batch, because the connection to its own worker has been lost.
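To see when the connections drop, I list the ERROR lines and their timestamps from every log with a small throwaway script (it assumes the default benchmark/logs layout shown above):

# Throwaway helper: print every ERROR line with its timestamp from all node logs.
# Assumes the default benchmark/logs/{primary,worker}-*.log layout.
from glob import glob

paths = sorted(glob('benchmark/logs/primary-*.log') + glob('benchmark/logs/worker-*.log'))
for path in paths:
    with open(path) as f:
        for line in f:
            if ' ERROR ' in line:
                timestamp = line.split()[0]   # the RFC 3339 timestamp is the first field
                print(f'{path} {timestamp} {line.split("ERROR", 1)[1].strip()[:100]}')

In the log above, these errors show up immediately after the "closing old connection ... to mitigate simultaneous dial" line from the connection manager.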

It works when I comment out the following code.

benchmark/benchmark/logs.py

    def _parse_primaries(self, log):
        # if search(r'(?:panicked|ERROR)', log) is not None:
        #     raise ParseError('Primary(s) panicked')
        ...

    def _parse_workers(self, log):
        # if search(r'(?:panic|ERROR)', log) is not None:
        #     raise ParseError('Worker(s) panicked')
        ...
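This works because the parser simply greps each whole log for that regex and aborts on the first match. A less drastic variant (a sketch of my own, not the upstream code) would be to keep failing on genuine panics and only warn about ERROR lines such as the transient connection-lost ones:

# Sketch of a softer check (my own idea, not the upstream logs.py code):
# fail only on real panics, and report ERROR lines as warnings instead.
from re import findall, search

class ParseError(Exception):
    pass

def check_primary_log(log):
    if search(r'panicked', log) is not None:
        raise ParseError('Primary(s) panicked')
    errors = findall(r'ERROR', log)
    if errors:
        print(f'WARN: {len(errors)} ERROR line(s) in a primary log; benchmark kept')

With the checks commented out as above, the benchmark run completes: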

output

Starting local benchmark
Setting up testbed...
About to run ['cargo', 'build', '--quiet', '--release', '--features', 'benchmark']...
Running benchmark (20 sec)...
Parsing logs...
WARN: Clients missed their target rate 90 time(s)

-----------------------------------------
 SUMMARY:
-----------------------------------------
 + CONFIG:
 Faults: 0 node(s)
 Committee size: 4 node(s)
 Worker(s) per node: 1 worker(s)
 Collocate primary and workers: True
 Input rate: 50,000 tx/s
 Transaction size: 512 B
 Execution time: 19 s

 Header size: 1,000 B
 Max header delay: 200 ms
 GC depth: 50 round(s)
 Sync retry delay: 10,000 ms
 Sync retry nodes: 3 node(s)
 batch size: 500,000 B
 Max batch delay: 200 ms
 Max concurrent requests: 500,000

 + RESULTS:
 Consensus TPS: 37,564 tx/s
 Consensus BPS: 19,232,809 B/s
 Consensus latency: 680 ms

 End-to-end TPS: 37,252 tx/s
 End-to-end BPS: 19,072,943 B/s
 End-to-end latency: 2,498 ms
-----------------------------------------
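As a quick sanity check, the BPS numbers are consistent with TPS multiplied by the 512 B transaction size (the small differences come from rounding in the printed TPS):

# Sanity check: BPS should be roughly TPS * transaction size (512 B).
tx_size = 512
print(f'{37_564 * tx_size:,}')   # 19,232,768 ~ reported consensus BPS of 19,232,809
print(f'{37_252 * tx_size:,}')   # 19,073,024 ~ reported end-to-end BPS of 19,072,943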

Steps to reproduce

pc spec
Device: MacBook Pro 13-inch (M1, 2020)
OS: macOS Monterey 12.2.1
CPU: Apple M1
Memory: 16 GB
Storage: 512 GB

network condition

dev dependencies
shell: zsh
Python: 3.7.9
rustup: 1.25.1
rustc: 1.63.0
cargo: 1.63.0

commands

$ cd benchmark
$ fab local
github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.