MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0
1.99k stars 529 forks source link

Investigate why some builds take much longer to finish #11307

Closed joseandro closed 2 years ago

joseandro commented 2 years ago

Investigate the root cause of the few builds that take too long (>250m) to finish.

matheus-o1labs commented 2 years ago

This is the overall scenario:

We have a long running process that is taking longer than expected. Sometimes it takes 5hs, other 7hs and recently I found an execution with more than 9 hours. However eventually it works faster with less than 40 minutes.

image.png

image.png

So it looks like 2 things are happening.

1: BK is trying to send the test executive a signal, IDK why or what the signal is. 2: the job is ignoring the signal

[90m# Received cancellation signal, interrupting[0m Terminated it looks like the integration test framework does process the "cancellation signal" but then BK keeps sending it signals

matheus-o1labs commented 2 years ago

In the logs I see the code is sending an exit signal in a loop. BK should have caught this and terminated the job. The exit code is 20563. Are you familiar with this?

[0;36m[2022-7-14 17:14:53.750926]DuneexeTest_executive: initializing log engine [0m

[0;35m[2022-7-14 17:14:58.649633]Integration_test_cloud_engine__Stack_driver_log_engine: Event subscription created [0m

[0;36m[2022-7-14 17:14:58.649669]DuneexeTest_executive: beginning to process network events [0m

[0;36m[2022-7-14 17:14:58.649794]DuneexeTest_executive: initializing dsl [0m

[0;36m[2022-7-14 17:14:58.649900]DuneexeTest_executive: initializing network abstraction [0m

[0;35m[2022-7-14 17:14:58.649940]Integration_test_cloud_engine__Kubernetes_network: Waiting for pods to be assigned nodes and become ready [0m

!!! === COMMAND ===

!!! gcloud "pubsub" "subscriptions" "pull" "projects/o1labs-192920/subscriptions/it-auto-b9498af8-zkapps" "--auto-ack" "--limit" "5" "--format" "table(DATA)"

!!! === STDOUT ===

!!!

!!! === STDERR ===

!!! ERROR: (gcloud.pubsub.subscriptions.pull) NOT_FOUND: Resource not found (resource=projects/o1labs-192920/subscriptions/it-auto-b9498af8-zkapps).

!!!

[0;31m[2022-7-14 17:15:11.374655]DuneexeTest_executive: (monitor.ml.Error"command exited with status code 1"("Raised at BaseError.raise in file \"src/error.ml\" (inlined), line 8, characters 14-30""Called from BaseOr_error.ok_exn in file \"src/or_error.ml\", line 84, characters 17-32""Called from Async_kernel__Deferred1.M.map.(fun) in file \"src/deferred1.ml\", line 17, characters 40-45""Called from Async_kernelJob_queue.run_job in file \"src/job_queue.ml\" (inlined), line 128, characters 2-5""Called from Async_kernelJob_queue.run_jobs in file \"src/job_queue.ml\", line 169, characters 6-47""Caught by monitor at file \"src/app/test_executive/test_executive.ml\", line 306, characters 27-27")) [0m

[0;35m[2022-7-14 17:15:11.374729]DuneexeTest_executive: cleaning up testnet (reason: "exception thrown") [0m

[0;35m[2022-7-14 17:15:29.501451]Integration_test_cloud_engine__Kubernetes_network: Pods assigned to nodes [0m

[0;35m[2022-7-14 17:15:29.501496]DuneexeTest_executive: Starting the daemons within the pods [0m

[0;35m[2022-7-14 17:15:29.501505]DuneexeTest_executive: starting seed-dbfd45497-k8g9b ... [0m

[0;35m[2022-7-14 17:15:31.108661]DuneexeTest_executive: seed-dbfd45497-k8g9b started [0m

[0;35m[2022-7-14 17:15:31.108728]Integration_test_lib__Dsl: Waiting for [seed-dbfd45497-k8g9b] to initialize (soft_timeout: "10m", hard_timeout: "15m") [0m

[90m# Received cancellation signal, interrupting [0m Terminated

++ kill -- 20563

+++ kill -- 20563

++++ kill -- 20563

+++++ kill -- 20563

++++++ kill -- 20563

matheus-o1labs commented 2 years ago

--- evidence ---

https://buildkite.com/organizations/o-1-labs-2/pipelines/mina/builds/24126/jobs/0181fdb4-42d1-4104-8102-261f39c6693d/raw_log

https://buildkite-cloud.s3.amazonaws.com/logs-by-pipeline/8f4b7485-ef17-469a-bec2-221aef440bff/01810640-7539-4306-8a80-3c1fc48be4ee/0181068e-c619-4014-929b-ef3a5efb3e9e.log?response-content-disposition=inline&response-content-type=text%2Fplain&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAQPCP3C7L3A4KQTJA%2F20220715%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220715T191749Z&X-Amz-Expires=600&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEHIaCXVzLWVhc3QtMSJGMEQCIGe%2FqGU6t5y3JoFKqs8GEPK7CtQ5dXfbWe6z0M6S9E4uAiApaWFkpu6heNhDiKJsiLOVl3ib8KJmBuQn8M92YHlBDCrbBAiq%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAAaDDAzMjM3OTcwNTMwMyIMUPRQxKzqEH39S%2BqvKq8E0H%2BtYhJb%2FSqKyq%2BCD1mHGC0%2Fj5dVB56x3csdVKlOksMpHOTouzUWOXSyPSoIbr7rW2Kv5UBrP%2BsUSuCi%2F1o9YBYk3xlBwwjWf77htnB%2FkCA13F21Tu7tGvGEt94yXIRzulxNWONY3JjIaUneNgeYBYg1Ql2pYzIekZ4WAV%2FXCpV21OueYH8YIsjPLUWmg%2FmIkdqPzVpIozTGMkdPWlJ4sjBdVdrLIN5C4aemsnczD5TeGYlBxeC9eLWXPctcgk702mIQsXvAh0EilprCZI%2F3gUS8lF4NWF2PncWjY5M5xCXNrQXj0sXg%2BhIFrBw90oJwoigwbMvZHssBX6ak0E5qifW88vcoEpoLLwLe8RRglPz5RbWSaG7jpy84qF1v63m95NJKhMbfL61CnKIv7LcqGUgiEe9aqCZged34qXrKRPMfKkc%2FdcaUnotl8Kk95p%2BBQYYXQw4Ay0TEuFTO%2F0QcIc%2Bfjx9c6td9deRip4%2FivJi%2F%2F1Zg6Yc%2B%2FzukjlqhIFwM3dOvL%2FQ6sLhvlqWES2a7%2FRk0pUeGZkbjwTyvVOjV8IKGVmqfx0YsODlbG%2Bz%2FtJ9eqxsJMyWG%2FL0rZNn%2BSRqWe7QeEWNa%2Blx3AdGZpVwwK86B%2FVAw270SlHZPnlVar%2BDwKYBsdogL49ln5Wkpbn7p6%2BGjmV6rnNoi0HTFmKyUPTYWeJsJZj5ynggXxmv3jyq0TTTTouPS%2FbmimjNaE7Z0WX7EJjQOa8qPsRWEj9VVEjCAwMaWBjqqATcFFQJvf%2F9vuXrDLAT1XX7%2BKHF4hsOYjFxPre4DUi7czWgXl3m5B6QUHNXQfRztgYQYCZJbDFmVH9SjGXAApyfxRn0kDBc47BNWWU6OUhBTym1X%2FruyzuVzir%2FwiCwupF4GyB1ldf3vHj9c3gLs1%2B9Rp38I48pA7G99PccRtFzXxBf8y7S50uGNmyrh9O46zmk%2FFYEyoaPwk5zbvtappj3RPtbI3gjqM0qs&X-Amz-SignedHeaders=host&X-Amz-Signature=6f0ee439fd357e05fd93d7c1df761eb9c47221a1eb181afbf52dce15df5aac79

matheus-o1labs commented 2 years ago

Tasks:

matheus-o1labs commented 2 years ago

Fast lint steps; binable compatibility changes,feature/compatible-hashing-3,0:01:16.219000 :pipeline:,feature/compatible-hashing-3,0:00:57.172000 Prepare monorepo triage,feature/compatible-hashing-3,0:00:46.026000 Monorepo triage,feature/compatible-hashing-3,0:00:42.072000 Zkapps test transaction tool unit tests,feature/compatible-hashing-3,0:18:05.723000 Archive node unit tests,feature/compatible-hashing-3,0:08:07.797000 Connect to testnet,feature/compatible-hashing-3,0:20:30.296000 SnarkyJS unit tests,feature/compatible-hashing-3,0:20:26.940000 Install Yarn dependencies,feature/compatible-hashing-3,0:00:25.423000 Build client SDK and run unit tests,feature/compatible-hashing-3,0:12:51.376000 Prepublish client SDK packages,feature/compatible-hashing-3,0:00:27.549000 Check GraphQL Schema,feature/compatible-hashing-3,0:19:40.372000 Fuzzy zkapp unit tests,feature/compatible-hashing-3,0:25:31.043000 Compare test signatures,feature/compatible-hashing-3,0:09:04.875000 Shared state integration test (migrated),feature/compatible-hashing-3,0:16:17.151000 dev unit-tests,feature/compatible-hashing-3,0:41:00.824000 Build test-executive,feature/compatible-hashing-3,0:06:05.741000 Build JS integration tests,feature/compatible-hashing-3,0:10:44.781000 peers-reliability integration test,feature/compatible-hashing-3,0:35:31.549000 chain-reliability integration test,feature/compatible-hashing-3,0:35:13.596000 payment integration test,feature/compatible-hashing-3,0:28:01.936000 delegation integration test,feature/compatible-hashing-3,0:26:19.190000 gossip-consis integration test,feature/compatible-hashing-3,0:28:00.502000 opt-block-prod integration test,feature/compatible-hashing-3,0:14:19.289000 medium-bootstrap integration test,feature/compatible-hashing-3,0:41:32.798000 zkapps integration test,feature/compatible-hashing-3,0:34:26.665000 zkapps-timing integration test,feature/compatible-hashing-3,0:39:50.259000 snarkyjs integration test,feature/compatible-hashing-3,0:29:54.502000 Replayer test,feature/compatible-hashing-3,0:22:22.299000 Publish client SDK to npm,feature/compatible-hashing-3,0:00:27.256000 Build Libp2p helper for Bionic,feature/compatible-hashing-3,0:00:50.406000 Build Mina for Bionic,feature/compatible-hashing-3,0:11:45.697000 Docker: daemon-devnet-bionic-docker-image,feature/compatible-hashing-3,0:15:45.225000 Docker: daemon-berkeley-bionic-docker-image,feature/compatible-hashing-3,0:15:51.028000 Docker: daemon-mainnet-bionic-docker-image,feature/compatible-hashing-3,0:16:12.257000 Docker: archive-bionic-docker-image,feature/compatible-hashing-3,0:19:47.718000 Docker: rosetta-bionic-docker-image,feature/compatible-hashing-3,0:33:52.974000 Docker: zkapp-test-transaction-bionic-docker-image,feature/compatible-hashing-3,0:14:14.169000 Build Libp2p helper for Stretch,feature/compatible-hashing-3,0:02:45.532000 Build Mina for Stretch,feature/compatible-hashing-3,0:14:00.291000 Docker: daemon-devnet-stretch-docker-image,feature/compatible-hashing-3,0:18:13.843000 Docker: daemon-berkeley-stretch-docker-image,feature/compatible-hashing-3,0:17:49.196000 Docker: daemon-mainnet-stretch-docker-image,feature/compatible-hashing-3,0:20:34.604000 Docker: archive-stretch-docker-image,feature/compatible-hashing-3,0:04:11.142000 Docker: rosetta-stretch-docker-image,feature/compatible-hashing-3,0:41:40.020000 Docker: zkapp-test-transaction-stretch-docker-image,feature/compatible-hashing-3,0:18:34.576000 Build Libp2p helper for Focal,feature/compatible-hashing-3,0:02:52.265000 Build Mina for Focal,feature/compatible-hashing-3,0:12:07.833000 Docker: daemon-devnet-focal-docker-image,feature/compatible-hashing-3,0:16:10.836000 Docker: daemon-berkeley-focal-docker-image,feature/compatible-hashing-3,0:16:21.350000 Docker: daemon-mainnet-focal-docker-image,feature/compatible-hashing-3,0:16:13.254000 Docker: archive-focal-docker-image,feature/compatible-hashing-3,0:15:24.911000 Docker: rosetta-focal-docker-image,feature/compatible-hashing-3,0:45:33.608000 Docker: zkapp-test-transaction-focal-docker-image,feature/compatible-hashing-3,0:14:37.830000 Build Libp2p helper for Bullseye,feature/compatible-hashing-3,0:01:02.584000 Build Mina for Bullseye,feature/compatible-hashing-3,0:20:13.967000 Docker: daemon-devnet-bullseye-docker-image,feature/compatible-hashing-3,0:24:41.720000 Docker: daemon-berkeley-bullseye-docker-image,feature/compatible-hashing-3,0:24:44.280000 Docker: daemon-mainnet-bullseye-docker-image,feature/compatible-hashing-3,0:24:42.308000 Docker: archive-bullseye-docker-image,feature/compatible-hashing-3,0:23:04.673000 Docker: rosetta-bullseye-docker-image,feature/compatible-hashing-3,0:26:15.772000 Docker: zkapp-test-transaction-bullseye-docker-image,feature/compatible-hashing-3,0:22:48.751000 Build Libp2p helper for Buster,feature/compatible-hashing-3,0:00:50.238000 Build Mina for Buster,feature/compatible-hashing-3,0:10:03.569000 Docker: daemon-devnet-buster-docker-image,feature/compatible-hashing-3,0:14:19.330000 Docker: daemon-berkeley-buster-docker-image,feature/compatible-hashing-3,0:13:55.349000 Docker: daemon-mainnet-buster-docker-image,feature/compatible-hashing-3,0:14:07.735000 Docker: archive-buster-docker-image,feature/compatible-hashing-3,0:13:15.160000 Docker: rosetta-buster-docker-image,feature/compatible-hashing-3,0:28:54.115000 Docker: zkapp-test-transaction-buster-docker-image,feature/compatible-hashing-3,0:12:05.257000 Check merges cleanly into compatible,feature/compatible-hashing-3,0:00:13.321000 Check merges cleanly into develop,feature/compatible-hashing-3,0:00:22.119000 "OCaml Lints; Check-format, Require-ppx-version",feature/compatible-hashing-3,0:07:56.758000 "Fast lint steps; CODEOWNERs, RFCs, Check Snarky Submodule, Preprocessor Deps",feature/compatible-hashing-3,0:00:06.590000 Fast lint steps; versions compatibility changes,feature/compatible-hashing-3,0:04:01.900000 Fast lint steps; binable compatibility changes,feature/compatible-hashing-3,0:01:16.219000

joseandro commented 2 years ago

Shouldn't this be in review?

matheus-o1labs commented 2 years ago

Closed for now as this was due in most part by node preemptive in google cloud.