Closed. c29r3 closed this issue 2 years ago.
Twice I received the error described above, and then the node did not come back up for more than a day.
Are you sure this is not related to the max RAM allocated to Docker?
I have not set a RAM limit
And here's the reason: the daemon tried to allocate 14,649,452 GB of memory (about 14.65 PB).
Attempted to allocate 14649452162192538 bytesFatal error: out of memory.
Coda process exited with status code 2
+ echo 'Coda process exited with status code 2'
+ sleep 10
+ kill 14
+ '[' '!' -f stay_alive ']'
+ exit 0
+ mkdir -p .coda-config
+ touch .coda-config/coda-prover.log
+ touch .coda-config/coda-verifier.log
+ touch .coda-config/mina-best-tip.log
+ command=daemon
+ shift
Another crash, this one asking for ~136 GB:
Attempted to allocate 136752041404 bytesFatal error: out of memory.
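For reference, converting those two byte counts into more familiar units; this is just a throwaway OCaml snippet, not anything from the Mina codebase:

(* Convert the reported allocation sizes into GiB / PiB. *)
let () =
  let gib bytes = float_of_int bytes /. (1024. *. 1024. *. 1024.) in
  List.iter
    (fun bytes ->
       Printf.printf "%d bytes = %.2f GiB = %.4f PiB\n"
         bytes (gib bytes) (gib bytes /. (1024. *. 1024.)))
    [ 14_649_452_162_192_538;   (* the first crash: roughly 13,600,000 GiB, i.e. ~13 PiB *)
      136_752_041_404 ]         (* this crash: roughly 127 GiB *)

In other words, the typical failing request is around 127 GiB, with the occasional petabyte-scale outlier.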
I have the same problem too.
: 2021-01-17 19:08:20 UTC [Error] Duplicate producer and slot: producer = $block_producer, block_producer: "B62qqLmebnE45d2myqMM5YNstVudto7VNzJmNNHKp4VyEmAbjXK3kRq" consensus_time: { "slot_number": "8637", "slots_per_epoch": "7140" } hash: "3NLfJ1nNhDzpkh5iuY36kh6Jo1scrqVfp73mNQbY36nP4zGPswi1" current_protocol_state_hash: "3NKYqMaCHqtKYdik9yz6MJn32YvN1KVixPmW86K6PHRTiRMwsXZP" Attempted to allocate 137386859548 bytesFatal error: out of memory.
Same problem here with the latest build
Commit [DIRTY]a9402473584c36b347d52df4f4000b7286385987 on branch HEAD
Attempted to allocate 136348821002 bytesFatal error: out of memory.
Which is about 127 GiB.
Another occurrence.
Attempted to allocate 136703156765 bytesFatal error: out of memory.
on version
Commit [DIRTY]a9402473584c36b347d52df4f4000b7286385987 on branch HEAD
With this latest version
Commit [DIRTY]d075f83d26490f6510bbb14bbfe3c771256257b5 on branch HEAD
Had this just now
Attempted to allocate 136771063836 bytesFatal error: out of memory.
With the Encore testnet version
Commit [DIRTY]3ef86631e3a38150b5092faec47da144b0a46020 on branch HEAD
I've seen two occurrences in the past 24 hours.
Attempted to allocate 137127597980 bytesFatal error: out of memory.
Attempted to allocate 137065315144 bytesFatal error: out of memory.
On the Zenith testnet
Commit [DIRTY]245a3f7d883c516f5f16742cb1ca672872612851 on branch HEAD
Got:
Attempted to allocate 91978576875 bytesFatal error: out of memory.
Star.LI#0785 had this on Encore.
Attempted to allocate 137288557616 bytesFatal error: out of memory.
My node crashed due to out of memory. Does anyone else have the same issue?
BTW, my node has 128 GB of memory.
https://discord.com/channels/484437221055922177/754653322845356103/813240010231775254
I just had this on Zenith.
Commit [DIRTY]245a3f7d883c516f5f16742cb1ca672872612851 on branch HEAD
2021-02-22 03:44:04 UTC [Info] Received a block from $sender
sender: {
"Remote": {
"host": "88.198.26.117",
"peer_id": "12D3KooWH3kYuWRDnBDLn7z6xH9xTdnfRPZUDMrQYTbLnzPCZwEq",
"libp2p_port": 8302
}
}
Attempted to allocate 137351565754 bytesFatal error: out of memory.
Another occurrence reported on Discord: https://discord.com/channels/484437221055922177/812104065168310303/813821331606601743
I had this happen just a few hours ago.
Ledger Merkle root: jxdBddivtWL6zZwhdBDuK9ibpngS7P4rh73FR1jcUJbXwMUvddg Protocol state hash: 3NLWJdXYMrUi1dQkBGvG2Hpgv6qPkXZJJyMEUDwWNBWULxwViHx3 Chain id: 90b71f6f798dec88a1afc825cd0b358c6d8a3ff3c0b57a7fe97412ea5a639c2b Git SHA-1: fd3980820fb82c7355af49462ffefe6718800b77
Mar 7 15:17:18 mina-testworld mina[18179]: 2021-03-07 14:17:18 UTC [Info] Received a block from $sender
Mar 7 15:17:18 mina-testworld mina[18179]: #011sender: {
Mar 7 15:17:18 mina-testworld mina[18179]: "Remote": {
Mar 7 15:17:18 mina-testworld mina[18179]: "host": "135.181.3.211",
Mar 7 15:17:18 mina-testworld mina[18179]: "peer_id": "12D3KooWK1KyyDSrrtEcJvjZ56cTsMqtk46FWSuS7HpsGhUowWfh",
Mar 7 15:17:18 mina-testworld mina[18179]: "libp2p_port": 8302
Mar 7 15:17:18 mina-testworld mina[18179]: }
Mar 7 15:17:18 mina-testworld mina[18179]: }
Mar 7 15:17:20 mina-testworld mina[18179]: 2021-03-07 14:17:20 UTC [Info] Received a block from $sender
Mar 7 15:17:20 mina-testworld mina[18179]: #011sender: {
Mar 7 15:17:20 mina-testworld mina[18179]: "Remote": {
Mar 7 15:17:20 mina-testworld mina[18179]: "host": "94.74.101.26",
Mar 7 15:17:20 mina-testworld mina[18179]: "peer_id": "12D3KooWNuA1pY5aY8x9nCcj8FqzACYDnXDAD4sK3LWvnrxyZrCd",
Mar 7 15:17:20 mina-testworld mina[18179]: "libp2p_port": 1027
Mar 7 15:17:20 mina-testworld mina[18179]: }
Mar 7 15:17:20 mina-testworld mina[18179]: }
Mar 7 15:17:20 mina-testworld mina[18179]: {"timestamp":"2021-03-07 14:17:20.742036Z","level":"Debug","source":{"module":"Snark_workerFunctor","location":"File \"src/lib/snark_worker/functor.ml\", line 169, characters 8-20"},"message":"Snark worker working directory $dir","metadata":{"dir":"/","pid":18278,"process":"Snark Worker"}}
Mar 7 15:17:20 mina-testworld mina[18179]: {"timestamp":"2021-03-07 14:17:20.742248Z","level":"Debug","source":{"module":"Snark_workerFunctor","location":"File \"src/lib/snark_worker/functor.ml\", line 181, characters 6-18"},"message":"Snark worker using daemon $addr","metadata":{"addr":"127.0.0.1:8301","pid":18278,"process":"Snark Worker"}}
Mar 7 15:17:23 mina-testworld mina[18179]: Attempted to allocate 136548740755 bytesFatal error: out of memory.
Mar 7 15:17:23 mina-testworld systemd[1]: mina.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 7 15:17:24 mina-testworld systemd[1]: mina.service: Failed with result 'exit-code'.
Every 3-10 hours I have to restart the service; this started after I installed the sidecar. Same error: attempted to allocate bytes, fatal error: out of memory.
Uptime
Still happening.
minaprotocol/mina-archive:1.1.3-48401e9
Attempted to allocate 92152191331 bytesFatal error: out of memory.
+ tail -q -f mina.log
Mina process exited with status code 2
2021-03-27 07:37:37 UTC [Info] Coda daemon is booting up; built with commit "a8893ab6dd8a68171e7b99a5dc6b76940411350b" on branch "master"
Using password from environment variable CODA_PRIVKEY_PASS
2021-03-27 07:37:37 UTC [Info] Created daemon lockfile "/root/.mina-config/.mina-lock"
2021-03-27 07:37:37 UTC [Info] Registering async shutdown handler: "Remove daemon lockfile"
2021-03-27 07:37:37 UTC [Info] Daemon will expire at "2024-12-10 14:00:00-07:00"
2021-03-27 07:37:37 UTC [Info] Booting may take several seconds, please wait
2021-03-27 07:37:37 UTC [Info] Reading configuration files $config_files
config_files: [
"/var/lib/coda/config_a8893ab6.json", "/root/.mina-config/daemon.json",
"/var/lib/coda/config_a8893ab.json"
This happens almost once per 24 hours on every node I run (1.1.3-48401e9, 1.1.4-a8893ab).
Attempted to allocate 136854715828 bytesFatal error: out of memory.
Mina process exited with status code 2
Another occurrence reported on Discord.
https://discord.com/channels/484437221055922177/799597981762453535/825869389995704400
I just had this two hours ago.
!!! 2021-03-29 06:01:16 UTC [Info] Received a block from $sender
!!! sender: {
!!! "Remote": {
!!! "host": "3.236.207.131",
!!! "peer_id": "12D3KooWF162ZD7FNU29or3AMRPMB5pTvG2ZtZJdxciBGLXXNsUy",
!!! "libp2p_port": 8302
!!! }
!!! }
[2021-3-29 06:01:19.171322]Snark_worker__Functor: Snark worker working directory "/home/zbostrom"
[2021-3-29 06:01:19.171418]Snark_worker__Functor: Snark worker using daemon "127.0.0.1:8301"
[2021-3-29 06:01:19.374481]Snark_worker__Functor: No jobs available. Napping for 5.954796458038227 seconds
!!! Attempted to allocate 136557433134 bytesFatal error: out of memory.
Again, just now:
2021-03-29 16:43:08 UTC [Info] Received a block from $sender
sender: {
"Remote": {
"host": "34.75.16.224",
"peer_id": "12D3KooWEdBiTUQqxp3jeuWaZkwiSNcFxC6d6Tdq7u2Lf2ZD2Q6X",
"libp2p_port": 10003
}
}
Attempted to allocate 136586112995 bytesFatal error: out of memory.
Again, about 9 hours ago.
2021-03-30 19:49:29 UTC [Info] Received a block from $sender
sender: {
"Remote": {
"host": "135.181.76.248",
"peer_id": "12D3KooWS7jKsHMuMp8sCjSQVFqU3b9hLwBRHDjjaiUXhYDYkt3v",
"libp2p_port": 8302
}
}
Attempted to allocate 136682221557 bytesFatal error: out of memory.
Again
2021-04-03 11:46:06 UTC [Info] Received a block from $sender
sender: {
"Remote": {
"host": "195.201.173.222",
"peer_id": "12D3KooWDvHJgAF2jyBug5u4R7tWooXAUsdrSKEtVDeq8JnED5cY",
"libp2p_port": 8302
}
}
Attempted to allocate 92661310289 bytesFatal error: out of memory.
@gregbostrom can you share the flags you're using? We've been looking for a way to reproduce this so that we can capture some core dumps... it seems you've found a pretty reliable one!
mina daemon \
-peer-list-file ~/peers.txt \
-generate-genesis-proof true \
-block-producer-key ~/keys/my-wallet \
-block-producer-password $MINA_PRIVKEY_PASS \
-file-log-level Info \
-log-level Info \
-limited-graphql-port 3095
I think most people encountering this problem are not reporting it and I would not call mine a reliable case.
I just keep reporting it because I think it needs to be fixed. You might consider screening out very, very large memory allocation requests and ignoring the offending request instead of crashing.
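To illustrate the "screen out huge requests" idea, here is a minimal OCaml sketch. It assumes a hypothetical checked_create wrapper and an arbitrary 4 GiB cap; this is not the Mina codebase's actual allocation path, just the shape of the guard:

(* Hypothetical guard: reject absurd allocation requests (for example, a
   bogus length read off the network) instead of letting the runtime die
   with "Fatal error: out of memory". The cap value is an assumption. *)
exception Allocation_too_large of int

let max_reasonable_alloc = 4 * 1024 * 1024 * 1024  (* assumed 4 GiB cap *)

let checked_create (len : int) : Bytes.t =
  if len < 0 || len > max_reasonable_alloc
  then raise (Allocation_too_large len)
  else Bytes.create len

let () =
  (* One of the sizes reported in this thread. *)
  match checked_create 136_854_715_828 with
  | _buf -> print_endline "allocated"
  | exception Allocation_too_large n ->
      Printf.printf "ignored suspicious request for %d bytes\n" n

The caller could then drop the offending message or peer rather than taking down the whole daemon.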
I just checked the logs for my nodes for the last 7 days and found 15 occurrences of this.
Attempted to allocate 137166189101 bytesFatal error: out of memory.
Mina process exited with status code 2
2021-04-07 13:17:39 UTC [Info] Coda daemon is booting up; built with commit "a8893ab6dd8a68171e7b99a5dc6b76940411350b" on branch "master"
Using password from environment variable CODA_PRIVKEY_PASS
4 crashes. Restarting without a clean .mina-config does not go well.
I think the memory allocation problem still exists in version 1.2.0beta1-c856692-mainnet,
possibly more hidden than before.
2021-07-02 10:28:28 UTC [Error] Possible reason for signal: "Process killed because out of memory"
Log extract:
2021-07-02 10:27:31 UTC [Warn] RPC call error for "get_transition_chain_proof"
2021-07-02 10:27:35 UTC [Error] error sending message on stream 25703: $error
error: { "commit_id": "c856692fddc525a673ba075f714811b5c50bd3a7", "string": "RPC #46101 failed: \"only wrote 0 out of 9 bytes error: libp2p error error: closed stream\"" }
2021-07-02 10:27:35 UTC [Warn] RPC call error for "get_transition_chain"
2021-07-02 10:27:41 UTC [Warn] initial_validate: disconnected chain
2021-07-02 10:27:51 UTC [Warn] Not rebroadcasting block $state_hash because it was received "1 slots too late"
state_hash: "3NLex6ZoBK5FxhxRwPwmEBHLVEycVBf2kWULC9gTcDwPq5wskHQz"
2021-07-02 10:27:51 UTC [Info] Saw block with state hash $state_hash
state_hash: "3NLex6ZoBK5FxhxRwPwmEBHLVEycVBf2kWULC9gTcDwPq5wskHQz"
2021-07-02 10:27:51 UTC [Warn] initial_validate: disconnected chain
[the above warning repeated several dozen more times within the same second]
2021-07-02 10:27:54 UTC [Warn] RPC call error for "get_transition_chain_proof"
2021-07-02 10:27:55 UTC [Info] Received a block from $sender
sender: { "Remote": { "host": "34.122.226.192", "peer_id": "12D3KooWAbpbSc9WbfkrJE8FQNLebzL9A7WGUVaMFaEjexW4MPmU", "libp2p_port": 8302 } }
2021-07-02 10:28:05 UTC [Fatal] Unhandled top-level exception: $exn Generating crash report
exn: { "commit_id": "c856692fddc525a673ba075f714811b5c50bd3a7", "sexp": [ "monitor.ml.Error", "Cached item has already been finalized", [ "Raised at file \"src/error.ml\" (inlined), line 9, characters 14-30", "Called from file \"src/lib/cache_lib/impl.ml\", line 173, characters 30-51", "Called from file \"src/lib/cache_lib/impl.ml\", line 199, characters 6-69", "Called from file \"src/lib/ledger_catchup/super_catchup.ml\", line 928, characters 25-58", "Called from file \"src/deferred1.ml\", line 17, characters 40-45", "Called from file \"src/job_queue.ml\" (inlined), line 131, characters 2-5", "Called from file \"src/job_queue.ml\", line 171, characters 6-47", "Caught by monitor coda" ] ], "backtrace": [ "Raised at file \"format.ml\" (inlined), line 242, characters 35-52", "Called from file \"format.ml\", line 469, characters 8-33", "Called from file \"format.ml\", line 484, characters 6-24" ] }
2021-07-02 10:28:05 UTC [Info] Updating new available work took 21.804094314575195 ms
2021-07-02 10:28:28 UTC [Error] Daemon child process 97 terminated after receiving signal "sigkill"
2021-07-02 10:28:28 UTC [Error] Possible reason for signal: "Process killed because out of memory"
2021-07-02 10:28:28 UTC [Error] Child process of kind "Prover" with pid 97 has unexpectedly terminated
2021-07-02 10:28:28 UTC [Fatal] Unhandled top-level exception: $exn Generating crash report
exn: { "commit_id": "c856692fddc525a673ba075f714811b5c50bd3a7", "sexp": [ "monitor.ml.Error", [ "Failure", "Child process of kind Prover has unexpectedly terminated" ], [ "Raised at file \"stdlib.ml\", line 33, characters 17-33", "Called from file \"src/app/cli/src/cli_entrypoint/mina_cli_entrypoint.ml\", line 579, characters 10-94", "Called from file \"src/deferred0.ml\", line 56, characters 64-69", "Called from file \"src/job_queue.ml\" (inlined), line 131, characters 2-5", "Called from file \"src/job_queue.ml\", line 171, characters 6-47", "Caught by monitor coda" ] ], "backtrace": [ "Raised by primitive operation at file \"src/signal.ml\", line 162, characters 6-61" ] }
2021-07-02 10:28:28 UTC [Error] verifier terminated unexpectedly
2021-07-02 10:28:28 UTC [Info] Starting a new verifier process
2021-07-02 10:28:28 UTC [Info] verifier successfully stopped
2021-07-02 10:28:28 UTC [Info] Rebroadcasting $state_hash
state_hash: "3NLnPyR37jEecT8RdmJ6m7HKnNxicr8gj1kKA4x5iZC4nueGevJ4"
2021-07-02 10:28:28 UTC [Fatal] libp2p_helper process died unexpectedly: "died after receiving sigkill (signal number 9)"
2021-07-02 10:28:28 UTC [Error] error during validationComplete, ignoring and continuing: $error
error: { "commit_id": "c856692fddc525a673ba075f714811b5c50bd3a7", "string": "helper process already exited (doing RPC {\"seqno\":25735,\"is_valid\":\"accept\"})" }
I'm seeing a lot of these kinds of errors in my logs, not sure if it's related.
2021-07-22 09:44:53 UTC [Error] error sending message on stream 1050: $error
error: {
"commit_id": "a42bdeef6b0c15ee34616e4df76c882b0c5c7c2a",
"string":
"RPC #9486 failed: \"only wrote 0 out of 9 bytes error: libp2p error error: closed stream\""
}
and
2021-07-22 09:31:38 UTC [Warn] verification of blockchain snark failed but it was our fault
2021-07-22 09:31:38 UTC [Error] error sending message on stream 1023: $error
error: {
"commit_id": "a42bdeef6b0c15ee34616e4df76c882b0c5c7c2a",
"string":
"RPC #9282 failed: \"only wrote 0 out of 36 bytes error: libp2p error error: closed stream\""
}
We are on minaprotocol/mina-daemon-baked:1.1.5-a42bdee
and we periodically get the "Attempted to allocate 136699113500 bytesFatal error: out of memory." crash.
I see multiple issues open for memory related issues. Is there any main one tracking the problem?
Machine is on:
# free -h
total used free shared buff/cache available
Mem: 125Gi 7.4Gi 51Gi 0.0Ki 66Gi 117Gi
Swap: 4.0Gi 4.0Mi 4.0Gi
Got this today on 1.1.5-a42bdee:
Aug 24 09:59:40 Ubuntu-1804-bionic-64-minimal mina[2359]: Attempted to allocate 137200417204 bytesFatal error: out of memory.
Aug 24 09:59:41 Ubuntu-1804-bionic-64-minimal systemd[1043]: mina.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 24 09:59:41 Ubuntu-1804-bionic-64-minimal systemd[1043]: mina.service: Failed with result 'exit-code'.
Release 1.2.0beta6-bee023a:
Sep 18 01:33:35 kernel: [583074.116491] [UFW BLOCK] IN=enp1s0 OUT= MAC=68:05:ca:e6:44:a9:5c:5e:ab:d0:66:c0:08:00 SRC=89.248.165.61 DST=209.236.118.26 LEN=40 TOS=0x00 PREC=0x00 TTL=242 ID=13885 PROTO=TCP SPT=43882 DPT=40159 WINDOW=1024 RES=0x00 SYN URGP=0
Sep 18 01:33:43 mina[12832]: Attempted to allocate 15253661536846873 bytesFatal error: out of memory.
Sep 18 01:33:43 systemd[1522]: mina.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
This instance of the issue was reproduced and addressed in a number of PRs in late 2020 / early 2021. If we see it again, we can reopen.
Description
My node has crashed, but I don't see the crash report in the .coda-config folder :thinking: In the logs I see an OOM, but according to Grafana the maximum memory usage was 17 GB out of 64 GB.
Environment
Steps to reproduce
#qa-task-force
LOGS
https://drive.google.com/file/d/1chFyQnD3CnlU_AK2Y6ddJZpNJlDvUcD-/view?usp=sharing