SaitoTech / saito-rust-workspace

Cargo workspace which contains all the rust projects for Saito
4 stars 3 forks source link

Identify reasons for memory spikes and crashes #712

Closed arpee closed 3 months ago

arpee commented 6 months ago

Image

We continue to have sudden increases in memory use during operations. We need to find a way to identify these so we can log and address them.

These spikes sometimes result in crashes due to out of memory problems.

arpee commented 6 months ago

Image

FYI Node Crash.

root@prod: grep -B 20 NODE logs/saito.log 
2024-03-29T04:24:41: on new block : 3fb63b93a3fa9f79ab15051412f494dc8f7041b64c744
f7b370f1532d91c1ac1
2024-03-29T04:24:41: #################
2024-03-29T04:24:41: ### New Block ### 34717
2024-03-29T04:24:41: #################
2024-03-29T04:24:41: $
2024-03-29T04:24:41: $
2024-03-29T04:24:41: $
2024-03-29T04:24:41: $ PURGING ARCHIVE BLOCK
2024-03-29T04:24:41: $
2024-03-29T04:24:41: $
2024-03-29T04:24:41: $
2024-03-29T04:24:41: #################
2024-03-29T04:24:41: ### New Block ### 34718
2024-03-29T04:24:41: #################
2024-03-29T04:24:41: onAddBlockSuccess : bfef9a6f1d2f794a5b530e1c130a28cbebc8eca0
ca69be86b09cda5a87676264 complete
2024-03-29T04:24:41: onAddBlockSuccess : 3fb63b93a3fa9f79ab15051412f494dc8f7041b6
4c744f7b370f1532d91c1ac1 complete
2024-03-29T04:24:41: Sending heartbeat spam tx: 6370
2024-03-29T04:28:41: Killed
2024-03-29T04:28:42: 
2024-03-29T04:28:42: > saito@0.2.0 start
2024-03-29T04:28:42: > NODE_OPTIONS='--max-old-space-size=8192' ts-node -T --file
s start.ts --

Process killed by out of memory system.

arpee commented 4 months ago

More info from discussion with Sanka in TG.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:04 AM]

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:05 AM] So - we had another crash... and I was trying to find the logs... and saw this.

Sanka Darshana, [13/5/2024 10:05 AM] I'll reduce the logs. I think I added too much logs to find that block fetch issue

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:05 AM] So the crash happened at 2024-05-12 just before 6am.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:06 AM] So in a couple of minutes we produced 200MB of logs.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:06 AM] I actually think that's going to tell us a lot.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:07 AM] So maybe don't :)

Sanka Darshana, [13/5/2024 10:07 AM] ok. 😃 I thought you sent the screenshot to say too much logs

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:17 AM] Not this time.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:17 AM] 1 2024-05-12T05:51:13 4 2024-05-12T05:51:16 37 2024-05-12T05:51:23 761 2024-05-12T05:51:24 8345 2024-05-12T05:51:25 2 2024-05-12T05:51:32 2 2024-05-12T05:54:41 47956 2024-05-12T05:54:43 103251 2024-05-12T05:54:44 120582 2024-05-12T05:54:45 122808 2024-05-12T05:54:46 85694 2024-05-12T05:54:47 74208 2024-05-12T05:54:48 74738 2024-05-12T05:54:49 48935 2024-05-12T05:54:50 73978 2024-05-12T05:54:51 64476 2024-05-12T05:54:52 72551 2024-05-12T05:54:53 66930 2024-05-12T05:54:54 75967 2024-05-12T05:54:55 86627 2024-05-12T05:54:56 90548 2024-05-12T05:54:57 111161 2024-05-12T05:54:58 107058 2024-05-12T05:54:59 60923 2024-05-12T05:55:00 63820 2024-05-12T05:55:01 49558 2024-05-12T05:55:02 65815 2024-05-12T05:55:03 77003 2024-05-12T05:55:04 72822 2024-05-12T05:55:05 75520 2024-05-12T05:55:06 88433 2024-05-12T05:55:07 93899 2024-05-12T05:55:08 90639 2024-05-12T05:55:09 53156 2024-05-12T05:55:10 49152 2024-05-12T05:55:11 36864 2024-05-12T05:55:12 40073 2024-05-12T05:55:13 15223 2024-05-12T05:55:14

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:17 AM] So this is the number of rows of log output per second... on the node...

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:18 AM] above that first row it's between 1 and 100 most of the time.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:18 AM] Looking around that time...

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:18 AM] I see a reorg start.

Sanka Darshana, [13/5/2024 10:19 AM] yes. current guess is reorg is causing this

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:20 AM] 2 2024-05-12T05:42:33 2 2024-05-12T05:42:46 2 2024-05-12T05:43:00 2 2024-05-12T05:43:15 2 2024-05-12T05:43:59 2 2024-05-12T05:44:30 2 2024-05-12T05:44:31 2 2024-05-12T05:44:44 2 2024-05-12T05:46:06 342 2024-05-12T05:51:25 4492 2024-05-12T05:54:43 9189 2024-05-12T05:54:44 10347 2024-05-12T05:54:45 11300 2024-05-12T05:54:46 7246 2024-05-12T05:54:47 6708 2024-05-12T05:54:48 6725 2024-05-12T05:54:49 3954 2024-05-12T05:54:50 6635 2024-05-12T05:54:51 5640 2024-05-12T05:54:52 6502 2024-05-12T05:54:53 5955 2024-05-12T05:54:54 6845 2024-05-12T05:54:55 7437 2024-05-12T05:54:56 8345 2024-05-12T05:54:57 9616 2024-05-12T05:54:58 9238 2024-05-12T05:54:59 5739 2024-05-12T05:55:00 5739 2024-05-12T05:55:01 4025 2024-05-12T05:55:02 5755 2024-05-12T05:55:03 7010 2024-05-12T05:55:04 6507 2024-05-12T05:55:05 6790 2024-05-12T05:55:06 7538 2024-05-12T05:55:07 8127 2024-05-12T05:55:08 8435 2024-05-12T05:55:09 4400 2024-05-12T05:55:10 4615 2024-05-12T05:55:11 3035 2024-05-12T05:55:12 3277 2024-05-12T05:55:13 1624 2024-05-12T05:55:14

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 10:20 AM] So, that's the same thing - but filtering for rows that contain 'reorg'....

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 11:28 AM] right.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 11:28 AM] maybe I can try to bottle it locally...

Sanka Darshana, [13/5/2024 11:29 AM] I downloaded the prod blocks to my laptop. will test with it too

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 11:30 AM] yeah - thinking about how to create a browser that has the wrong chain without it resetting itself...

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 11:30 AM] will test that first.

Sanka Darshana, [13/5/2024 11:31 AM] this should be easy I think. just nuke the server ?

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 11:32 AM] yes - plan is to start node locally and connect a browser... get it a few blocks... then restart the node but with the blocks from prod... then reopen the browser...

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 11:45 AM] well - that didn't trigger anything.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 12:04 PM] So... i did a few tests without success.

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 12:05 PM] I tried:

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 12:06 PM] And some variations - no crazy load on my laptop.

Sanka Darshana, [13/5/2024 12:51 PM] these things take a bit of time. fixing them is quick if I know how to recreate them

Richard Parris 🟥 (Notifications off by default)., [13/5/2024 1:16 PM] en...

Sanka Darshana, [13/5/2024 1:24 PM] it probably is a 2 different chains syncing.

Sanka Darshana, [13/5/2024 1:24 PM] or there are some inefficiencies in re org code which loads a lot of blocks in to memory

Sanka Darshana, [13/5/2024 1:28 PM] ok. full block count is 9. so not the blocks in memory