FactomProject / factomd

Factom Daemon
https://www.factomprotocol.org/

Out of memory when building FastBoot file. Is that process supposed to consume lots of memory? #1062

Closed PaulBernier closed 3 years ago

PaulBernier commented 3 years ago

Factomd version: https://github.com/WhoSoup/factomd/tree/dev_wax_merge commit 4c31a6e82d74ae10f6fc822b89e661f397969b95

After deleting the FastBoot_custom_v13 file and starting the node, it seems to have crashed with an out-of-memory error (logs attached) while rebuilding that file. The node has 4 GB of memory. Why is this process using so much memory?

crash.log

WhoSoup commented 3 years ago

The code responsible for this is in state/loadDatabase.go if you want to check it out.

The way it works is:

  1. Start at the last saved height
  2. Load the DBlock, FBlock, ECBlock, ABlock, and EBlocks (and entries for required chains) for that height from the database
  3. Package all that data into a DBStateMsg
  4. Send the DBStateMsg to the MsgQueue
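
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in state/loadDatabase.go: the `Block` and `DBStateMsg` types, `loadHeight`, and `loadRange` are hypothetical stand-ins, and the database read is faked.

```go
package main

import "fmt"

// Block is a hypothetical stand-in for any of the per-height block types
// (DBlock, FBlock, ECBlock, ABlock, EBlock).
type Block struct {
	Kind   string
	Height uint32
}

// DBStateMsg bundles everything loaded for one height. This is a
// simplified, assumed shape, not factomd's real message type.
type DBStateMsg struct {
	Height uint32
	Blocks []Block
}

// loadHeight pretends to fetch all blocks for one height from the database.
func loadHeight(height uint32) []Block {
	kinds := []string{"DBlock", "FBlock", "ECBlock", "ABlock", "EBlock"}
	blocks := make([]Block, 0, len(kinds))
	for _, k := range kinds {
		blocks = append(blocks, Block{Kind: k, Height: height})
	}
	return blocks
}

// loadRange walks the heights from start to end (inclusive), packages each
// height into a DBStateMsg, and sends it to the queue. It closes the queue
// when it is done.
func loadRange(start, end uint32, queue chan<- DBStateMsg) {
	for h := start; h <= end; h++ {
		queue <- DBStateMsg{Height: h, Blocks: loadHeight(h)}
	}
	close(queue)
}

func main() {
	queue := make(chan DBStateMsg, 50)
	go loadRange(100, 104, queue)
	for msg := range queue {
		fmt.Printf("height %d: %d blocks\n", msg.Height, len(msg.Blocks))
	}
}
```

Everything held by a queued message stays in memory until a reader takes it off the queue, which is why the queue size matters in what follows.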

Now, there are two important factors in play here:

A. Between the heights 81,000 and 82,000, there were two load tests and that takes up quite a bit of space. For example, the block 81,753 has 7,105 entries and 7,093 entry commits. Just the raw data itself without any overhead is 1.17 MiB (~0.93 MiB of that is commits). Each individual commit/entry exists as a golang struct, so there is significant RAM overhead to store that data.

B. Golang has cooperative scheduling, and the MsgQueue where the DBStateMsgs are being sent has a capacity of 10,000. This means that Golang would just keep running that loop and stuffing more messages into the queue until it is full, even if there is no one reading it on the other side.

So it's certainly possible that it would take up a lot of RAM (particularly if the node only has a single core). There are definitely some changes we can make to mitigate this, such as throttling the message loop if the queue starts filling up, or redesigning this process to use a smaller queue / blocking queue.
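
As a rough sketch of the "smaller queue / blocking queue" idea, assuming the queue is a buffered Go channel: a send on a full buffered channel blocks, so the producer can never get more than the channel's capacity ahead of the reader. The `produce` and `run` functions and the integer messages are hypothetical and only illustrate the backpressure effect.

```go
package main

import (
	"fmt"
	"time"
)

// produce sends n messages into queue and returns the largest queue length
// it observed. The send blocks whenever the buffer is full, so the observed
// length can never exceed cap(queue).
func produce(n int, queue chan int) int {
	maxLen := 0
	for i := 0; i < n; i++ {
		queue <- i // blocks while the buffer is full
		if l := len(queue); l > maxLen {
			maxLen = l
		}
	}
	close(queue)
	return maxLen
}

// run pairs a fast producer with a deliberately slow consumer and reports
// how many messages were consumed and how full the queue ever got.
func run(n, capacity int) (consumed, maxLen int) {
	queue := make(chan int, capacity)
	done := make(chan int)
	go func() { done <- produce(n, queue) }()
	for range queue {
		time.Sleep(100 * time.Microsecond) // simulate a slow reader
		consumed++
	}
	maxLen = <-done
	return consumed, maxLen
}

func main() {
	consumed, maxLen := run(200, 5)
	fmt.Printf("consumed %d messages, queue never held more than %d\n", consumed, maxLen)
}
```

With a capacity of 5 the producer stalls almost immediately and memory use is bounded by five in-flight messages. The same loop with a capacity of 10,000 would let all 200 messages pile up before the reader catches up.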

WhoSoup commented 3 years ago

After further investigation, it seems the DBStates are pumped directly into the smaller message queue (size 50), rather than the big message queue (size 10,000), so it should at most load 50 DBStates into memory at a time. Looks like the problem may be somewhere else.

WhoSoup commented 3 years ago

Just tried investigating this again to see if I could spot the cause via pprof but I can't reproduce it. Ran the test on two separate machines:

Machine One (32GB ram)

Stayed pretty consistent at around 256MB between blocks 0 and 105,000, then jumped up to ~675MB for the remainder.

Machine Two (RPI4 w/ 2GB ram)

Watched this one closer. Stayed around 275MB system memory during the first 19,000 blocks, then slowly climbed up to around 375MB and back down to 315MB by 80k, then up to ~600MB around 128,000.

I am using a different build from yours (the latest), so maybe that makes a difference, or maybe something else caused it. Was the machine used exclusively for factomd, or was something else running in the background? Can you reproduce it?

Edit: Just ran it again on the same build (4c31a6e82d74ae10f6fc822b89e661f397969b95) and it turned out the same.
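
For anyone who wants to repeat this kind of measurement from inside a Go process, here is a minimal sketch using only the standard library (`runtime.ReadMemStats` and `pprof.WriteHeapProfile`). It is not necessarily how the numbers above were collected: those are system memory figures, while this reports the Go heap only. The helper names `heapInUseMiB` and `heapProfile` are made up for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapInUseMiB returns the size of the in-use heap spans in MiB.
func heapInUseMiB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapInuse) / (1 << 20)
}

// heapProfile forces a GC so the profile is up to date, then returns the
// heap profile in the format read by `go tool pprof`.
func heapProfile() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	before := heapInUseMiB()

	// Allocate 64 MiB so there is something to measure.
	data := make([][]byte, 64)
	for i := range data {
		data[i] = make([]byte, 1<<20)
	}

	after := heapInUseMiB()
	fmt.Printf("heap in use: %.1f MiB before, %.1f MiB after\n", before, after)

	profile, err := heapProfile()
	if err != nil {
		fmt.Println("could not write heap profile:", err)
		return
	}
	fmt.Printf("heap profile is %d bytes\n", len(profile))

	runtime.KeepAlive(data) // keep the allocation live until here
}
```

Logging `heapInUseMiB()` at regular block heights would give a curve comparable to the ones described above, and a heap profile written to a file near the peak shows which allocations dominate.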

PaulBernier commented 3 years ago

I also deployed the latest v6.9.0-rc1, which rebuilds the FastBoot v14 file, on the same machine this report was originally based on, and this time it didn't crash. Given that and your investigation I am going to close this issue. Thank you for looking into the details.