MystenLabs / sui

Sui, a next-generation smart contract platform with high throughput, low latency, and an asset-oriented programming model powered by the Move programming language
https://sui.io
Apache License 2.0

Gradual Increase in Memory Usage after Starting Sui Node #19180

Open shaokun11 opened 3 weeks ago

shaokun11 commented 3 weeks ago

Steps to Reproduce Issue

sui start .

Expected Result

The memory usage should stabilize after the initial startup, with no significant continuous increase over time.

Actual Result

Memory usage gradually increases over time. According to our memory monitoring, it increases a little with each epoch; the epoch duration is currently set to 6 hours.
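To quantify the per-epoch growth described above, one option is to sample the resident set size (RSS) of the node process at a fixed interval and compare readings across epoch boundaries. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is available; the function names, the PID, and the sampling interval are illustrative and not part of Sui.

```python
import re
import time
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def sample_rss(pid: int, interval_s: float, samples: int) -> list[tuple[float, int]]:
    """Collect (unix_time, rss_kib) pairs for a running process (Linux only)."""
    readings = []
    for _ in range(samples):
        # Re-read the status file on every iteration to get a fresh value.
        text = Path(f"/proc/{pid}/status").read_text()
        readings.append((time.time(), parse_vm_rss_kib(text)))
        time.sleep(interval_s)
    return readings
```

For example, `sample_rss(pid, 600, 72)` would take one reading every 10 minutes for 12 hours, which covers two 6-hour epochs; `pid` is the process ID of the running node and has to be looked up separately.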


I also followed https://github.com/MystenLabs/sui/issues/18067#issuecomment-2166567908 to update some environment variables, but it did not help.

System Information

shaokun11 commented 2 weeks ago

I would prefer that the data is not cleared with each restart. The current in-memory approach forces me to restart the node every couple of days. I wanted to check in on the progress of this issue. If there's anything I can do to assist or provide further information, please don't hesitate to reach out. Thank you!

stefan-mysten commented 4 days ago

Thanks @shaokun11 for reporting this and for your patience with the slow reply (I was OOO for a few weeks and we somehow missed this issue until now).

sui start should not clear the data with each restart, unless the --force-regenesis flag is used. Can you please confirm that indeed after you stop the network and restart it with sui start the data is lost? What is the command you are using?

Can you try to update your Sui CLI and see if the memory still increases so much in two days? Depending on which version you need, mainnet last release is here: https://github.com/MystenLabs/sui/releases/tag/mainnet-v1.32.2 and testnet last release is here: https://github.com/MystenLabs/sui/releases/tag/testnet-v1.33.1

shaokun11 commented 1 day ago

@ronny-mysten Thank you for your guidance. After upgrading to testnet-v1.33.1 of Sui, I am experiencing a new issue when I use sui start. The machine is an AWS EC2 r6i.24xlarge running Ubuntu 22.04. Is there anything else I need to do to upgrade to the new version, or is it enough to replace the binary with the new one?

stefan-mysten commented 1 day ago

@shaokun11 just to clarify, are you still experiencing memory problems, or are you referring to the ERROR message in the logs? If you are referring to the ERROR message in the logs: ERROR mysten_metrics::thread_stall_monitor, then do not worry too much about that one.

If the memory still grows fast after a day of running sui start, please let me know. It would also be good to share the purpose of starting a local network on an AWS machine, so that we can better understand the workflow and the end goal, and see if we can advise you to go a different route.

Thanks!

shaokun11 commented 1 day ago

@stefan-mysten We currently want to launch a Sui network locally to do some development, so a stable version is all I need. This node has now been running for nearly 2 months. The only problem is that each time a new epoch starts, memory usage increases by more than 30 GB, and we have to restart the node every two days.
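As a rough way to reason about the restart interval, the growth per epoch can be turned into an estimate of how long the node can run before it reaches a memory budget. This is a minimal sketch assuming linear growth; only the 30 GB per epoch and the 6-hour epoch length come from this thread, while the baseline and the budget in the usage note are hypothetical placeholders, not measurements.

```python
def epochs_until_limit(baseline_gb: float, growth_per_epoch_gb: float, limit_gb: float) -> int:
    """Number of whole epochs that fit before usage would exceed limit_gb.

    Assumes memory grows by a constant amount per epoch (an assumption,
    not something confirmed for the node).
    """
    if growth_per_epoch_gb <= 0:
        raise ValueError("growth_per_epoch_gb must be positive")
    headroom_gb = limit_gb - baseline_gb
    if headroom_gb < 0:
        return 0
    return int(headroom_gb // growth_per_epoch_gb)


def hours_until_limit(baseline_gb: float, growth_per_epoch_gb: float,
                      limit_gb: float, epoch_hours: float) -> float:
    """Convert the epoch count into wall-clock hours."""
    return epochs_until_limit(baseline_gb, growth_per_epoch_gb, limit_gb) * epoch_hours
```

With a hypothetical baseline of 20 GB and a hypothetical budget of 260 GB, `hours_until_limit(20, 30, 260, 6)` returns 48, that is, about two days of headroom at 30 GB per 6-hour epoch.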

After starting with sui start, the error occurred a few moments later and the node did not continue to sync. You can find the complete log file at 1.log.

ProtocolVersion(52)
Boot counter: 0
thread '2024-09-20T05:33:42.389732Z ERROR node{name=k#8dcff6d1..}: telemetry_subscribers: panicked at /home/ubuntu/sui/crates/sui-core/src/checkpoints/mod.rs:791:21:
transaction TransactionDigest(3YPdHDGJNG2dAT9ggXNaBqsVnZdC1pzvQTWbo37eNKw6) not found panic.file="/home/ubuntu/sui/crates/sui-core/src/checkpoints/mod.rs" panic.line=791 panic.column=21
k#8dcff6d1..' panicked at /home/ubuntu/sui/crates/sui-core/src/checkpoints/mod.rs:791:21:
transaction TransactionDigest(3YPdHDGJNG2dAT9ggXNaBqsVnZdC1pzvQTWbo37eNKw6) not found
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
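When sharing a panic like the one above, it can help to pull out the source location and the transaction digest so they are easy to search for in a longer log such as 1.log. The following is a minimal sketch using regular expressions written against the message format shown here; the function name and the patterns are illustrative and are not a Sui tool or API.

```python
import re

# Matches "panicked at <file>:<line>:<column>:" as printed in the panic message.
PANIC_LOCATION = re.compile(r"panicked at (?P<file>\S+):(?P<line>\d+):(?P<column>\d+):")
# Matches "TransactionDigest(<digest>)" and captures the digest text.
TX_DIGEST = re.compile(r"TransactionDigest\((\w+)\)")


def summarize_panic(log_text: str) -> dict:
    """Extract the first panic location and all distinct transaction digests."""
    location = PANIC_LOCATION.search(log_text)
    return {
        "file": location.group("file") if location else None,
        "line": int(location.group("line")) if location else None,
        "column": int(location.group("column")) if location else None,
        # The same digest is often printed more than once, so deduplicate.
        "digests": sorted(set(TX_DIGEST.findall(log_text))),
    }
```

Calling `summarize_panic(Path("1.log").read_text())` (with `from pathlib import Path`) would return one dictionary for the whole file; for the message above it should report `mod.rs` at line 791, column 21, and a single digest.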

1.log


stefan-mysten commented 19 hours ago

Thanks @shaokun11 for all the details, this is very helpful. Regarding the memory issue, I shared it with my colleagues. As for sui start not syncing, I will try to start and stop the DB locally and see if I can reproduce the issue. Worst case, I would suggest trying another version to see if you can restart the network from whatever you have in the local DB.

shaokun11 commented 9 hours ago

Thank you, @stefan-mysten, for your help on this issue!

So far, I've tested testnet-1.29.2 (the original version I started with), and it continues to sync, but the memory usage keeps increasing. Next, I will test other versions to check whether they can continue syncing and whether the memory increase is resolved. I will share any updates here as soon as I have new findings.

stefan-mysten commented 7 hours ago

Thanks for your patience here, @shaokun11. I want to try this locally as well but haven't had a chance yet. Hopefully I can find some time over the weekend. Thanks again!