MystenLabs / sui

Sui, a next-generation smart contract platform with high throughput, low latency, and an asset-oriented programming model powered by the Move programming language
https://sui.io
Apache License 2.0

Sui oom-kill #19832

Open zhy827827 opened 4 days ago

zhy827827 commented 4 days ago

After updating to the new Sui version, the Sui node is very unstable and is frequently OOM-killed. Previously, servers with 64GB of memory ran smoothly, but now even servers with 128GB of memory are being OOM-killed.

Oct  8 22:16:11 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct  9 17:41:34 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 00:48:31 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 05:27:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:03 rockx-mainnet-merlin-sg-01 kernel: [15035328.923887] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:07 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 12:53:18 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 16:42:08 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 18:58:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 19:56:33 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 21:18:20 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 22:12:38 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 23:03:57 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 23:56:50 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 00:30:36 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 01:01:02 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 01:53:52 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:04:53 rockx-mainnet-merlin-sg-01 kernel: [15098318.993834] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:04:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:34:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:58:24 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 04:55:38 rockx-mainnet-merlin-sg-01 kernel: [15104964.510899] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 05:03:19 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 05:38:30 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 06:15:05 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 07:51:47 rockx-mainnet-merlin-sg-01 kernel: [ 1883.157041] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 07:51:53 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 08:19:59 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 09:06:43 rockx-mainnet-merlin-sg-01 kernel: [ 6378.583779] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 10:36:48 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 13:30:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 15:59:22 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 17:34:36 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 18:42:59 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 19:31:25 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 20:19:29 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 21:49:21 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 23:28:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 01:08:12 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL

Oct 12 03:13:51 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: A process of this unit has been killed by the OOM killer.
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Failed with result 'oom-kill'.
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Consumed 1h 26min 28.180s CPU time.
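
To see how often the node is being killed, the journal lines above can be summarized with a short script. This is a minimal sketch, not a Sui tool: the function names (`parse_kill_times`, `kills_per_day`, `gaps_minutes`) are made up for illustration, and because the log lines carry no year, the `year` argument is an assumption that must be supplied by the caller. Only `sui.service` kill lines are counted; the `systemd-journald.service` lines reported via `kernel:` are skipped.

```python
import re
from collections import Counter
from datetime import datetime

# Matches only the sui.service kill lines, e.g.
# "Oct  8 22:16:11 <host> systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL"
KILL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: "
    r"sui\.service: Main process exited, code=killed, status=9/KILL"
)


def parse_kill_times(lines, year):
    """Return the timestamps of sui.service kills found in `lines`.

    The journal format has no year, so `year` is an assumption of the caller.
    """
    times = []
    for line in lines:
        m = KILL_RE.match(line.strip())
        if m:
            stamp = f"{year} {m.group('ts')}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def kills_per_day(times):
    """Count kills per calendar day."""
    return Counter(t.date() for t in times)


def gaps_minutes(times):
    """Minutes between consecutive kills (roughly how long each run survived)."""
    ordered = sorted(times)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


# A few lines copied from the log above, including one journald line that is ignored.
SAMPLE = """\
Oct  8 22:16:11 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct  9 17:41:34 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 00:48:31 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 05:27:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:03 rockx-mainnet-merlin-sg-01 kernel: [15035328.923887] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:07 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
""".splitlines()

if __name__ == "__main__":
    kills = parse_kill_times(SAMPLE, year=2024)  # year is assumed, not in the log
    for day, count in sorted(kills_per_day(kills).items()):
        print(day, count)
    print([round(g) for g in gaps_minutes(kills)])
```

Running the same functions over the full log would show whether the time between kills is shrinking, which is a rough proxy for how quickly memory grows after each restart.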

stefan-mysten commented 4 days ago

@zhy827827 which version are you trying to run?

zhy827827 commented 4 days ago

@stefan-mysten I'm on the latest version; a Sui node can only run on the latest release.

full.yml:

authority-store-pruning-config:
  num-latest-epoch-dbs-to-retain: 3
  epoch-db-pruning-period-secs: 3600
  num-epochs-to-retain: 0
  max-checkpoints-in-batch: 10
  max-transactions-in-batch: 1000
  #use-range-deletion: true
  pruning-run-delay-seconds: 60
  num-epochs-to-retain-for-checkpoints: 2
  periodic-compaction-threshold-days: 1
  smooth: true

AndyCYB commented 3 days ago

Is there any progress on this? I encountered the same problem. Is there a solution?

zhy827827 commented 2 days ago

Is the TPS performance improved? https://suiscan.xyz/mainnet/analytics/cps

mwtian commented 1 day ago

For folks having memory growth issues, can you follow https://gist.github.com/mwtian/0f473325a1ad5a74982fcf91737653b4 and upload the heap profile (and metrics if there are interesting findings)? cc @AndyCYB @zhy827827
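
Alongside a heap profile, a coarse view of memory growth can be captured by sampling the node's resident set size over time. The following is a minimal sketch and is not taken from the linked gist: it assumes a Linux host, reads the standard `VmRSS` field from `/proc/<pid>/status`, and the helper names (`parse_vmrss_kib`, `read_rss_gib`) are hypothetical. The PID of the node process has to be supplied by the operator.

```python
import re
import sys
import time


def parse_vmrss_kib(status_text):
    """Extract VmRSS (in KiB) from the contents of /proc/<pid>/status.

    Returns None if the field is absent (for example, for kernel threads).
    """
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def read_rss_gib(pid):
    """Read the current resident set size of `pid` in GiB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        kib = parse_vmrss_kib(f.read())
    return None if kib is None else kib / (1024 ** 2)


if __name__ == "__main__":
    # Usage (hypothetical script name): python rss_sample.py <pid> [interval_seconds]
    pid = int(sys.argv[1])
    interval = float(sys.argv[2]) if len(sys.argv) > 2 else 60.0
    while True:
        rss = read_rss_gib(pid)
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} rss_gib={rss}")
        time.sleep(interval)
```

Redirecting the output to a file gives a simple time series that can be attached to the issue next to the profile and metrics.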

zhy827827 commented 1 day ago

I have collected the data, but I don't know whether it is useful: sui-oom.txt, sui-monitored.txt

mwtian commented 1 day ago

Thanks a lot @zhy827827. Is it possible to take the memory profile as well?

zhy827827 commented 1 day ago

I am still learning how to capture the memory profile; I don't know how to do it yet.

mwtian commented 1 day ago

And to confirm, is your fullnode running in Asia?

zhy827827 commented 1 day ago

yes!

mwtian commented 1 day ago

Interesting. We saw another instance of memory growth from fullnodes running in Asia as well.

zhy827827 commented 1 day ago

Yes, we have two servers, one with 128GB of RAM and one with 64GB of RAM. The server with 64GB of RAM has not been able to run at all recently because it keeps getting OOM-killed.