Open zhy827827 opened 1 month ago
Before the node was killed, is it possible to check whether any label on these Prometheus metrics has a large value:
monitored_futures
monitored_tasks
rocksdb_block_cache_usage
rocksdb_estimate_table_readers_mem
Please let us know total values and the largest labels on these metrics, and we can check if they indicate issues.
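For anyone without a Grafana setup, the check requested above can be done directly against the node's metrics endpoint. The following is only a minimal sketch: it assumes the metrics are served in the Prometheus text format at http://127.0.0.1:9184/metrics (adjust to the metrics address in your node config) and that the RocksDB metrics use the rocksdb_ prefix. The helper names fetch_metrics and summarize are made up for this example.

```python
from collections import defaultdict
from urllib.request import urlopen

# Metric names from the request above; pass a different tuple if yours differ.
METRICS = (
    "monitored_futures",
    "monitored_tasks",
    "rocksdb_block_cache_usage",
    "rocksdb_estimate_table_readers_mem",
)


def fetch_metrics(url="http://127.0.0.1:9184/metrics"):
    """Download the raw metrics text. The default URL is an assumption."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def summarize(text, metrics=METRICS, top=5):
    """Return {metric: (total, largest)} where largest is a list of
    (label string, value) pairs sorted by value, biggest first."""
    samples = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Sample with labels: name{k="v",...} value [timestamp]
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            # Sample without labels: name value [timestamp]
            name, _, tail = line.partition(" ")
            labels = ""
        if name not in metrics:
            continue
        try:
            value = float(tail.split()[0])
        except (IndexError, ValueError):
            continue  # malformed line, ignore it
        samples[name].append((labels, value))
    return {
        name: (
            sum(v for _, v in rows),
            sorted(rows, key=lambda r: r[1], reverse=True)[:top],
        )
        for name, rows in samples.items()
    }
```

Calling summarize(fetch_metrics()) shortly before the node is killed and posting the result would give the totals and the largest labels asked for. Metrics that the node does not export are simply absent from the result.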
Sorry, I don't use Prometheus metrics.
It should be straightforward to set up Grafana for sui-node. This post is from a while ago but could still be useful: https://forums.sui.io/t/monitoring-and-alerts-integration-for-your-node/15449
Alternatively, we can try using jemalloc and jeprof to profile the memory. On Ubuntu, you can:
1. Install jemalloc, jeprof and graphviz: sudo apt install libjemalloc-dev graphviz
2. Check that the library is present at /usr/lib/x86_64-linux-gnu/libjemalloc.so
3. Create a directory for the profiles, e.g. /opt/sui/jemalloc/. It must match the directory given in prof_prefix in the next step, which uses /opt/sui/jeprof/.
4. Run sui-node with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so MALLOC_CONF=prof:true,prof_prefix:/opt/sui/jeprof/jeprof.out,lg_prof_interval:34. This dumps a memory profile roughly every 2^34 bytes (16 GiB) of allocation.
5. When sui-node memory usage becomes too high, find the latest profile from ls -1t /opt/sui/jeprof/ | head -20
6. Render the profile with the sui-node binary that generated it: sudo jeprof --svg /opt/sui/bin/sui-node /opt/sui/jeprof/<file name chosen from above> > jeprof.svg

Posting the jeprof.svg would help us identify the issue.
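The last two steps (pick the newest profile, then render it) can also be scripted. This is a rough sketch rather than an official tool: the function names are invented for illustration, the paths are the ones quoted in this thread, and it assumes jeprof is on PATH and that the current user can read the profiles (the command above uses sudo).

```python
import subprocess
from pathlib import Path


def latest_profile(profile_dir):
    """Return the most recently modified file in profile_dir, or None."""
    files = [p for p in Path(profile_dir).iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def jeprof_command(binary, profile):
    """Build the jeprof invocation from the steps above as an argument list."""
    return ["jeprof", "--svg", str(binary), str(profile)]


def render_svg(binary, profile_dir, out_path="jeprof.svg"):
    """Render the newest profile in profile_dir to an SVG file."""
    profile = latest_profile(profile_dir)
    if profile is None:
        raise FileNotFoundError(f"no profiles found in {profile_dir}")
    # jeprof writes the SVG to stdout, so redirect it into out_path.
    with open(out_path, "wb") as out:
        subprocess.run(jeprof_command(binary, profile), stdout=out, check=True)
    return profile
```

For example, render_svg("/opt/sui/bin/sui-node", "/opt/sui/jeprof/") would write jeprof.svg in the current directory, using the same binary that produced the profile.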
Sui Node version: 1.28.4-fc0623927416
Hardware: CPU 12C, MEM 64G, Disk 4T
When I start the Sui node, the process is killed after 2 minutes and then restarts.
Logs:
How can I optimize it?
It's strange, because my other two servers are running very well, and their software and hardware configurations are the same.