By the way, it seems to work with https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4268284.zip
With this snapshot I have measured up to 2334 open descriptors.
With https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4423468.zip
I have measured up to 2532 open descriptors.
But this does not seem to be the problem, because the fd count already drops to around 1600 before the error occurs:
2024-10-16T20:52:58Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
2024-10-16T20:52:58Z UpdateTip: height=4423469 hash=e41b0298011b8e76244eebe13db5bdd6e6c3a23bfe561fb8112114f98174698e date='2024-10-15T20:00:42Z' tx=34709360 log2_work=88.520461 progress=0.999525
*** buffer overflow detected ***: terminated
*** buffer overflow detected ***: terminated
ps axco pid,comm | grep defid | xargs -n 2 -- bash -c 'echo -n "$1 $2 ";lsof -p $1 2>/dev/null | wc -l' argv0
(Note that lsof | wc -l counts a lot of duplicated entries, since forked processes can share file handles, etc.)
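An alternative that avoids the duplicate entries from lsof is to count the fd symlinks under /proc directly (a minimal sketch, assuming Linux and a process named defid; not the exact command used above):

# Count actual open file descriptors per defid process via /proc/<pid>/fd
for pid in $(pgrep -x defid); do
  printf '%s %s\n' "$pid" "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
done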
Thanks for the detailed info @skutcher. We're looking at options to work around this.
For now, please push the kernel limits and ulimits higher. We've observed that anything 16k+ usually works well. We'll look at a more refined fix soon.
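For reference, a sketch of how the limits could be raised, both for the current shell and persistently (the 16384/65536 values are illustrative, following the 16k+ suggestion above; the defid.service unit name is an assumption):

# One-off for the current shell (the soft limit can only be raised up to the hard limit):
ulimit -Sn 16384

# Persistent per-user limits in /etc/security/limits.conf (or a file under limits.d/):
#   <user>  soft  nofile  16384
#   <user>  hard  nofile  65536

# If the node runs under systemd, a unit override works too (assumed unit name defid.service):
#   systemctl edit defid.service
#   [Service]
#   LimitNOFILE=65536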
fyi: my default system limits:
ulimit -Hn
524288
ulimit -Sn
1024
But as mentioned above, if I increase the soft limit I just get buffer overflows.
> But as mentioned above, if I increase the soft limit I just get buffer overflows.
I assume this is after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off, I'm afraid. Even if it doesn't error with something else, the integrity checks will fail, to ensure the node doesn't start with bad data.
This is expected behavior. If you have issues with clean data / a fresh snapshot, please let us know. We run our machines with 1024 * 1024 or 16k as the open-file limit, depending on the system.
We do intend to work around the rocksdb bug; meanwhile, the above should be a viable workaround. I've also expanded on this a bit more in https://github.com/DeFiCh/ain/issues/3101#issuecomment-2436546577
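For completeness, the kernel-level ceilings that cap how far ulimit -n can be raised can be checked and adjusted with standard Linux sysctls (a sketch; 1048576 is just the 1024 * 1024 figure mentioned above):

sysctl fs.nr_open    # per-process ceiling for the nofile hard limit
sysctl fs.file-max   # system-wide limit on total open files
# Raise the per-process ceiling to 1024 * 1024 if needed:
sysctl -w fs.nr_open=1048576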
> But as mentioned above, if I increase the soft limit I just get buffer overflows.
> I assume this is after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off, I'm afraid. Even if it doesn't error with something else, the integrity checks will fail, to ensure the node doesn't start with bad data.
I am pretty sure it was with a fresh snapshot of https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4423468.zip. I will check again to be 100% sure.
We were able to test with strict ulimits < 1024, and the default of 1024 should now work well, even though it's still recommended to have higher limits.
We believe this issue should be resolved with the latest master.
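For anyone wanting to verify the fix under a deliberately strict limit, the scenario can be reproduced in a sub-shell (a sketch; the 512 value is arbitrary, and any defid arguments are omitted):

# Start the node with a lowered soft fd limit to exercise the low-ulimit path:
bash -c 'ulimit -Sn 512; exec defid'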
Closing, as this is now resolved. Please feel free to reopen if you continue to have issues with >= 4.2.1.