runatyr1 opened 1 year ago
I will check with my team if we can increase open files by 10x (to 2 million). Cronos team, please help verify; the open file count seems too large and is increasing over time. If you need more data, let me know.
@yihuang thanks for writing; we do have db_backend = "rocksdb" in config.toml.
I'm not sure how to check db compaction behaviour; if you give me instructions, I can check.
Something particularly interesting is the height going higher than needed, maybe parallel processing, I'm not sure. By the way, I also tried disabling fast_sync in config.toml on one node, but it didn't help.
Your team had shared the code below on Discord, which evidently triggers the height error when the requested height is higher than the last block; the question is why.
lastBlockHeight := app.LastBlockHeight()
if height > lastBlockHeight {
	return sdk.Context{},
		sdkerrors.Wrap(
			sdkerrors.ErrInvalidHeight,
			"cannot query with height in the future; please provide a valid height",
		)
}
> Something particularly interesting is the height going higher than needed, maybe parallel processing I'm not sure.

The error happens in the RPC handler, right? Where do the requests come from? And do you know where the client gets the higher height from?
> db compaction behaviour

The db backend triggers compaction by itself; you can observe it through I/O load, or check the LOG file in the data folder, i.e. app/data/application.db/LOG. When the db is doing compaction it logs to that file, so you can see when compactions happen and how much time they take.
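As a concrete way to check this, a sketch (the LOG path is the one mentioned above; the exact wording of compaction entries depends on the rocksdb version, so the pattern is a loose match):

```shell
# RocksDB appends compaction events to the LOG file inside the database
# directory. Filtering it shows when compactions ran and how long they
# took. 2>/dev/null keeps the command quiet if the path differs on your
# node; the pipeline still exits 0 because tail's status is what counts.
grep -i "compact" app/data/application.db/LOG 2>/dev/null | tail -n 20
```

Watching the tail of this file while the node falls out of sync would show whether the stalls line up with long compactions.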
Cronos may open around 20K ~ 30K files for the rocksdb backend's file caching, so there might be several issues mixed together in this thread.
The large number of open files might be due to a p2p layer issue; have you seen a lot of peer connections requested and then dropped?
One more thing: do you have the snapshot feature enabled on your node? Creating a new snapshot might also stall your node's syncing with the network.
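For reference, on cosmos-sdk based nodes the snapshot feature is configured in app.toml; setting the interval to 0 disables snapshot creation entirely, which would rule it out as a cause of the stalls. A sketch of the relevant section (values here are illustrative):

```toml
# app.toml: state-sync snapshot settings on a cosmos-sdk based node.
[state-sync]
# 0 disables snapshot creation; a non-zero value takes a snapshot every
# N blocks, which does extra disk work at that height
snapshot-interval = 0
# how many recent snapshots to keep (irrelevant when interval is 0)
snapshot-keep-recent = 2
```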
OK, I'm going to need help understanding what I need to do. I'm going to need a little time and support if I am to help; much of what you are saying seems very important but has eluded me. Can you be a bit more specific about what I need to do, what programs I need to use, what (if anything) needs to be replaced or expanded upon, where I can make these changes, and any other special instructions I will need to make this happen? I will do my best, but I'm playing catch-up while simultaneously being hacked by malicious persons with ill intent. I look forward to hearing from you regarding this important matter.
Sincerely,
Xhristian
Semper Fidelis
Mustardseed XIK
On Tue, Oct 25, 2022, 1:13 AM Pietro @.***> wrote:
Describe the bug
- Cronos nodes fall out of sync
- We received alerts on different days between 14:20 - 17:30 UTC
- Then it synced again without any changes, so it looks related to load, or to something else that happens at that time of day
- Workload was low during the issue, plenty of hardware resources left
- The same issue happened on multiple nodes at the same time. The error was:
  cronosd[211409]: 1:11PM INF Served eth_call conn=165.227.220.115:44408 duration=291.995272 err="rpc error: code = Unknown desc = cannot query with height in the future; please provide a valid height: invalid height" module=geth reqid=1
To Reproduce
- Not sure how to reproduce it; adding a little load seems to do it, but adding more nodes helped as a workaround. So at the moment it is not reproducible on our end, but we are wasting a lot of resources, since the nodes were already at 5-10% usage (CPU/mem) and we had to add more.
Expected behavior
- Node stays synced and it doesn't throw this error
Additional context
- Last time seen at v0.8.0
- Current version: the binary says v0.8.0, but I think we are on v0.8.1 (I downloaded the v0.8.1 binary once and it still reported v0.8.0 when I ran the version command)
- We don't think v0.8.1 or v0.8.2 fixes this, since I don't see any mention of the error cannot query with height in the future in those releases
- We observed this problem on nodes in archive mode; we don't have non-archive nodes to compare against
Things we have tried
- We made contact with Anthea on the Cronos discord and he suggested checking the system time, but the clocks are fine:
  Local time: Tue 25 Oct 2022 07:57:56 AM UTC (24h 07:57)
  Node times (ran ssh host date on our nodes):
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
- Next suggestion is increasing the open file limit; looking at that now, it seems there might be odd behavior with open files:
  ulimit -n 2000000
  ulimit -a
— View this issue on GitHub: https://github.com/crypto-org-chain/cronos/issues/747
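One note on the `ulimit -n 2000000` suggestion: an interactive `ulimit` only changes the calling shell and its children. If cronosd runs under systemd (an assumption here, as is the unit name "cronosd"), the limit has to be raised in the unit instead; a sketch:

```shell
# Raising the open-file limit for a systemd-managed daemon. An interactive
# `ulimit -n` does not affect an already-running service, so the unit
# needs the override (the unit name "cronosd" is an assumption):
#
#   sudo systemctl edit cronosd      # add the two lines below
#     [Service]
#     LimitNOFILE=2000000
#   sudo systemctl daemon-reload
#   sudo systemctl restart cronosd
#
# Afterwards, confirm what the current shell itself reports:
ulimit -n
```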
Also, try using a different backend db.
We are using a 200,000 open file limit (ulimit -n), not 2,000,000.
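To see the limit that actually applies to the running daemon (rather than to a login shell), the process's own limits file can be read. A sketch using the current shell's PID; on a node you would substitute the cronosd PID, e.g. via pgrep:

```shell
# The per-process limit is what rocksdb actually runs into, and a systemd
# unit's LimitNOFILE can differ from the interactive `ulimit -n`. On a
# node, substitute the cronosd PID (e.g. "$(pgrep -x cronosd)") for $$.
grep "Max open files" "/proc/$$/limits"
```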
Also, we observed a possible correlation between uptime and the number of open files, as if there is some problem where cronos is not closing unused files and they grow over time:
Running sudo lsof | wc -l and uptime on the same servers:

| Server | `sudo lsof \| wc -l` | `uptime` |
| --- | --- | --- |
| CRONOS-ARCHIVE-1 | 683235 | 08:17:13 up 49 days, 16:04, 1 user, load average: 0.48, 0.37, 0.31 |
| CRONOS-ARCHIVE-2 | 361592 | 08:17:14 up 7 days, 20:42, 0 users, load average: 0.27, 0.29, 0.28 |
| CRONOS-ARCHIVE-3 | 14695 | 08:17:16 up 7 days, 21:41, 0 users, load average: 1.13, 1.30, 1.19 |
| CRONOS-ARCHIVE-4 | 367258 | 08:17:18 up 7 days, 20:14, 0 users, load average: 0.12, 0.15, 0.16 |
| CRONOS-TESTNET-1 | 135280 | 08:17:23 up 96 days, 15:55, 0 users, load average: 0.19, 0.15, 0.11 |
| CRONOS-TESTNET-2 | 134014 | 08:17:24 up 96 days, 15:00, 0 users, load average: 0.11, 0.12, 0.13 |
| CRONOS-TESTNET-3 | 135285 | 08:17:26 up 96 days, 13:27, 0 users, load average: 0.13, 0.24, 0.21 |
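One caveat on the numbers above: `sudo lsof | wc -l` counts rows for every process on the host, and can report duplicate rows per task, so it overstates what cronosd itself holds open. Counting the daemon's fd entries directly gives the figure that is actually compared against the limit. A sketch using the current shell's PID (substitute the cronosd PID on a node):

```shell
# Count the file descriptors a single process actually has open; this is
# the number checked against its "Max open files" limit. Replace $$ with
# the cronosd PID (e.g. "$(pgrep -x cronosd)") when run on a node.
ls "/proc/$$/fd" | wc -l
```

Sampling this per-process count over time, alongside uptime, would separate a real fd leak in cronosd from host-wide lsof noise.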