runatyr1 opened 1 year ago
I will check with my team if we can increase open files by 10x (to 2 million). Cronos team, please help verify; the open file count seems too large and is increasing over time. If you need more data, let me know.
@yihuang thanks for writing; we do have db_backend = "rocksdb" in config.toml.
I'm not sure how to check db compaction behaviour; if you give me instructions, I can check.
Something particularly interesting is the height going higher than needed, maybe parallel processing, I'm not sure. By the way, I also tried disabling fast_sync in config.toml on one node, but it didn't help.
Your team had shared the code below on Discord, which evidently triggers the height error when the requested height is higher than the last block; the question is why.
lastBlockHeight := app.LastBlockHeight()
if height > lastBlockHeight {
	return sdk.Context{},
		sdkerrors.Wrap(
			sdkerrors.ErrInvalidHeight,
			"cannot query with height in the future; please provide a valid height",
		)
}
> Something particularly interesting is the height going higher than needed, maybe parallel processing I'm not sure.

The error happens in the RPC handler, right? Where do the requests come from? And do you know where the client gets the higher height from?
> db compaction behaviour

The db backend triggers compaction by itself; you can observe it through I/O load, or check the LOG file in the data folder, i.e. app/data/application.db/LOG. When the db is doing compaction it logs to that file, so you can see when compactions happen and how much time they take.
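As a concrete way to check this, a sketch (the LOG path is the one mentioned above; the exact wording of compaction entries depends on the rocksdb version, so the pattern is a loose match):

```shell
# RocksDB appends compaction events to the LOG file inside the database
# directory. Filtering it shows when compactions ran and how long they
# took. 2>/dev/null keeps the command quiet if the path differs on your
# node; the pipeline still exits 0 because tail's status is what counts.
grep -i "compact" app/data/application.db/LOG 2>/dev/null | tail -n 20
```

Watching the tail of this file while the node falls out of sync would show whether the stalls line up with long compactions.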
Cronos may open around 20K ~ 30K files for the rocksdb backend's file caching, so there might be several issues mixed together in this thread.
The large number of open files might be due to a p2p layer issue; have you seen a lot of peer connections requested and then dropped?
One more thing: do you have the snapshot feature enabled on your node? Creating a new snapshot might also stall your node's syncing with the network.
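For reference, on cosmos-sdk based nodes the snapshot feature is configured in app.toml; setting the interval to 0 disables snapshot creation entirely, which would rule it out as a cause of the stalls. A sketch of the relevant section (values here are illustrative):

```toml
# app.toml: state-sync snapshot settings on a cosmos-sdk based node.
[state-sync]
# 0 disables snapshot creation; a non-zero value takes a snapshot every
# N blocks, which does extra disk work at that height
snapshot-interval = 0
# how many recent snapshots to keep (irrelevant when interval is 0)
snapshot-keep-recent = 2
```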
OK, I'm going to need help understanding what I need to do. I'm going to need a little time and support if I am to help; much of what you are saying seems very important but has eluded me. Can you be a bit more specific about what I need to do, what programs I need to use, what (if anything) needs to be replaced or expanded upon, where I can make these changes, and any other special instructions I will need to make this happen? I will do my best, but I'm playing catch-up while simultaneously being hacked by malicious persons with ill intent. I look forward to hearing from you regarding this important matter.
Sincerely,
Xhristian
Semper Fidelis
Mustardseed XIK
On Tue, Oct 25, 2022, 1:13 AM Pietro @.***> wrote:
Describe the bug
- Cronos nodes fall out of sync
- We received alerts on different days between 14:20 - 17:30 UTC
- Then it synced again without any changes, so it looks related to load, or to something else that happens at that time of day
- Workload was low during the issue, plenty of hardware resources left
- The same issue happened on multiple nodes at the same time. The error was:
  cronosd[211409]: 1:11PM INF Served eth_call conn=165.227.220.115:44408 duration=291.995272 err="rpc error: code = Unknown desc = cannot query with height in the future; please provide a valid height: invalid height" module=geth reqid=1
To Reproduce
- Not sure how to reproduce it; adding a little load seems to do it, but adding more nodes helped as a workaround. So at the moment it is not reproducible on our end, but we are wasting a lot of resources, since the nodes were already at 5-10% usage (CPU/mem) and we had to add more.
Expected behavior
- Node stays synced and it doesn't throw this error
Additional context
- Last time seen at v0.8.0
- Current version: the binary says v0.8.0, but I think we are on v0.8.1 (I downloaded the v0.8.1 binary once and it still reported v0.8.0 when I ran the version command)
- We don't think v0.8.1 or v0.8.2 fixes this, since I don't see any mention of the error cannot query with height in the future in those releases
- We observed this problem on nodes in archive mode; we don't have non-archive nodes to compare against
Things we have tried
- We made contact with Anthea on the Cronos discord and he suggested checking the system time, but the clocks are fine:
  Local time: Tue 25 Oct 2022 07:57:56 AM UTC (24h 07:57)
  Node times (ran ssh host date on our nodes):
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
  Tue Oct 25 07:57:57 UTC 2022
- Next suggestion is increasing the open file limit; looking at that now, it seems there might be odd behavior with open files:
  ulimit -n 2000000
  ulimit -a
— View this issue on GitHub: https://github.com/crypto-org-chain/cronos/issues/747
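One note on the `ulimit -n 2000000` suggestion: an interactive `ulimit` only changes the calling shell and its children. If cronosd runs under systemd (an assumption here, as is the unit name "cronosd"), the limit has to be raised in the unit instead; a sketch:

```shell
# Raising the open-file limit for a systemd-managed daemon. An interactive
# `ulimit -n` does not affect an already-running service, so the unit
# needs the override (the unit name "cronosd" is an assumption):
#
#   sudo systemctl edit cronosd      # add the two lines below
#     [Service]
#     LimitNOFILE=2000000
#   sudo systemctl daemon-reload
#   sudo systemctl restart cronosd
#
# Afterwards, confirm what the current shell itself reports:
ulimit -n
```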
Also, try using a different backend db.
We are using a 200,000 open file limit (ulimit -n), not 2,000,000.
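To see the limit that actually applies to the running daemon (rather than to a login shell), the process's own limits file can be read. A sketch using the current shell's PID; on a node you would substitute the cronosd PID, e.g. via pgrep:

```shell
# The per-process limit is what rocksdb actually runs into, and a systemd
# unit's LimitNOFILE can differ from the interactive `ulimit -n`. On a
# node, substitute the cronosd PID (e.g. "$(pgrep -x cronosd)") for $$.
grep "Max open files" "/proc/$$/limits"
```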
Also, we observed a possible correlation between uptime and the number of open files, as if there is some problem where cronos is not closing unused files and they grow over time:
Running sudo lsof | wc -l and uptime on the same servers:

| Server | `sudo lsof \| wc -l` | `uptime` |
| --- | --- | --- |
| CRONOS-ARCHIVE-1 | 683235 | 08:17:13 up 49 days, 16:04, 1 user, load average: 0.48, 0.37, 0.31 |
| CRONOS-ARCHIVE-2 | 361592 | 08:17:14 up 7 days, 20:42, 0 users, load average: 0.27, 0.29, 0.28 |
| CRONOS-ARCHIVE-3 | 14695 | 08:17:16 up 7 days, 21:41, 0 users, load average: 1.13, 1.30, 1.19 |
| CRONOS-ARCHIVE-4 | 367258 | 08:17:18 up 7 days, 20:14, 0 users, load average: 0.12, 0.15, 0.16 |
| CRONOS-TESTNET-1 | 135280 | 08:17:23 up 96 days, 15:55, 0 users, load average: 0.19, 0.15, 0.11 |
| CRONOS-TESTNET-2 | 134014 | 08:17:24 up 96 days, 15:00, 0 users, load average: 0.11, 0.12, 0.13 |
| CRONOS-TESTNET-3 | 135285 | 08:17:26 up 96 days, 13:27, 0 users, load average: 0.13, 0.24, 0.21 |
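One caveat on the numbers above: `sudo lsof | wc -l` counts rows for every process on the host, and can report duplicate rows per task, so it overstates what cronosd itself holds open. Counting the daemon's fd entries directly gives the figure that is actually compared against the limit. A sketch using the current shell's PID (substitute the cronosd PID on a node):

```shell
# Count the file descriptors a single process actually has open; this is
# the number checked against its "Max open files" limit. Replace $$ with
# the cronosd PID (e.g. "$(pgrep -x cronosd)") when run on a node.
ls "/proc/$$/fd" | wc -l
```

Sampling this per-process count over time, alongside uptime, would separate a real fd leak in cronosd from host-wide lsof noise.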