crypto-org-chain / cronos

Cronos is the first Ethereum-compatible blockchain network built on Cosmos SDK technology. Cronos aims to massively scale the DeFi, GameFi, and overall Web3 user community by providing builders with the ability to instantly port apps and crypto assets from other chains while benefiting from low transaction fees, high throughput, and fast finality.

cannot query with height in the future; please provide a valid height #747

Open runatyr1 opened 1 year ago

runatyr1 commented 1 year ago

Describe the bug

To Reproduce

Expected behavior

Additional context

Things we have tried

We are using a 200,000 open file limit (ulimit -n), not 2,000,000.

We also observed a possible correlation between uptime and the number of open files, as if cronos is not closing unused open files and they keep growing over time:

Running sudo lsof | wc -l and uptime on the same servers:

    CRONOS-ARCHIVE-1: 683235 open files | 08:17:13 up 49 days, 16:04, 1 user, load average: 0.48, 0.37, 0.31
    CRONOS-ARCHIVE-2: 361592 open files | 08:17:14 up 7 days, 20:42, 0 users, load average: 0.27, 0.29, 0.28
    CRONOS-ARCHIVE-3: 14695 open files  | 08:17:16 up 7 days, 21:41, 0 users, load average: 1.13, 1.30, 1.19
    CRONOS-ARCHIVE-4: 367258 open files | 08:17:18 up 7 days, 20:14, 0 users, load average: 0.12, 0.15, 0.16
    CRONOS-TESTNET-1: 135280 open files | 08:17:23 up 96 days, 15:55, 0 users, load average: 0.19, 0.15, 0.11
    CRONOS-TESTNET-2: 134014 open files | 08:17:24 up 96 days, 15:00, 0 users, load average: 0.11, 0.12, 0.13
    CRONOS-TESTNET-3: 135285 open files | 08:17:26 up 96 days, 13:27, 0 users, load average: 0.13, 0.24, 0.21
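For what it's worth, a process-level count may be more telling than the system-wide lsof total (which includes every other process on the host). A minimal sketch, assuming the daemon binary is named cronosd:

    # Count file descriptors held by the cronosd process only.
    pid=$(pidof cronosd)
    ls /proc/"$pid"/fd | wc -l

    # Compare against the per-process limit actually in effect.
    grep "open files" /proc/"$pid"/limits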

yihuang commented 1 year ago
runatyr1 commented 1 year ago

I will check with my team if we can increase the open file limit by 10x (to 2 million). Cronos team, please help verify: the number of open files seems too large and keeps increasing over time. If you need more data, let me know.

runatyr1 commented 1 year ago

@yihuang thanks for writing. We do have db_backend = "rocksdb" in config.toml. I'm not sure how to check db compaction behaviour; if you give me instructions I can check. Something particularly interesting is the queried height going higher than the latest block, maybe due to parallel processing, I'm not sure. By the way, I also tried disabling fast_sync on one node in config.toml, but it didn't help.
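For reference, both settings can be confirmed straight from the config file; a sketch, assuming the default ~/.cronos home directory and a Tendermint 0.34-era config layout:

    # Show the storage backend and fast-sync settings currently in use.
    grep -E "^(db_backend|fast_sync)" ~/.cronos/config/config.toml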

Your team shared the code below on Discord, which evidently triggers the height error whenever the requested height is higher than the last block height; the question is why that happens.

    // Reject any query whose requested height is beyond the node's last committed block.
    lastBlockHeight := app.LastBlockHeight()
    if height > lastBlockHeight {
        return sdk.Context{},
            sdkerrors.Wrap(
                sdkerrors.ErrInvalidHeight,
                "cannot query with height in the future; please provide a valid height",
            )
    }
yihuang commented 1 year ago

Something particularly interesting is the height going higher than needed, maybe parallel processing I'm not sure.

The error happens in the RPC handler, right? Where do the requests come from, and do you know where the client gets the higher height from?
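For illustration, this is the kind of request that reproduces the error on a lagging node: an eth_call pinned to a block number the node has not committed yet. A minimal sketch, assuming the default EVM JSON-RPC port 8545; the target address and the block number 0x4c4b40 (5,000,000) are purely illustrative:

    # eth_call against an explicit block number; if the node's last committed
    # height is below 0x4c4b40, it answers with "cannot query with height in
    # the future; please provide a valid height".
    curl -s -X POST -H "Content-Type: application/json" \
      --data '{"jsonrpc":"2.0","id":1,"method":"eth_call","params":[{"to":"0x0000000000000000000000000000000000000000","data":"0x"},"0x4c4b40"]}' \
      http://localhost:8545

Given the check in the code above, one plausible scenario is a client (or a load balancer) resolving a block number against a more up-to-date node and then sending the pinned query to a node that is a few blocks behind.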

JayT106 commented 1 year ago

db compaction behaviour

The db backend triggers compaction by itself. You can observe it through I/O load or by checking the LOG file in the data folder, e.g. app/data/application.db/LOG. When the db is doing compaction it logs to that file, so you can see when compactions happen and how long they take.
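A rough way to spot compaction activity in that LOG file (a sketch; the exact wording of the entries depends on the RocksDB version, and the path may differ depending on your node's home directory):

    # Show recent compaction-related entries, with timestamps, from the application DB log.
    grep -i "compaction" app/data/application.db/LOG | tail -n 20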

Cronos may open around 20K-30K files for the rocksdb backend's file caching, so there might be several separate issues at play in this thread.

The large number of open files might be due to a p2p layer issue; have you seen a lot of peer connections being requested and then dropped?
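One way to get a feel for peer churn is to poll the Tendermint RPC repeatedly and watch whether the peer count stays stable (a sketch, assuming the default RPC port 26657 and that jq is installed):

    # Number of currently connected peers; run it every few seconds (e.g. under `watch`).
    # Large swings suggest peers are being requested and dropped frequently.
    curl -s http://localhost:26657/net_info | jq '.result.n_peers'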

One more thing: do you have the snapshot feature enabled on your node? Creating a new snapshot can also temporarily stall your node's syncing with the network.
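The snapshot settings live in app.toml; a quick check, as a sketch assuming the default ~/.cronos home directory:

    # snapshot-interval = 0 means the node does not produce state-sync snapshots;
    # a non-zero value means it periodically pauses to write one.
    grep -E "^snapshot-(interval|keep-recent)" ~/.cronos/config/app.toml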

MustardseedX commented 1 year ago

Ok, I am going to need help understanding what I need to do. I'm going to need a little bit of time and support if I am to help, as much of what you are saying seems very important but has eluded me. Can you be a bit more specific about what I need to do, what programs I need to use, what (if anything) needs to be replaced or expanded upon, where I can make these changes, and any other special instructions I will need to make this happen? I will do my best, but I'm playing catch-up while simultaneously being hacked by malicious persons with ill intent. I look forward to hearing from you regarding this important matter.

Sincerely

Xhristian

Semper Fidelis

Mustardseed XIK

On Tue, Oct 25, 2022, 1:13 AM Pietro @.***> wrote:

Describe the bug

  • Cronos nodes fall out of sync
  • We received alerts on different days between 14:20 - 17:30 UTC
  • Then it synced again without any changes, so it looks related to load or related to something else that happens at that time of the day
  • Workload was low during the issue, plenty of hardware resources left
  • The same issue happened on multiple nodes at the same time. The error was:

    cronosd[211409]: 1:11PM INF Served eth_call conn=165.227.220.115:44408 duration=291.995272 err="rpc error: code = Unknown desc = cannot query with height in the future; please provide a valid height: invalid height" module=geth reqid=1

To Reproduce

  • Not sure how to reproduce it; adding a little load seems to trigger it, but adding more nodes helped as a workaround. So at the moment it is not reproducible on our end, although we are wasting a lot of resources, since the existing nodes were already at only 5-10% usage (cpu/mem) and we still had to add more.

Expected behavior

  • Node stays synced and it doesn't throw this error

Additional context

  • Last time seen at v0.8.0
  • Current version: the binary reports v0.8.0, but I think we are on v0.8.1 (I downloaded the v0.8.1 binary once and it still reported v0.8.0 when I ran the version command); see the version-check sketch after this list
  • We don't think v0.8.1 or v0.8.2 fixes this, since we do not see the "cannot query with height in the future" error mentioned in any of the release notes
  • We observed this problem on nodes running in archive mode; we don't have non-archive nodes to compare against
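To pin down which binary is actually running, a sketch using the standard Cosmos SDK version command; comparing the checksum against the published release artifacts is optional:

    # Which cronosd is on PATH, what it reports, and its checksum.
    which cronosd
    cronosd version --long | head -n 5
    sha256sum "$(which cronosd)"   # compare with the checksums published for the release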

Things we have tried

  • We made contact with Anthea on the Cronos Discord and he suggested checking the system time, but the clocks are fine:

    Local time: Tue 25 Oct 2022 07:57:56 AM UTC (24h 07:57)
    Node times (from running ssh host date on our nodes):
    Tue Oct 25 07:57:57 UTC 2022
    Tue Oct 25 07:57:57 UTC 2022
    Tue Oct 25 07:57:57 UTC 2022
    Tue Oct 25 07:57:57 UTC 2022
    Tue Oct 25 07:57:57 UTC 2022
    Tue Oct 25 07:57:57 UTC 2022

  • The next suggestion is increasing the open file limit (ulimit -n 2000000, then ulimit -a to verify); looking at that now, it seems there might be odd behavior with open files.
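If cronosd runs as a systemd service, note that a ulimit set in an interactive shell does not apply to the service; a sketch of making the limit persistent (the unit name cronosd.service is an assumption about the setup):

    # In the service unit (or a drop-in), raise the file-descriptor limit:
    #
    #   [Service]
    #   LimitNOFILE=2000000
    #
    sudo systemctl daemon-reload
    sudo systemctl restart cronosd

    # Verify the limit the running process actually got:
    grep "open files" /proc/"$(pidof cronosd)"/limits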


leejw51crypto commented 1 year ago

also, try to use a different backend db,