erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0

An error on Github CI while run some erigon hive tests #11834

Closed lystopad closed 1 month ago

lystopad commented 2 months ago

System information

It happens with the latest master as well as with v2.60.4

OS & Version: Ubuntu 16GB RAM (Kernel Version: 6.5.0-1025-azure)

Commit hash: 68f41969f9165ae608a77b754b69116eec247b27

Erigon Command (with flags/config):

Consensus Layer:

Consensus Layer Command (with flags/config):

Chain/Network:

Quoting a message from the partners

Hi Erigon team. We have run into a strange error on GitHub CI while running some Erigon hive tests. After 30-60 minutes, the job fails with exit code 143 (aborted). Example of such a failure: xx-xx-xx-xx/job/29501828480
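For context on the number itself: under the usual shell convention, an exit code above 128 means the process was ended by signal `code - 128`, so 143 corresponds to signal 15 (SIGTERM). The sketch below only decodes the code, assuming the runner reports it using that convention. The helper name `describe_exit_code` is hypothetical, and the result does not say who sent the signal.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code (codes above 128 mean 'ended by signal code - 128')."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(143))  # terminated by SIGTERM
```

A SIGTERM suggests the job was asked to stop from outside rather than crashing on its own, which is consistent with there being no errors in the hive or Erigon output.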

It is hard to debug the issue because in most cases we cannot even get the GitHub Actions logs. Currently we know that:

  1. It never happens with Nethermind (even after 4.5 hours of running)
  2. Short suites (less than 30 minutes) work well
  3. The suite passes locally without any issues (I tested on Linux Mint with 8 vCPUs and 32 GB RAM)
  4. It happens with the latest master as well as with v2.60.4 (without --sync.parallel-state-flushing=false )
  5. There are no errors on the hive / Erigon side

Google says that it may be related to CPU or RAM usage (https://github.com/actions/runner-images/issues/6680). I assume RAM is OK (the GitHub runner has 16 GB RAM), so maybe the problem is related to CPU.

If so, is there a way to decrease CPU usage inside the Docker container? I know it is possible to do on the Docker side, but that is not easy with hive, so maybe there is a way to do it on the Erigon side? Also, any insights about how to debug the issue would be appreciated.
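One possible way to gather evidence when the job logs are lost is to print resource samples periodically from a background step, so the last lines before the abort show whether CPU load or memory was climbing. The following is a minimal, Linux-only sketch under that assumption. It reads the standard `/proc/loadavg` and `/proc/meminfo` files, and the function names (`parse_meminfo`, `sample`, `monitor`) are hypothetical, not part of hive or Erigon.

```python
import time


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field: integer value} (most fields are in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def sample() -> str:
    """Return one log line with load averages and available memory (Linux only)."""
    with open("/proc/loadavg") as f:
        load1, load5, load15 = f.read().split()[:3]
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    avail_mib = mem.get("MemAvailable", 0) // 1024
    stamp = time.strftime("%H:%M:%S")
    return f"{stamp} load={load1}/{load5}/{load15} mem_available={avail_mib}MiB"


def monitor(interval_s: float = 30.0, samples: int = 120) -> None:
    """Print a sample every interval_s seconds; flush so lines survive an abrupt kill."""
    for _ in range(samples):
        print(sample(), flush=True)
        time.sleep(interval_s)
```

Calling `monitor()` alongside the test suite would give a timeline to compare against the moment the job is aborted. It does not reduce CPU usage by itself.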

More details can be found in the internal messenger, in the "erigon3" channel.

lystopad commented 2 months ago

Update from the partner:

We have run the same workflows against our self-hosted runners:

So it looks like the issue is not related to the amount of RAM.

lystopad commented 2 months ago

One more update

Also, I compared the configurations:

Github Standard Runner

Kernel Version: 6.5.0-1025-azure
   Operating System: Ubuntu ***.04.4 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 4
   Total Memory: 15.61GiB

Self-hosted runner

Kernel Version: 6.8.0-41-generic
   Operating System: Ubuntu 24.04.1 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 4
   Total Memory: 7.755GiB

Apart from the amount of memory, the main difference is the kernel version. Maybe the problem is with the GitHub runner's Ubuntu.
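The comparison above can be reproduced mechanically. This small sketch diffs the two specifications exactly as quoted in this comment. The helper `diff_specs` is a hypothetical name, and the masked `***` in the GitHub runner's OS string is kept as reported, so the "Operating System" entry may differ partly because of the masking.

```python
github_runner = {
    "Kernel Version": "6.5.0-1025-azure",
    "Operating System": "Ubuntu ***.04.4 LTS",  # masked in the original output
    "OSType": "linux",
    "Architecture": "x86_64",
    "CPUs": "4",
    "Total Memory": "15.61GiB",
}

self_hosted = {
    "Kernel Version": "6.8.0-41-generic",
    "Operating System": "Ubuntu 24.04.1 LTS",
    "OSType": "linux",
    "Architecture": "x86_64",
    "CPUs": "4",
    "Total Memory": "7.755GiB",
}


def diff_specs(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for key, (gh, own) in diff_specs(github_runner, self_hosted).items():
    print(f"{key}: {gh} != {own}")
```

The CPU count, architecture, and OS type match, so the remaining candidates are the kernel, the OS release, and the memory size.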

lystopad commented 2 months ago

One more update:

Update regarding the Erigon issue with workflow cancellation: changing the Ubuntu version did not help. Tested on: ubuntu-latest (22.04), ubuntu-24.04, ubuntu-20.04

Test passed on a self-hosted runner:

   Kernel Version: 6.8.0-41-generic
   Operating System: Ubuntu 24.04.1 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 4
   Total Memory: 7.755GiB

somnathb1 commented 2 months ago

It seems to me that the issue is not related to performance or resources. I tried running one of the failing workflows, namely dashboard_erigon_withdrawals.yml, and it runs fine for me: https://github.com/somnathb1/hive/actions/runs/10728609938/job/29753487168 I have also tried running several instances of the hive tests in parallel on my local machine, with low overall resource usage.

somnathb1 commented 1 month ago

The issue appeared only intermittently, on some GitHub runners. The hive failures on the main branch aren't related to CI and are tracked in a separate issue. Closing for now.