eclipse / omr

Eclipse OMR™: cross-platform components for building reliable, high-performance language runtimes
http://www.eclipse.org/omr

Intermittent RISC-V CI failures #6905

Open janvrany opened 1 year ago

janvrany commented 1 year ago

For some time, we have been experiencing intermittent failures of the RISC-V CI cross-compiling job, see for example #6706 or #6704.
This issue is created to track progress on stabilising the RISC-V CI cross-compiling job.

janvrany commented 1 year ago

FYI: @AdamBrousseau

janvrany commented 1 year ago

I have built Debian 10 (buster, which is what the CI is running as far as I can tell) and Debian 11 (bullseye) images as similar to the CI build node as I could and ran a few tests there:

  1. When I run the image as a container (using systemd-nspawn), a number of tests fail rather wildly with segfaults, aborts and other failures (omrrastest, omrthreadtest, porttest), irrespective of which QEMU version is used (tried 7.2.0, 7.0.0, 6.0.0 and 5.0.0, all built from source on the image). The setup is roughly as in the sketch after this list.

  2. When I run the image as a real VM (using KVM as the hypervisor), only PortSignalExtendedTests.sig_ext_test1 fails. Again, tried with QEMU 7.2.0, 7.0.0, 6.0.0 and 5.0.0.

  3. The TRIL tests take a huge amount of memory (>8GB), causing the system to swap a lot and eventually time out. This has been observed on the Eclipse CI as well.
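
For reference, the two setups look roughly like this; this is only a sketch, and the container directory is a placeholder, while the sysroot and test paths mirror the CI invocation quoted later in this issue:

```sh
# Case 1: boot the Debian image as a container (directory path is a placeholder)
sudo systemd-nspawn -D /srv/deb10-riscv-ci --boot

# Inside the container or VM, an individual test binary is run under
# user-mode emulation, mirroring the CI invocation
qemu-riscv64 -L /home/jenkins/riscv-debian/rootfs \
    /tmp/omr/build/fvtest/threadtest/omrthreadtest \
    "--gtest_filter=-PriorityInterrupt.*:RWMutex*"
```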

@AdamBrousseau Is the CI node (deb10-x64-1) a full VM? In any case, what I observed above is not consistent with what can be seen on the CI node...

AdamBrousseau commented 1 year ago
Linux deb10-x64-1 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64

Yes, the Debian machine is a VM running on KVM.

janvrany commented 1 year ago

#6913 has been merged but it did not help much. Now it hangs in SanityTest, see for example:

When this happened, the build node was not swapping and CPU usage was low; the QEMU process simply hung.

I'm running out of ideas. It is hard for me to reproduce; essentially I see this kind of failure only on the Eclipse CI. QEMU does not implement the "extended-remote" protocol, so one cannot debug multi-threaded programs running under user-mode emulation with QEMU. I can try to see where in QEMU it hangs, but I'm not sure how useful that would be. If anyone has an idea how to approach this, I'm all ears.
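
For context, the only route I know for poking at this with a debugger is QEMU's plain user-mode gdbstub; a minimal sketch (the port number is arbitrary, paths as used elsewhere in this issue):

```sh
# Terminal 1: run the test under qemu-riscv64 and wait for a debugger on port 1234
qemu-riscv64 -g 1234 -L /home/jenkins/riscv-debian/rootfs \
    /tmp/omr/build/fvtest/threadtest/omrthreadtest

# Terminal 2: attach a multi-arch gdb over the plain remote protocol
gdb-multiarch /tmp/omr/build/fvtest/threadtest/omrthreadtest \
    -ex 'set sysroot /home/jenkins/riscv-debian/rootfs' \
    -ex 'target remote :1234'
```

Since this is plain `target remote` rather than `extended-remote`, it runs into exactly the multi-threaded debugging limitation described above.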

janvrany commented 1 year ago

For the record, I ran only the thread test as follows:

```sh
QEMU=...
for i in `seq 1 100`; do
   $QEMU "-L" "/home/jenkins/riscv-debian/rootfs" "/tmp/omr/build/fvtest/threadtest/omrthreadtest" "--gtest_output=xml:/tmp/omr/build/fvtest/threadtest/omrthreadtest-results.xml" "--gtest_filter=-PriorityInterrupt.*:RWMutex*" | perl -pe 'use POSIX strftime; print strftime "[%Y-%m-%d %H:%M:%S] ", localtime'
done
```

Also: I tried on the Eclipse CI node with the bit-identical static QEMU binary copied from my (working) deb10 image, to no avail; it still hangs.

janvrany commented 1 year ago

Another observation: when I replaced the currently used sysroot on the CI node with a "fresh" sysroot, the hangs became a lot less frequent (about 1 in ~25 runs compared to 1 in ~3), but it still hangs.

I also tried running the tests on my freshly built deb10 image with the sysroot copied from the CI node; it didn't hang once in 100 runs.

I also noticed that the uptime of the CI node (deb10-x64-1) is > 300 days. @AdamBrousseau: how much hassle is it to reboot the whole deb10-x64-1 VM? I know it should not matter, but I have no better ideas and am just trying different things.

Anyway, I'll try to update the sysroot on the CI node to a "fresh" one (with newer versions of libraries, most notably glibc), as it might reduce the hangs (though not fix them).
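
For the record, a "fresh" sysroot can be produced with a two-stage debootstrap run through qemu-user-static; this is just a sketch of that approach, with the suite, mirror and target directory being assumptions rather than the exact settings used for the CI sysroot:

```sh
# Stage 1 on the x86_64 host: unpack a riscv64 Debian tree
sudo debootstrap --arch=riscv64 --foreign sid /opt/riscv-sysroot http://deb.debian.org/debian

# Stage 2 inside the tree, executed through qemu-user-static binfmt emulation
sudo cp /usr/bin/qemu-riscv64-static /opt/riscv-sysroot/usr/bin/
sudo chroot /opt/riscv-sysroot /debootstrap/debootstrap --second-stage
```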

AdamBrousseau commented 1 year ago

Rebooted. Let me know how it goes.

janvrany commented 1 year ago

@AdamBrousseau: Unfortunately, the reboot did not help, but thanks anyway!

I'm going to update qemu-riscv64 to 7.2.0 (from 5.0.0) and the sysroot on deb10-x64-1, as it seems that with this combination hangs are less likely.
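
For reference, the static user-mode emulator can be built from the upstream release tarball roughly as follows (the install prefix is an assumption):

```sh
# Build a static qemu-riscv64 user-mode emulator from the 7.2.0 release
wget https://download.qemu.org/qemu-7.2.0.tar.xz
tar xf qemu-7.2.0.tar.xz && cd qemu-7.2.0
./configure --target-list=riscv64-linux-user --static --prefix=/opt/qemu-7.2.0
make -j"$(nproc)" && sudo make install
```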

janvrany commented 1 year ago

> I'm going to update qemu-riscv64 to 7.2.0 (from 5.0.0) and sysroot on deb10-x64-1

I did that, but the test build hung just like before. Maybe it was just bad luck, since when I tried to build and run the tests manually it was far more stable. Anyway, I'm going to keep the new versions there for some time (I left backups on the node, so reverting is a matter of pointing the symlinks back).
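
(The symlink arrangement is roughly the following; the paths are only an illustration of the idea, not the actual layout on the node:)

```sh
# The jobs resolve QEMU and the sysroot through symlinks, so reverting to the
# backed-up versions is just a matter of repointing them (illustrative paths)
ln -sfn /opt/qemu-7.2.0         /opt/qemu-current
ln -sfn /opt/riscv-sysroot-new  /opt/riscv-sysroot-current
```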

janvrany commented 1 year ago

With the new sysroot, PortSignalExtendedTests is failing; here's a fix: #6938

janvrany commented 9 months ago

Just adding a reference to PR #6912, as it might help with this. Maybe.