Open janvrany opened 1 year ago
FYI: @AdamBrousseau
I have built Debian 10 (buster, which is what CI is running as far as I can tell) and Debian 11 (bullseye) images, as similar to the CI build node as I could, and ran a few tests there:
When I run the image as a container (using `systemd-nspawn`), a number of tests fail rather wildly (segfaults, aborts, failures) - `omrrastest`, `omrthreadtest`, `porttest` - irrespective of what QEMU version is used (tried 7.2.0, 7.0.0, 6.0.0 and 5.0.0, all built from source on the image).
When I run the image as a real VM (using KVM as hypervisor), only `PortSignalExtendedTests.sig_ext_test1` is failing. Again tried with QEMU 7.2.0, 7.0.0, 6.0.0 and 5.0.0.
TRIL tests take a huge amount of memory (>8GB), causing the system to swap a lot and eventually causing timeouts. This has been observed on Eclipse CI as well.
@AdamBrousseau Is the CI node (`deb10-x64-1`) a full VM?
In any case, what I observed above is not consistent with what can be seen on the CI node...
Linux deb10-x64-1 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64
Yes, the debian machine is a vm running on kvm.
`SanityTest`, see for example:
When this happened, the build node was not swapping and CPU usage was low - the qemu process simply hung.
I'm running out of ideas. It's hard for me to reproduce - essentially I see this kind of failure only on Eclipse CI. QEMU does not implement the "extended remote" protocol, so one cannot debug multi-threaded programs running under user-mode emulation with QEMU. I can try to see where in QEMU it hangs, but I'm not sure how useful this would be. If anyone has an idea how to approach this, I'm all ears.
For the record, I run only `threadtest` as follows:
```
QEMU=...
# Run omrthreadtest 100 times under QEMU user-mode emulation,
# prefixing every output line with a timestamp.
for i in `seq 1 100`; do
$QEMU "-L" "/home/jenkins/riscv-debian/rootfs" "/tmp/omr/build/fvtest/threadtest/omrthreadtest" "--gtest_output=xml:/tmp/omr/build/fvtest/threadtest/omrthreadtest-results.xml" "--gtest_filter=-PriorityInterrupt.*:RWMutex*" | perl -pe 'use POSIX strftime; print strftime "[%Y-%m-%d %H:%M:%S] ", localtime';
done
```
and:
I tried on the Eclipse CI node with the bit-identical static QEMU binary copied from my (working) deb10 image, to no avail: it still hangs.
Another observation: when I replaced the currently used sysroot on the CI node with a "fresh" sysroot, hangs are a lot less frequent (1 in ~25 compared to 1 in ~3), but it still hangs.
I also tried running tests on my freshly built deb10 image with the sysroot copied from the CI node - it didn't hang once in 100 runs.
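For anyone who wants to put numbers on the hang rate, here is a minimal sketch of how such runs could be counted. It assumes GNU coreutils `timeout` is available. `count_hangs` and its arguments are hypothetical names, not part of the CI scripts, and the time limit has to be chosen well above the duration of a normal run.

```shell
#!/bin/sh
# count_hangs RUNS SECONDS CMD [ARGS...]
# Runs CMD RUNS times, terminating it after SECONDS, and prints how many
# runs hit the time limit (GNU timeout exits with status 124 in that case).
# Hypothetical helper for illustration only.
count_hangs() {
    runs=$1; limit=$2; shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

It could be used as, for example, `count_hangs 100 600 $QEMU -L /home/jenkins/riscv-debian/rootfs /tmp/omr/build/fvtest/threadtest/omrthreadtest`, where the 600 second limit is an arbitrary guess.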
I also noticed that the uptime of the CI node (`deb10-x64-1`) is > 300 days.
@AdamBrousseau: how much hassle is it to reboot the whole `deb10-x64-1` VM? I know it should not matter, but I have no idea and am just trying different things.
Anyways, I'll try to update the sysroot on the CI node to a "fresh" one (with newer versions of libraries, most notably glibc) as it might reduce hangups (but not fix them).
Rebooted. Let me know how it goes.
@AdamBrousseau: Unfortunately, reboot did not help, but thanks anyway!
I'm going to update `qemu-riscv64` to 7.2.0 (from 5.0.0) and sysroot on `deb10-x64-1` as it seems that with this combination hangups are less likely.
> I'm going to update `qemu-riscv64` to 7.2.0 (from 5.0.0) and sysroot on `deb10-x64-1`

I did it but the test build hung just like before. Maybe just bad luck, but when I tried to build and run the tests manually it was far more stable. Anyways, I'm going to keep the new versions there for some time (I left backups on the node, so reverting is a matter of redirecting symlinks back).
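The revert mentioned above could look roughly like the following. This is only a sketch: the `switch_link` helper and the names in the usage note are hypothetical, since the actual layout of the backups on the node is not described here.

```shell
#!/bin/sh
# switch_link LINK TARGET
# Repoints the symlink LINK at TARGET. Refuses to touch LINK if it exists
# but is not a symlink (for example a real directory), so a backup cannot
# be clobbered by accident. Hypothetical helper for illustration only.
switch_link() {
    link=$1; target=$2
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing: $link is not a symlink" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow LINK when it already points to a directory
    ln -sfn "$target" "$link"
}
```

For example, `switch_link rootfs rootfs.old` would point a (hypothetical) `rootfs` link back at a backup directory named `rootfs.old`.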
With the new sysroot, `PortSignalExtendedTests` is failing, here's a fix: #6938
Just adding reference to PR #6912 as it might help with this. Maybe.
For some time, we have been experiencing intermittent failures with the RISC-V CI cross-compiling job, see for example #6706 or #6704.
This issue was created to track progress on stabilising the RISC-V CI cross-compiling job.