eclipse-omr / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

Add native RISC-V nodes to OMR CI testing #7530

Open 0xdaryl opened 1 week ago

0xdaryl commented 1 week ago

Eclipse now provides limited access to native RISC-V nodes, upon request. Details here [1].

We will have to evaluate whether these are suitable for native builds and testing, or only as test nodes.

[1] https://github.com/eclipse-cbi/cbi/wiki#whats-provided

@janvrany @AdamBrousseau @jdekonin FYI

AdamBrousseau commented 1 week ago

https://github.com/eclipse-cbi/jiro/wiki/Dedicated-build-agents

Riscv64 servers based on VisionFive2 SOC boards, with 8 GB RAM, 4 cores, 960 GB SSD Nvme storage. At this point, we estimate that each machine can host up to 4 containers with oversubscription.

Containers are delivered with latest (at the moment of container creation) Ubuntu https://hub.docker.com/r/riscv64/ubuntu/ and with the following tooling:

Temurin JDK 21 LTS https://adoptium.net/en-GB/temurin/releases/?arch=riscv64&version=21
Maven 3.9.9
Ant 1.10.5

Additional packages installed:

build-essential libboost-all-dev libssl-dev libgtk-3-dev libglu1-mesa-dev libgtk-3-dev

This will be mostly on a first-come-first-serve basis. If more projects need more compute time on riscv64, we will need to delegate projects to cloud services like Scaleway at some point.
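A rough sketch of what the quoted figures mean per container, given the memory concerns raised about the old builds. The host numbers (8 GB RAM, 4 cores, up to 4 containers) come from the wiki text above; the even split is only an assumption, since oversubscription means containers are not necessarily given fixed shares.

```python
# Back-of-the-envelope per-container share on one VisionFive2-based host.
# Host figures are taken from the quoted wiki text; the even split is an
# assumption made for illustration only.
HOST_RAM_GB = 8
HOST_CORES = 4
MAX_CONTAINERS = 4


def per_container_share(ram_gb, cores, containers):
    """Return (RAM in GB, cores) per container under an even split."""
    if containers <= 0:
        raise ValueError("containers must be positive")
    return ram_gb / containers, cores / containers


ram_share, core_share = per_container_share(HOST_RAM_GB, HOST_CORES, MAX_CONTAINERS)
print(f"{ram_share:.1f} GB RAM and {core_share:.1f} core(s) per container")
```

If all four slots are in use, a build would have roughly a quarter of the host to itself, which is worth keeping in mind when deciding between native builds and test-only use.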

Generally we use the openj9 playbooks to set up OMR machines because they install a superset of the tools needed for OMR builds. Do we have a minimal set of required tools documented somewhere? Perhaps we can craft our own container that Eclipse will host for us. Alternatively, we can install the necessary tooling in the container on the fly, but that will require a bit of build work to get going.

For the record, we have RISC-V builds that run in QEMU(?) on x64 Linux. They have been disabled since May 2024. I cannot recall why exactly. It looks like the 2 machines that can run those builds were offline for a while. When the builds were running, they took about an hour to complete. I have kicked off a build to see if it still works.

https://ci.eclipse.org/omr/job/Build-linux_riscv64_cross/494/

0xdaryl commented 1 week ago

> I cannot recall why exactly.

They were consuming a lot of memory and were falling over intermittently. I thought there was an issue created for this, but I can't find it at the moment. As I recall, the harness running the OMR compiler during testing leaks memory between compilations. While there are hacks to work around this, as we discussed somewhere, the preferred fix is to understand all the holes and plug them.
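One way to confirm a leak between compilations before hunting for the holes is to record the test process's resident memory after each compilation and look at the trend. The sketch below only does the trend part: it assumes a list of RSS samples in KB has already been collected (for example from `VmRSS` in `/proc/<pid>/status` on Linux), and the function name and sample values are hypothetical, not part of the OMR harness.

```python
def growth_per_compilation(rss_samples_kb):
    """Least-squares slope of RSS (KB) against compilation index.

    A slope close to zero suggests memory is being returned between
    compilations; a steady positive slope suggests it is being retained.
    """
    n = len(rss_samples_kb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(rss_samples_kb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(rss_samples_kb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Made-up samples for illustration: RSS after each of four compilations.
samples_kb = [100_000, 110_000, 120_000, 130_000]
print(f"approx. {growth_per_compilation(samples_kb):.0f} KB retained per compilation")
```

A fitted slope is less sensitive to one-off spikes than comparing only the first and last sample, which helps when allocator caching makes individual readings noisy.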

janvrany commented 1 week ago

These builds were disabled because they were unreliable. The exact cause was never found, but my suspicion is excessive memory consumption caused by some interaction between OMR's code cache and QEMU's TCG.

janvrany commented 1 week ago

Here's what I installed on my RISC-V machines in order to compile and test both OMR and OpenJ9:

https://github.com/janvrany/debian-for-toys/blob/master/common/mk-fs-hooks/customize50-dev-tools.sh

But the list above is not minimal. @jdekonin wrote a Dockerfile to build the cross-compilation environment that was used in the (now disabled) builds. Here's the relevant bit:

https://github.com/eclipse-omr/omr/blob/master/buildenv/docker/riscv64/debian11/Dockerfile#L81

Note that you also need riscv.h and riscv-opc.h:

https://github.com/janvrany/debian-for-toys/blob/master/common/mk-fs-hooks/customize50-dev-tools.sh#L37-L38
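A small pre-flight check along these lines could make a missing-header failure obvious before a build starts on a new node. This is only a sketch: the header names come from the note above, but the include directories searched are assumptions and would need to match wherever the container actually installs them.

```python
from pathlib import Path

# Header names are taken from the comment above.
REQUIRED_HEADERS = ("riscv.h", "riscv-opc.h")

# Hypothetical search locations; adjust to the real container layout.
ASSUMED_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def missing_headers(include_dirs, names=REQUIRED_HEADERS):
    """Return the header names not found directly under any of include_dirs."""
    dirs = [Path(d) for d in include_dirs]
    return [n for n in names if not any((d / n).is_file() for d in dirs)]


if __name__ == "__main__":
    missing = missing_headers(ASSUMED_INCLUDE_DIRS)
    if missing:
        print("missing headers: " + ", ".join(missing))
    else:
        print("all required headers found")
```

Running it as a first CI step would fail fast with a clear message instead of a compiler error deep in the build log.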

AdamBrousseau commented 1 week ago

Do we expect real hardware to be more reliable then since it won't be using qemu?

AdamBrousseau commented 1 week ago

https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/5254

janvrany commented 1 week ago

> Do we expect real hardware to be more reliable then since it won't be using qemu?

I would, but I never had problems with QEMU either (but the machine had/has 16 GB RAM).

Speaking of RISC-V hardware, I have had the exact same board for some time and never had that kind of problem (I was running jobs on it after the "old" job was disabled, just to get a feeling for whether things were still okay). I'm not running it now because it is consistently failing because of Python; see #7496 and #7279. Once that is resolved, I'll enable it again and can watch it more closely if that helps.