For now, the Docker images and associated Dockerfiles have been produced as a PoC and no attempt was made to reduce their sizes, which are pretty large (~4GB for the compressed size on DockerHub, ~9GB once the image is run). The CUDA install, for example, uses a lot of storage (~4GB); I guess it can be reduced by copying only the needed library files...
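For instance, assuming the geos binary sits at a path like /opt/geos/bin/geos (hypothetical), ldd can list the CUDA libraries it actually links against, which would be the only ones worth copying into the final image:

```bash
# List the CUDA shared libraries actually required by the geos binary
# (binary path is hypothetical, adapt to the real install location).
ldd /opt/geos/bin/geos | grep -i cuda

# Compare with the size of the full CUDA install to see what could be dropped.
du -sh /usr/local/cuda/lib64
du -sh /usr/local/cuda/lib64/lib*.so* | sort -h | tail -n 20
```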
The pangea 3 job is a draft too, and a little time (~5-10 mins) could be saved by creating a specific image dedicated to the uraimo job (we need a Linux OS running on the ppc64le arch with Docker available). For now, Docker is installed by the job on the very light Ubuntu image provided by uraimo.
qemu layer: slowdown by a factor of about 14

The slowdown linked to the call of the qemu-user-static emulation layer is evaluated by comparing the compilation time of the finiteElement library of the Geos repository on 4 cores with 32 GB of memory:

- native amd64 build: 2m11.743s (user 4m22.582s)
- emulated ppc64le build: 27m58.993s (user 73m23.212s)
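For reference, a minimal sketch of how the emulation layer is enabled on an amd64 host; the multiarch/qemu-user-static helper image is an assumption (any image that registers the qemu binfmt handlers does the same job):

```bash
# Register the qemu-user-static binfmt handlers on the amd64 host
# (the multiarch/qemu-user-static helper image is one common way to do it).
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Any ppc64le image then runs transparently, at the cost of the ~14x slowdown
# measured above.
docker run --rm --platform linux/ppc64le ppc64le/ubuntu uname -m   # prints: ppc64le
```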
uraimo/run-on-arch-action: slowdown by a factor of about 15 (no particular degradation compared to the bare qemu layer)

The slowdown linked to the call of the run-on-arch-action has been evaluated on an external code without any dependencies (for the sake of simplicity, as it removes the need to construct a suitable docker image for the target architecture). Test results are available here:

- native build: 1m20 (user 1m13)
- emulated build: 20m32.581s (user 20m3.864s)
Even if we succeed in building the TPLs in a suitable time, as the GEOS Cuda build is a lot slower (~73m on 4 cores in Debug mode and ~100m in Release mode), it will not be possible to use the emulation layer as is.
Nevertheless, we can list some improvement paths for the current PR and the TPL build (see the sketch after this list):
- use ccache or sccache to speed up the build (https://github.com/uraimo/run-on-arch-action/issues/4);
- disable the TPLs that will not be used by the pangea3 GEOS job (Trilinos, for example).
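For the ccache idea, a possible sketch; the cache location/size and the way it is passed to the TPL CMake configure step are assumptions (and ccache's support of nvcc depends on the ccache version):

```bash
# Keep the compilation cache outside the container so it survives between CI runs
# (location and size are assumptions).
export CCACHE_DIR=/ccache
export CCACHE_MAXSIZE=5G

# Ask CMake to prefix every compiler invocation with ccache, in addition to the
# usual TPL configure options (source directory is a placeholder).
cmake -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
      /path/to/thirdPartyLibs
```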
From a more global perspective (GEOS project), following @sframba's suggestions:
- we may attempt to cross-compile the TPLs and GEOS for the target arch and to run the unit tests using a self-hosted runner with GPUs and the emulation layer (the tests take about 2 or 3 minutes in Release mode). Note that it may be a little tricky to deploy (I am not an expert in cross compilation, so I don't understand how we will ensure having the suitable versions of the various needed shared libs);
- we may try to get a ppc64le host to run our tests. It seems that it is not natively provided by GitHub (see the related doc), but some workarounds exist (see GitHub: Self-hosted runners on ppc64le architectures, or the list of self-hosted GitHub Action runners).
@sframba @TotoGaz: you can read the PR comments if you are interested in feedback on this work that you initiated with @XL64.
Hello @Algiane, thank you for your comments.
The timing issue is surely something to keep in mind, but before getting to this, I'd like to get a little more information about the process.
- With qemu, are you able to compile a ppc executable that runs on P3?
- A CUDA program: can you compile it and run it on P3?

Hi @TotoGaz,
On pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built.
I don't know how to monitor that it really uses the GPUs, but running the same test on the P3_USERS_CPU queue fails with the "no CUDA-capable device" error.
The geos TPLs and the geos binary have been built:
- on an amd64 host;
- using the qemu-user-static docker image for the emulation layer;
- in the 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:3 ppc64le docker image.

For now, the test of the executable on P3 is tweaked (see the sketch below). I:
- copied the TPL install directory, the lvarray shared library created by the geos build and the geos binary to P3;
- linked the python3.6 library toward the python3.8 one (I didn't take care of the python version on my docker image... of course it is not the same as on P3);
- set the LD_LIBRARY_PATH;
- loaded the gcc, cuda, ompi and openblas modules.

Please let me know if you need more tests.
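For reference, the tweak roughly amounts to the sketch below (the container name, paths, library names and module versions are placeholders, not the exact ones I used):

```bash
# 1. Copy the TPL install directory, the LvArray shared library and the geos
#    binary from the build container to P3 (names/paths are placeholders).
docker cp geos-ppc64le-build:/opt/GEOS/TPLs ./TPLs
docker cp geos-ppc64le-build:/opt/GEOS/lib/liblvarray.so .
docker cp geos-ppc64le-build:/opt/GEOS/bin/geos .
scp -r TPLs liblvarray.so geos user@pangea3:geos-test/

# 2. On P3: point the python3.6 library expected by the binary to the
#    python3.8 one installed on the cluster (paths are placeholders).
mkdir -p "$HOME/geos-test/lib"
ln -s /usr/lib64/libpython3.8.so.1.0 "$HOME/geos-test/lib/libpython3.6m.so.1.0"

# 3. Make the copied TPLs, LvArray and the python symlink visible at run time.
export LD_LIBRARY_PATH="$HOME/geos-test/TPLs/lib:$HOME/geos-test/lib:$LD_LIBRARY_PATH"

# 4. Load the toolchain modules matching the docker image (versions omitted).
module load gcc cuda ompi openblas
```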
Best
> On pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built. I don't know how to monitor that it really uses the GPUs but running the same test on the P3_USERS_CPU queue fails with the "no CUDA-capable device" error.
For that specific purpose, you can run geos with the --trace-data-migration ("Trace host-device data migration") command line option. You'll be able to see data moving from and to the device.
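For example, on the smoke test mentioned above (the -i input flag is the usual geos option; adapt paths to your setup):

```bash
# Re-run the smoke test with host-device data migration tracing enabled;
# transfers to/from the GPU are then reported in the output.
./geos -i acous3D_abc_smoke.xml --trace-data-migration
```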
@Algiane Is it fair to state that now the issue is really a timing issue? That if we had a very very powerful machine, that would work OK?
Cross compiling is something that can be very challenging. Furthermore, cross compiling the TPLs means cross compiling ~20 libs with their sometimes clunky build systems. And you add CUDA on top of that. I do not know how to manage that; it would require a lot of dedication, to say the least.
Thanks for the --trace-data-migration tip: it confirms that some LvArrays are moved to/from the GPUs.

For me, with this method we have 2 issues:
- the compilation time;
- the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit and I think that the base image (the image with the copy of the pangea modules that are needed to build the TPLs but without the TPLs built) is not very far from it. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home to avoid the "no space left on device" error.

For now, as the emulation seems to be a dead end but we still don't have a solution to test the P3 configuration, I will leave this PR as a draft and try to see if we can connect a ppc64 runner to github-actions as a self-hosted runner: it can be an alternative if we can buy a small ppc64 machine.
Best
> - the compilation time;

We have a powerful self-hosted machine. Do you think that could do it?

> - the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit and I think that the base image (the image with the copy of the pangea modules that are needed to build the TPLs but without the TPLs built) is not very far from it. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home to avoid the "no space left on device" error.

I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4GB (still very big, but half). Do you know what gets it so big? We're using the multi-stage approach a lot to remove the temporaries. Are you doing the same?
Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?
> - the compilation time;
> We have a powerful self-hosted machine. Do you think that could do it?

Maybe: it depends on the time needed to build the TPLs and Geos on this machine. We can multiply these times by 15 to get an order of magnitude of the time needed with the emulation layer.

> - the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit and I think that the base image (the image with the copy of the pangea modules that are needed to build the TPLs but without the TPLs built) is not very far from it. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home to avoid the "no space left on device" error.
> I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4GB (still very big, but half). Do you know what gets it so big? We're using the multi-stage approach a lot to remove the temporaries. Are you doing the same?
> Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?

I have about the same size for the image on DockerHub, but it uses compression. Once pulled, for example, the pecan-gpu image is about 10.8 GB and I quickly get stuck with no space left.
It is less annoying than the time issue (as it is possible to work in an external volume).
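To see which layers dominate, docker itself can report per-layer sizes directly on the build host (image name and tag taken from this thread):

```bash
# Per-layer sizes and the instruction that created each layer.
docker history --no-trunc \
  --format '{{.Size}}\t{{.CreatedBy}}' \
  7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:3

# Overall image/container/volume disk usage on the build host.
docker system df -v
```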
@sframba: I have tested the connection of a ppc64le self-hosted runner to github-actions using a non-official runner (https://github.com/ChristopherHX/github-act-runner). It worked smoothly for a simple script execution.
New job that:
- sets up qemu (through the qemu-user-static image);
- deploys an AlmaLinux-8 image on which the TPLs' dependencies are installed with respect to the pangea3 modules needed to build the TPLs. The Dockerfile used to build this image is provided in docker/TotalEnergies/Pangea3-base.Dockerfile and the image is available on my DockerHub account under the pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18 name with tag 4: 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4;
- adds the docker/TotalEnergies/Pangea3.Dockerfile file that builds the ppc64le docker image with the TPLs built and installed for geos (see the sketch below);
- adds a RUNS_ON matrix variable to the job matrix to allow the use of different runners (it is needed to run on a self-hosted runner more powerful than the default github runners due to the slowdown introduced by the emulation layer);
- modifies the docker_build_and_push.sh script and renames it docker_build.sh;
- fixes the image pushes from streak2: it solves errors when pushing images (access denied) due to a race condition between jobs (if 2 jobs run at the same time on the machine, one job may remove the login credentials between the moment the first job logs in to docker and the moment it attempts to push the image);
- handles the docker push command separately from the build.

Linked to EPIC TTE Builds and Geos PR 3159.
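For reference, in terms of plain docker commands (the workflow actually goes through docker_build.sh), the chain roughly looks like this; the tag of the final TPL image is a hypothetical placeholder and the --platform flag is an assumption about how the ppc64le build is driven:

```bash
# Build the base image (pangea3 modules + TPL dependencies, no TPLs built yet).
docker build -f docker/TotalEnergies/Pangea3-base.Dockerfile \
  -t 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4 .

# Build, on top of it, the ppc64le image with the TPLs compiled and installed
# (this is the part that runs through the qemu emulation layer on an amd64 runner).
TPL_IMAGE=my-dockerhub-account/pangea3-tpls:test   # hypothetical name/tag
docker build --platform linux/ppc64le \
  -f docker/TotalEnergies/Pangea3.Dockerfile -t "$TPL_IMAGE" .

# The push is an explicit, separate step in the workflow
# (docker_build.sh only builds).
docker login
docker push "$TPL_IMAGE"
```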