For now, the Docker images and associated Dockerfiles have been produced as a PoC and no attempt was made to reduce their sizes, which are pretty large (~4GB for the compressed size on DockerHub, ~9GB once the image is run). The CUDA install, for example, uses a lot of storage (~4GB); I guess it can be reduced by copying only the needed library files...
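For instance, assuming the geos binary sits at a path like /opt/geos/bin/geos (hypothetical), ldd can list the CUDA libraries it actually links against, which would be the only ones worth copying into the final image:

```bash
# List the CUDA shared libraries actually required by the geos binary
# (binary path is hypothetical, adapt to the real install location).
ldd /opt/geos/bin/geos | grep -i cuda

# Compare with the size of the full CUDA install to see what could be dropped.
du -sh /usr/local/cuda/lib64
du -sh /usr/local/cuda/lib64/lib*.so* | sort -h | tail -n 20
```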
The pangea 3 job is a draft too, and a little time (~5-10 mins) could be saved by creating a specific image dedicated to the uraimo job (we need a Linux OS running on the ppc64le arch with Docker available). For now, Docker is installed by the job on the very light Ubuntu image provided by uraimo.
qemu layer: slowdown by a factor of about 14

The slowdown linked to the call of the qemu-user-static emulation layer is evaluated by comparing the compilation time of the finiteElement library of the Geos repository on 4 cores with 32 GB of memory:

- native amd64 build: 2m11.743s (user 4m22.582s)
- emulated ppc64le build: 27m58.993s (user 73m23.212s)
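For reference, a minimal sketch of how the emulation layer is enabled on an amd64 host; the multiarch/qemu-user-static helper image is an assumption (any image that registers the qemu binfmt handlers does the same job):

```bash
# Register the qemu-user-static binfmt handlers on the amd64 host
# (the multiarch/qemu-user-static helper image is one common way to do it).
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Any ppc64le image then runs transparently, at the cost of the ~14x slowdown
# measured above.
docker run --rm --platform linux/ppc64le ppc64le/ubuntu uname -m   # prints: ppc64le
```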
uraimo/run-on-arch-action: slowdown by a factor of about 15 (no particular degradation compared to the bare qemu layer)

The slowdown linked to the call of the run-on-arch-action has been evaluated on an external code without any dependencies (for the sake of simplicity, as it removes the need to construct a suitable docker image for the target architecture). Test results are available here:

- native build: 1m20 (user 1m13)
- emulated build: 20m32.581s (user 20m3.864s)
Even if we succeed in building the TPLs in a suitable time, as the GEOS Cuda build is a lot slower (~73m on 4 cores in Debug mode and ~100m in Release mode), it will not be possible to use the emulation layer as is.
Nevertheless, we can list some improvement paths for the current PR and the TPL build (see the sketch after this list):
- use ccache or sccache to speed up the build (https://github.com/uraimo/run-on-arch-action/issues/4);
- disable the TPLs that will not be used by the pangea3 GEOS job (Trilinos, for example).
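For the ccache idea, a possible sketch; the cache location/size and the way it is passed to the TPL CMake configure step are assumptions (and ccache's support of nvcc depends on the ccache version):

```bash
# Keep the compilation cache outside the container so it survives between CI runs
# (location and size are assumptions).
export CCACHE_DIR=/ccache
export CCACHE_MAXSIZE=5G

# Ask CMake to prefix every compiler invocation with ccache, in addition to the
# usual TPL configure options (source directory is a placeholder).
cmake -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
      /path/to/thirdPartyLibs
```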
From a more global perspective (GEOS project), following @sframba's suggestions:
- we may attempt to cross-compile the TPLs and GEOS for the target arch and to run the unit tests using a self-hosted runner with GPUs and the emulation layer (the tests take about 2 or 3 minutes in Release mode). Note that it may be a little tricky to deploy (I am not an expert in cross compilation, so I don't understand how we will ensure having the suitable versions of the various needed shared libs);
- we may try to get a ppc64le host to run our tests. It seems that it is not natively provided by GitHub (see the related doc), but some workarounds exist (see GitHub: Self-hosted runners on ppc64le architectures, or the list of self-hosted GitHub Action runners).
@sframba @TotoGaz: you can read the PR comments if you are interested in feedback on this work that you initiated with @XL64.
Hello @Algiane, thank you for your comments.
The timing issue is surely something to keep in mind, but before getting to this, I'd like to get a little more information about the process.
- With qemu, are you able to compile a ppc executable that runs on P3?
- A CUDA program: can you compile it and run it on P3?

Hi @TotoGaz,
On pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built.
I don't know how to monitor that it really uses the GPUs, but running the same test on the P3_USERS_CPU queue fails with the "no CUDA-capable device" error.
The geos TPLs and the geos binary have been built:
- on an amd64 host;
- using the qemu-user-static docker image for the emulation layer;
- in the 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:3 ppc64le docker image.

For now, the test of the executable on P3 is tweaked (see the sketch below). I:
- copied the TPL install directory, the lvarray shared library created by the geos build and the geos binary to P3;
- linked the python3.6 library toward the python3.8 one (I didn't take care of the python version on my docker image... of course it is not the same as on P3);
- set the LD_LIBRARY_PATH;
- loaded the gcc, cuda, ompi and openblas modules.

Please let me know if you need more tests.
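For reference, the tweak roughly amounts to the sketch below (the container name, paths, library names and module versions are placeholders, not the exact ones I used):

```bash
# 1. Copy the TPL install directory, the LvArray shared library and the geos
#    binary from the build container to P3 (names/paths are placeholders).
docker cp geos-ppc64le-build:/opt/GEOS/TPLs ./TPLs
docker cp geos-ppc64le-build:/opt/GEOS/lib/liblvarray.so .
docker cp geos-ppc64le-build:/opt/GEOS/bin/geos .
scp -r TPLs liblvarray.so geos user@pangea3:geos-test/

# 2. On P3: point the python3.6 library expected by the binary to the
#    python3.8 one installed on the cluster (paths are placeholders).
mkdir -p "$HOME/geos-test/lib"
ln -s /usr/lib64/libpython3.8.so.1.0 "$HOME/geos-test/lib/libpython3.6m.so.1.0"

# 3. Make the copied TPLs, LvArray and the python symlink visible at run time.
export LD_LIBRARY_PATH="$HOME/geos-test/TPLs/lib:$HOME/geos-test/lib:$LD_LIBRARY_PATH"

# 4. Load the toolchain modules matching the docker image (versions omitted).
module load gcc cuda ompi openblas
```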
Best
> On pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built. I don't know how to monitor that it really uses the GPUs but running the same test on the P3_USERS_CPU queue fails with the "no CUDA-capable device" error.
For that specific purpose, you can run geos with the --trace-data-migration ("Trace host-device data migration") command line option. You'll be able to see data moving from and to the device.
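For example, on the smoke test mentioned above (the -i input flag is the usual geos option; adapt paths to your setup):

```bash
# Re-run the smoke test with host-device data migration tracing enabled;
# transfers to/from the GPU are then reported in the output.
./geos -i acous3D_abc_smoke.xml --trace-data-migration
```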
@Algiane Is it fair to state that now the issue is really a timing issue? That if we had a very very powerful machine, that would work OK?
Cross compiling is something that can be very challenging. Furthermore, cross compiling the TPLs means cross compiling ~20 libs with their sometimes clunky build systems. And you add CUDA on top of that. I do not know how to manage that; it would require a lot of dedication, to say the least.
Thanks for the --trace-data-migration tip: it confirms that some LvArrays are moved to/from the GPUs.

For me, with this method we have 2 issues:
- the compilation time;
- the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit and I think that the base image (the image with the copy of the pangea modules that are needed to build the TPLs but without the TPLs built) is not very far from it. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home to avoid the "no space left on device" error.

For now, as the emulation seems to be a dead end but we still don't have a solution to test the P3 configuration, I will leave this PR as a draft and try to see if we can connect a ppc64 runner to github-actions as a self-hosted runner: it can be an alternative if we can buy a small ppc64 machine.
Best
> - the compilation time;

We have a powerful self-hosted machine. Do you think that could do it?

> - the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit and I think that the base image (the image with the copy of the pangea modules that are needed to build the TPLs but without the TPLs built) is not very far from it. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home to avoid the "no space left on device" error.

I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4GB (still very big, but half). Do you know what gets it so big? We're using the multi-stage approach a lot to remove the temporaries. Are you doing the same?
Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?
> - the compilation time;
> We have a powerful self-hosted machine. Do you think that could do it?

Maybe: it depends on the time needed to build the TPLs and Geos on this machine. We can multiply these times by 15 to get an order of magnitude of the time needed with the emulation layer.

> - the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit and I think that the base image (the image with the copy of the pangea modules that are needed to build the TPLs but without the TPLs built) is not very far from it. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home to avoid the "no space left on device" error.
> I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4GB (still very big, but half). Do you know what gets it so big? We're using the multi-stage approach a lot to remove the temporaries. Are you doing the same?
> Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?

I have about the same size for the image on DockerHub, but it uses compression. Once pulled, for example, the pecan-gpu image is about 10.8 GB and I quickly get stuck with no space left.
It is less annoying than the time issue (as it is possible to work in an external volume).
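To see which layers dominate, docker itself can report per-layer sizes directly on the build host (image name and tag taken from this thread):

```bash
# Per-layer sizes and the instruction that created each layer.
docker history --no-trunc \
  --format '{{.Size}}\t{{.CreatedBy}}' \
  7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:3

# Overall image/container/volume disk usage on the build host.
docker system df -v
```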
@sframba: I have tested the connection of a ppc64le self-hosted runner to github-actions using a non-official runner (https://github.com/ChristopherHX/github-act-runner). It worked smoothly for a simple script execution.
New job that:
- sets up qemu (through the qemu-user-static image);
- deploys an AlmaLinux-8 image on which the TPLs' dependencies are installed with respect to the pangea3 modules needed to build the TPLs. The Dockerfile used to build this image is provided in docker/TotalEnergies/Pangea3-base.Dockerfile and the image is available on my DockerHub account under the pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18 name with tag 4: 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4;
- adds the docker/TotalEnergies/Pangea3.Dockerfile file that builds the ppc64le docker image with the TPLs built and installed for geos (see the sketch below);
- adds a RUNS_ON matrix variable to the job matrix to allow the use of different runners (it is needed to run on a self-hosted runner more powerful than the default github runners due to the slowdown introduced by the emulation layer);
- modifies the docker_build_and_push.sh script and renames it docker_build.sh;
- fixes the image pushes from streak2: it solves errors when pushing images (access denied) due to a race condition between jobs (if 2 jobs run at the same time on the machine, one job may remove the login credentials between the moment the first job logs in to docker and the moment it attempts to push the image);
- handles the docker push command separately from the build.

Linked to EPIC TTE Builds and Geos PR 3159.
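For reference, in terms of plain docker commands (the workflow actually goes through docker_build.sh), the chain roughly looks like this; the tag of the final TPL image is a hypothetical placeholder and the --platform flag is an assumption about how the ppc64le build is driven:

```bash
# Build the base image (pangea3 modules + TPL dependencies, no TPLs built yet).
docker build -f docker/TotalEnergies/Pangea3-base.Dockerfile \
  -t 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4 .

# Build, on top of it, the ppc64le image with the TPLs compiled and installed
# (this is the part that runs through the qemu emulation layer on an amd64 runner).
TPL_IMAGE=my-dockerhub-account/pangea3-tpls:test   # hypothetical name/tag
docker build --platform linux/ppc64le \
  -f docker/TotalEnergies/Pangea3.Dockerfile -t "$TPL_IMAGE" .

# The push is an explicit, separate step in the workflow
# (docker_build.sh only builds).
docker login
docker push "$TPL_IMAGE"
```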