NVIDIA / TorchFort

An Online Deep Learning Interface for HPC programs on NVIDIA GPUs
https://nvidia.github.io/TorchFort/

Unable to build docker image with dockerfile #27

Open tuanmp opened 2 weeks ago

tuanmp commented 2 weeks ago

I am struggling to build a Docker image with the Dockerfile provided. The issue appears to be that the base image is for arm64 while the HPC pack is for x86. Could you test again to make sure the image can be built from the Dockerfile?

azrael417 commented 2 weeks ago

Hello Tuan, thanks for reaching out. The base image is actually multi-arch: Docker pulls the variant that matches the architecture of the machine you are building on. What is your setup for building the image? Are you trying to build it on an Arm platform (for example Grace Hopper, or a Mac with an M-series CPU) and then run it on an x86 platform?

In that case, you can in principle pull the image for the other arch using docker pull --platform=<arch>; however, I would recommend building the image directly on a machine with the target arch.
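As an illustration of the docker pull --platform=<arch> suggestion, here is a minimal Python sketch that maps a machine name (as reported by uname -m) to the platform string Docker expects and assembles the pull command. The helper names and the BASE_IMAGE value are hypothetical placeholders, not taken from the TorchFort Dockerfile; the sketch only prints the command and does not run Docker.

```python
import platform

# Common machine names (as reported by `uname -m`) mapped to Docker platform strings.
_MACHINE_TO_DOCKER = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
}


def docker_platform(machine=None):
    """Return the Docker platform string for a machine name (default: this host)."""
    name = (machine or platform.machine()).lower()
    if name not in _MACHINE_TO_DOCKER:
        raise ValueError(f"unrecognized machine type: {name}")
    return _MACHINE_TO_DOCKER[name]


def pull_command(image, target_machine):
    """Build the `docker pull` argv that requests `image` for `target_machine`."""
    return ["docker", "pull", f"--platform={docker_platform(target_machine)}", image]


# Placeholder image name (assumption); substitute the base image from the Dockerfile.
BASE_IMAGE = "example.com/some/base-image:tag"

# Print the command that would fetch the x86 variant, even when run on an Arm host.
print(" ".join(pull_command(BASE_IMAGE, "x86_64")))
```

Pulling the x86 variant this way only fetches the image; the build steps in a Dockerfile would still execute on the host architecture unless emulation or cross-compilation is set up.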

tuanmp commented 2 weeks ago

Thanks for the reply. This is exactly what I’m doing. I saw that Docker supports cross-platform builds. Have you checked whether the Dockerfile can be built cross-platform?

azrael417 commented 2 weeks ago

Hello Tuan,

I had a look at cross-compilation and it is not trivial. Check out https://docs.docker.com/build/building/multi-platform/#cross-compiling-a-go-application if you are interested. It seems the way to get it working is to create a build container for the host arch and then invoke cross-compilation for each target. I am not sure this is so simple; I have had very bad experiences with cross-compilation in the past.

What should work, though, is building the image on the target arch directly, pushing it to a registry (for example Docker Hub), and pulling it where you need it. Also, some systems support SquashFS images (for example systems using Pyxis for container launches). In that case you could build your image on an x86 system, use enroot to dump it into an sqsh file (see https://github.com/NVIDIA/enroot), and then rsync the sqsh file over.

Lastly, you can also build TorchFort natively on the system if it does not support container builds. Building TorchFort should be rather smooth once you have built all the dependencies, such as PyTorch.

Let me know if you have any questions.

Best,
Thorsten
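The two container-based workarounds described above (build on the target arch and go through a registry, or convert the image to an sqsh file and copy it over) can be sketched as command sequences. The following Python sketch only assembles and prints the commands. The image name, Dockerfile path, sqsh file name, and rsync destination are hypothetical placeholders, and the enroot import invocation with the dockerd:// scheme reflects one reading of the enroot CLI, so it should be checked against the enroot documentation before use.

```python
import shlex


def registry_workflow(image, dockerfile, context="."):
    """Commands for: build on a target-arch machine, push, then pull elsewhere."""
    return [
        ["docker", "build", "-t", image, "-f", dockerfile, context],  # on the target-arch machine
        ["docker", "push", image],                                    # on the same machine
        ["docker", "pull", image],                                    # on the machine that runs it
    ]


def squashfs_workflow(image, sqsh_file, destination):
    """Commands for: convert a locally built image to an sqsh file and copy it over."""
    return [
        # Assumption: enroot can import from the local Docker daemon via dockerd://
        ["enroot", "import", "-o", sqsh_file, f"dockerd://{image}"],
        ["rsync", "-av", sqsh_file, destination],
    ]


# All values below are placeholders (assumptions), not taken from the TorchFort repository.
IMAGE = "registry.example.com/user/torchfort:dev"
DOCKERFILE = "docker/Dockerfile"

for cmd in registry_workflow(IMAGE, DOCKERFILE):
    print(shlex.join(cmd))

for cmd in squashfs_workflow(IMAGE, "torchfort.sqsh", "user@cluster:/path/to/images/"):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to pass them to subprocess.run once the placeholders have been replaced with real values.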