Closed dipietrantonio closed 8 months ago
@dipietrantonio, would it be possible to separate this into three different pull requests? One for rocm-mpich-base.dockerfile, one for PyTorch, and a third for TensorFlow? That way, a fix for one will not block the others (if the others are ready).
For example, I think we are in a perfect position to pull the rocm one. Indeed, I think this is sort of "urgent", as many other images (ours and user-made ones) can be built FROM this one. By the way, in that one, maybe we should stick to rocm/5.6 due to driver compatibility.
Regarding the PyTorch one, I did not read it carefully, but can that one and the TensorFlow one start FROM the rocm-mpich image in their Dockerfiles? I mean:
FROM rocm-mpich-base:3.4.3_ubuntu22.04
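As a rough sketch of what that layering could look like (not a tested recipe): the file below is a hypothetical pytorch.dockerfile. It assumes the base image does not already provide pip, and that the ROCm 5.6 wheels from the PyTorch wheel index match the ROCm version in the base image.

```dockerfile
# Hypothetical pytorch.dockerfile layered on the base image.
# The tag is the one suggested above; adjust it once the naming is settled.
FROM rocm-mpich-base:3.4.3_ubuntu22.04

# Assumption: the base image does not already ship Python and pip.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Assumption: the ROCm 5.6 wheels from the PyTorch index suit the ROCm version in the base.
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/rocm5.6
```

A TensorFlow Dockerfile could start from the same FROM line, so a fix to the base image would not need to be repeated in each framework image.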
Cheers.
Hi @AlexisEspinosaGayosso ,
> would it be possible to separate this into three different pull requests?

Yes, I will do that; it makes more sense.

> For example, I think we are in a perfect position to pull the rocm one.

I am still working on that one, testing a few remaining things. I hope I can push it today to our Pawsey repository.
I will update this PR to be just about the rocm-mpich-base. I think the appropriate name should be ubuntu:22-rocm5.6.0-mpich3.4.3, as this is mainly a base Ubuntu image containing a few needed libraries. We should have a discussion about naming conventions.
@pelahi @AlexisEspinosaGayosso this PR is ready for review.
I install libfabric as a dependency of https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl , which is needed for optimal performance of the ROCm Communication Collectives Library (RCCL). So this is an AMD-specific build. I might get away with installing libfabric and aws-ofi-rccl on top of the mpich-lustre container. I will try.
I will prepare the equivalent Dockerfiles after experimenting with it some more.
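A minimal sketch of that second option, under several assumptions: BASE_IMAGE is a placeholder for the mpich-lustre image (its actual tag is not given here), ROCm and RCCL are assumed to live under /opt/rocm, and the distribution's libfabric-dev package is assumed to be recent enough. The configure options follow the aws-ofi-rccl README as I understand it; depending on the environment, it may also need --with-mpi pointing at the MPICH install.

```dockerfile
# Assumption: placeholder name for the mpich-lustre image; replace with the real tag.
ARG BASE_IMAGE=mpich-lustre:latest
FROM ${BASE_IMAGE}

# Build tools plus libfabric from the distribution (assumed sufficient here).
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       git ca-certificates build-essential autoconf automake libtool libfabric-dev \
    && rm -rf /var/lib/apt/lists/*

# Build the RCCL network plugin against libfabric and ROCm.
# Assumption: HIP and RCCL are both installed under /opt/rocm in the base image.
RUN git clone https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl /tmp/aws-ofi-rccl \
    && cd /tmp/aws-ofi-rccl \
    && ./autogen.sh \
    && ./configure --prefix=/usr/local --with-libfabric=/usr \
       --with-hip=/opt/rocm --with-rccl=/opt/rocm \
    && make -j"$(nproc)" \
    && make install \
    && rm -rf /tmp/aws-ofi-rccl

# Make the installed plugin visible to RCCL at run time.
ENV LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
```

If this builds cleanly on top of mpich-lustre, the AMD-specific part stays in one small layer instead of a separate base image.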