PawseySC / pawsey-containers

A collection of Dockerfiles and Singularity deffiles for Pawsey-supported images

Adds the `rocm-mpich-base` container. #15

Closed: dipietrantonio closed this 8 months ago

dipietrantonio commented 1 year ago

I will prepare the equivalent Dockerfiles after experimenting with it some more.

AlexisEspinosaGayosso commented 1 year ago

@dipietrantonio, would it be possible to separate this as three different pull requests? One for the rocm-mpich-base.dockerfile, one for PyTorch, and a third for TensorFlow? That way, a fix for one will not block the others (if the others are ready).

For example, I think we are in perfect position of pulling the rocm one. Indeed, I think this is sort of "urgent", as many other images (ours and user-made) can be built FROM this one. By the way, in that one, maybe we should stick to rocm/5.6 due to driver compatibility.

Regarding the PyTorch one, I did not read it carefully, but can it and the TensorFlow one start FROM the rocm-mpich one using Dockerfiles? I mean:

FROM rocm-mpich-base:3.4.3_ubuntu22.04
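
A minimal sketch of what such a layered Dockerfile could look like, assuming the base image is available locally under the tag quoted above. The package list and the PyTorch wheel index are illustrative assumptions, not the contents of the actual PR, which may build PyTorch differently (for example from source).

```dockerfile
# Hypothetical pytorch.dockerfile layered on the shared base image.
# The tag is the one quoted above; no registry path is given in this thread.
FROM rocm-mpich-base:3.4.3_ubuntu22.04

# Assumption: Python and pip are not guaranteed to be present in the base image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Assumption: PyTorch is installed from prebuilt ROCm wheels. The rocm5.6 index
# is chosen only to match the ROCm version mentioned in this comment.
RUN python3 -m pip install --no-cache-dir torch \
    --index-url https://download.pytorch.org/whl/rocm5.6
```

With this layout, a fix to the base image only requires rebuilding the base and then the images that start FROM it, which is the decoupling the separate pull requests are meant to give.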

Cheers.

dipietrantonio commented 1 year ago

Hi @AlexisEspinosaGayosso,

> would it be possible to separate this as three different pull requests?

Yes, I will do that; it makes more sense.

> For example, I think we are in perfect position of pulling the rocm one.

I am still working on that one, testing a few remaining things. I hope I can push it today to our Pawsey repository.

I will update this PR to be just about the rocm-mpich-base. I think the appropriate name should be ubuntu:22-rocm5.6.0-mpich3.4.3 as this is mainly a base Ubuntu image, containing a few needed libraries. We should have a discussion about naming conventions.

dipietrantonio commented 1 year ago

@pelahi @AlexisEspinosaGayosso this PR is ready for review.

dipietrantonio commented 11 months ago

I install libfabric as a dependency of https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl, which is needed for optimal performance of the ROCm Communication Collectives Library (RCCL). So this is an AMD-specific build. I might get away with installing libfabric and aws-ofi-rccl on top of the mpich-lustre container. I will try.
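
A rough sketch of that alternative, under several stated assumptions: the base image tag, the install prefixes, and the ROCm and MPICH locations are hypothetical placeholders, and the `configure` option names for aws-ofi-rccl are written from memory of that project's README and should be checked against it before use. The sketch also assumes ROCm is already present in the base image, which this thread does not confirm for the mpich-lustre container.

```dockerfile
# Hypothetical base image tag; the real mpich-lustre tag is not given in this thread.
ARG BASE_IMAGE=mpich-lustre:latest
FROM ${BASE_IMAGE}

# Hypothetical locations; adjust to wherever ROCm and MPICH live in the base image.
ARG ROCM_DIR=/opt/rocm
ARG MPICH_DIR=/usr
ARG LIBFABRIC_DIR=/opt/libfabric

# Build tools needed for both autotools-based projects.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       git ca-certificates build-essential autoconf automake libtool \
    && rm -rf /var/lib/apt/lists/*

# Build libfabric from source into its own prefix.
RUN git clone https://github.com/ofiwg/libfabric.git /tmp/libfabric \
    && cd /tmp/libfabric \
    && ./autogen.sh \
    && ./configure --prefix=${LIBFABRIC_DIR} \
    && make -j"$(nproc)" \
    && make install \
    && rm -rf /tmp/libfabric

# Build the aws-ofi-rccl plugin against libfabric, ROCm and MPICH.
# Assumption: the option names below match the project's README.
RUN git clone https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl.git /tmp/aws-ofi-rccl \
    && cd /tmp/aws-ofi-rccl \
    && ./autogen.sh \
    && CC=${ROCM_DIR}/bin/hipcc ./configure \
       --with-libfabric=${LIBFABRIC_DIR} \
       --with-hip=${ROCM_DIR} \
       --with-mpi=${MPICH_DIR} \
       --prefix=/opt/aws-ofi-rccl \
    && make -j"$(nproc)" \
    && make install \
    && rm -rf /tmp/aws-ofi-rccl

# Make the plugin and libfabric discoverable at run time.
ENV LD_LIBRARY_PATH=/opt/aws-ofi-rccl/lib:${LIBFABRIC_DIR}/lib:${LD_LIBRARY_PATH}
```

Keeping the locations as build arguments means the same file could be tried against either base image by passing `--build-arg BASE_IMAGE=...` at build time.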