Lightning-AI / ecosystem-ci

Automate issue discovery for your projects against Lightning nightly and releases.
https://devblog.pytorchlightning.ai/stay-ahead-of-breaking-changes-with-the-new-lightning-ecosystem-ci-b7e1cf78a6c7
Apache License 2.0

Test the ecosystem integration against GPU containers #38

Open titu1994 opened 2 years ago

titu1994 commented 2 years ago

🚀 Feature

NeMo runs its own CI on a PyTorch container from NGC (versioned as YY.MM), and these containers are generally available on other cloud providers too. Note that once PyTorch has a public release, it usually takes at least a month for the next container to actually ship that release. By "actually ship", I mean that until then the container carries an alpha build of PyTorch with some cherry-picked changes rather than the full public release.

This creates cases where improper version checking (using distutils instead of packaging.version.Version) fails against these alpha version strings and causes the PTL inside the container to pick incorrect code paths. So the ecosystem CI will pass ... but when you run the same code on a PyTorch container released by NVIDIA (i.e. on most cloud providers) it may fail, and not just NeMo: anything that uses PTL and hits that code path.
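As a hedged illustration (the version string below is hypothetical, and this is not the exact check PTL performs), this is the kind of disagreement that makes code-path selection go wrong:

```python
# Sketch only: compare a hypothetical NGC-style alpha torch version string
# using the two approaches mentioned above.
from distutils.version import LooseVersion  # legacy comparison, deprecated in 3.10 / removed in 3.12
from packaging.version import Version       # PEP 440-aware comparison

container_torch = "1.10.0a0+0aef44c"  # illustrative alpha build tag, as containers ship

# packaging knows "a0" marks a pre-release, so the alpha sorts BEFORE the final release.
print(Version(container_torch) >= Version("1.10.0"))            # False

# LooseVersion compares raw segments, so the same string sorts AFTER the final release.
print(LooseVersion(container_torch) >= LooseVersion("1.10.0"))  # True

# Any feature gate written as `if torch_version >= "1.10.0":` therefore takes a
# different branch depending on which comparison helper the library used.
```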

So maybe, as a separate test prior to release, run the ecosystem CI on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.

Motivation

For a current example of exactly how we have to patch around such an issue right now (w.r.t. PyTorch 1.10, NGC container 21.01, and PyTorch Lightning 1.5.9), see https://github.com/NVIDIA/NeMo/blob/8e15ba43ba0a17b456d3bfa09444574ef1faa301/Jenkinsfile#L70-L76, which works around an issue with torchtext.

For an extreme case of how bad things can get: we had to adaptively install torch, PTL, and NeMo dependencies based on whether the install occurred inside a container or not: https://github.com/NVIDIA/NeMo/blob/r1.0.0rc1/setup.py#L107-L146
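For readers who haven't hit this, a rough sketch of what such an adaptive install ends up looking like (this is not NeMo's actual setup.py logic, and the detection heuristics are assumptions):

```python
# Hypothetical sketch of container-aware dependency selection, NOT NeMo's real setup.py.
import os

def inside_pytorch_container() -> bool:
    # Heuristics only: NGC PyTorch images set NVIDIA_PYTORCH_VERSION, and most
    # Docker containers expose /.dockerenv. Real detection logic varies per project.
    return "NVIDIA_PYTORCH_VERSION" in os.environ or os.path.exists("/.dockerenv")

if inside_pytorch_container():
    # The image already bundles an (alpha) torch build; pinning torch here would
    # either downgrade it or conflict with the container's cherry-picked build.
    install_requires = ["pytorch-lightning>=1.5"]
else:
    # Bare-metal / pip / conda users get the full pinned stack.
    install_requires = ["torch>=1.10.0", "pytorch-lightning>=1.5"]
```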

Pitch

Maybe test the ecosystem CI (or even just PTL alone) on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.

Alternatives

Apart from manually patching the PTL source at install time, we haven't found a better solution than waiting a month or two until the container actually contains the latest code from the latest torch release.

Borda commented 2 years ago

> Maybe test the ecosystem CI (or even just PTL alone) on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.

I see, and it would be quite a useful feature to allow custom images... I think it would be very feasible for the GPU testing, which already runs a custom Docker image, but how much do you think is also needed for the base CPU testing?

One more point, and it is thinking out loud rather than complaining... we are talking about two kinds of users: (a) those who rely heavily on containers/Docker [probably corporate users] and (b) casual users who mostly install from the PyPI/Conda registries... I would say we'd like to serve both, so for example for NeMo I would include testing for both (a) and (b) :rabbit:

titu1994 commented 2 years ago

CPU testing should be done only on CPU instances when possible, to avoid incurring GPU runtime costs. The containers are mostly tailored towards GPUs, since the primary use case for containers is multi-GPU or multi-node deployments for training.

For NeMo, the vast majority of users can get by with a conda env and no Docker, so I do agree that it's a niche problem for a subset of users.

That subset of users does include the entire NeMo research team plus a few external research teams, who build containers and run multi-node jobs on clusters. So a breakage of support usually means we hold off on upgrading our containers for 1-2 months.

Also, NeMo's CI tests run inside the container, but we simulate the user's install environment, i.e. we start from a bare-bones PyTorch base container and then follow the regular pip and conda install steps. Of course the torch environment itself comes from a container, which is not what normal users will face, but it's still close to real-world install scenarios.