Chris113113 closed this 6 months ago
Mostly LGTM. In addition to my comments, two things:
- From this PR it looks like we are adjusting this example to run Llama-2-13B instead of Llama-2-70B. Just want to double-check that this is intentional.
- Could you attach a link (using short-gen) showing where you were able to run this workload?
Good catch on 13B; I was using it to experiment.
http://shortn/_klGl5LKuQm, added to description.
The PR's primary purpose is updating lit-gpt's commit to a PyTorch 2.2 commit. This also comes with a few other things:
- Logs from new image: http://shortn/_klGl5LKuQm
- Changes around `flash_attn`, resulting in more reliable lit-gpt image builds.