coreweave / ml-containers

MIT License

feat(torch-extras): Add `--distributed_*` and `--group_norm` to bundled Apex, fix CI on updates #34

Closed Eta0 closed 1 year ago

Eta0 commented 1 year ago

Expand Apex & CI Reliability

This change adds the `--distributed_adam`, `--distributed_lamb`, and `--group_norm` build options to the version of Apex bundled with torch-extras. Apex is updated to its latest release to add `--group_norm` support.
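As a rough illustration of how extra Apex build options get passed through, the sketch below assembles a `pip install` argument list with one `--config-settings "--build-option=..."` pair per flag, which is the pattern the Apex README describes. The helper name `apex_pip_args` and the `./apex` path are hypothetical, and this is not necessarily how torch-extras invokes the build.

```python
# Build options named in this PR; the list itself comes from the description above.
APEX_FLAGS = ["--distributed_adam", "--distributed_lamb", "--group_norm"]


def apex_pip_args(source_dir, flags):
    """Return a pip command line that forwards each flag as a build option.

    Hypothetical helper: the real torch-extras build script may differ.
    """
    args = ["pip", "install", "-v", "--no-cache-dir", "--no-build-isolation"]
    for flag in flags:
        # Each Apex option is forwarded to setup.py as its own build option.
        args += ["--config-settings", f"--build-option={flag}"]
    args.append(source_dir)
    return args


if __name__ == "__main__":
    # Print the command rather than running it; building Apex needs CUDA and torch.
    print(" ".join(apex_pip_args("./apex", APEX_FLAGS)))
```

The command is only printed here, so the sketch can be checked without a CUDA toolchain.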

This also includes fixes for torch-extras's CI so that it automatically rebuilds itself from the most recent main-branch ml-containers/torch builds when:

  1. The torch-extras/** code itself is changed, and
  2. ml-containers/torch is not also being rebuilt simultaneously.
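The two conditions above amount to a simple conjunction. A minimal sketch, assuming the CI can see the list of changed paths and a flag saying whether a torch rebuild is happening in the same run (the function name and both inputs are hypothetical, not taken from the workflow files):

```python
def should_self_trigger(changed_paths, torch_also_rebuilding):
    """Decide whether torch-extras should rebuild on its own.

    changed_paths: repository-relative paths touched by the change.
    torch_also_rebuilding: True if ml-containers/torch is being rebuilt too,
    in which case the torch build is expected to trigger torch-extras itself.
    """
    extras_changed = any(p.startswith("torch-extras/") for p in changed_paths)
    return extras_changed and not torch_also_rebuilding


if __name__ == "__main__":
    print(should_self_trigger(["torch-extras/Dockerfile"], False))
    print(should_self_trigger(["torch-extras/Dockerfile"], True))
```

When both directories change at once, the function returns `False` and the rebuild is left to the run started by the new torch build.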

Previously, it could rebuild only on new builds of ml-containers/torch or on manual workflow dispatch, because we had no way to discover the correct base images to use or to handle being triggered concurrently with an ml-containers/torch build.

It will still rebuild from either the torch:base or the torch:nccl subset of images as necessary, as long as that rebuild doesn't overlap with a concurrent run of the torch-base.yml or torch-nccl.yml workflow, respectively. Eliminating overlap is necessary to avoid producing two images with identical tags but different content, which would overwrite each other when published and leave the actual contents of the image indeterminate.
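The per-subset rule can be sketched as a filter: each image subset is skipped only when its own torch workflow is running concurrently. The workflow file names come from the description above; the mapping, the function name, and the idea of passing in a set of running workflows are assumptions made for illustration.

```python
# Which torch workflow produces each image subset (names from the PR description).
SUBSET_WORKFLOWS = {"base": "torch-base.yml", "nccl": "torch-nccl.yml"}


def subsets_to_rebuild(requested, running_workflows):
    """Keep only the subsets whose torch workflow is not running right now.

    Skipping an overlapping subset avoids publishing two different images
    under the same tag.
    """
    return [s for s in requested if SUBSET_WORKFLOWS[s] not in running_workflows]


if __name__ == "__main__":
    # torch-nccl.yml is mid-run, so only the base subset is rebuilt here.
    print(subsets_to_rebuild(["base", "nccl"], {"torch-nccl.yml"}))
```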

This also removes the label distinction between our large and standard CI runners in all workflow files. The distinction no longer exists on the runners themselves, and the stale labels were blocking the torch-extras builds that requested them.