aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

17.SM-modelparallelv2 uses pytorch binary that depends on deprecated conda packages #457

Open junpuf opened 1 month ago

junpuf commented 1 month ago

The test case 17.SM-modelparallelv2 uses a custom PyTorch binary, pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0", which declares a dependency on aws-ofi-nccl >=1.7.1,<2.0. The expectation was that the aws-ofi-nccl package would be consumed from the AWS PyTorch conda channel (https://aws-pytorch-doc.com/).

The following package could not be installed
└─ pytorch ==2.2.0 sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0 is not installable because it requires
   └─ aws-ofi-nccl >=1.7.1,<2.0 , which does not exist (perhaps a missing channel).
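For readers unfamiliar with the spec syntax in the error above, here is a minimal sketch of how a `name=version=build` string breaks apart and what a constraint such as `>=1.7.1,<2.0` accepts. The helper names `split_spec` and `satisfies` are hypothetical and written only for illustration. This is not conda's actual resolver, and it assumes plain dotted numeric versions.

```python
import operator

# Simplified illustration only: conda's real spec parsing and version
# ordering handle many more cases (wildcards, pre-releases, epochs, ...).
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def split_spec(spec):
    """Split a 'name=version=build' string into (name, version, build)."""
    name, version, build = spec.split("=", 2)
    return name, version, build


def as_tuple(version):
    """Turn a dotted numeric version such as '1.7.1' into (1, 7, 1)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint):
    """Return True if version meets every comma-separated clause."""
    for clause in constraint.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = as_tuple(clause[len(symbol):].strip())
                if not OPS[symbol](as_tuple(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True
```

With these helpers, `split_spec("pytorch=2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0")` separates the package name, the version `2.2.0`, and the long build string, while `satisfies("1.7.1", ">=1.7.1,<2.0")` is true and `satisfies("2.0.0", ">=1.7.1,<2.0")` is false. The solver error is a different situation: no aws-ofi-nccl build at all is visible in the configured channels, so there is nothing to test against the constraint.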

The conda channel has been deprecated, as mentioned in the deprecation announcement. It is recommended that the team who built pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0" rebuild this binary and remove the dependency on aws-ofi-nccl >=1.7.1,<2.0.

junpuf commented 1 month ago

453