17.SM-modelparallelv2 uses a PyTorch binary that depends on deprecated conda packages

The test case 17.SM-modelparallelv2 uses a custom PyTorch binary, pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0", which declares a dependency on aws-ofi-nccl >=1.7.1,<2.0. The aws-ofi-nccl package was expected to be consumed from the AWS PyTorch conda channel (https://aws-pytorch-doc.com/). Installation fails with the following error:

The following package could not be installed
└─ pytorch ==2.2.0 sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0 is not installable because it requires
   └─ aws-ofi-nccl >=1.7.1,<2.0 , which does not exist (perhaps a missing channel).

The conda channel has been deprecated, as mentioned in the deprecation announcement. The team that built pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0" should rebuild this binary and remove the dependency on aws-ofi-nccl >=1.7.1,<2.0.
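To make the constraint aws-ofi-nccl >=1.7.1,<2.0 concrete, here is a minimal Python sketch of how such a comma-separated version range can be evaluated. It is not conda's actual matching logic: it assumes purely numeric dotted versions, and the helper names (parse_version, satisfies) and the candidate versions in the usage lines are hypothetical, chosen only for illustration.

```python
import operator

# Comparison operators handled by this simplified sketch (assumption:
# conda's real spec syntax supports more forms than these).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn a numeric dotted version such as '1.7.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, constraint):
    """Return True if `version` meets every clause of `constraint`.

    `constraint` is a comma-separated list such as '>=1.7.1,<2.0';
    all clauses must hold for the version to be accepted.
    """
    candidate = parse_version(version)
    for clause in constraint.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                # Pad with zeros so '2.0' and '2.0.0' compare as equal.
                width = max(len(candidate), len(bound))
                left = candidate + (0,) * (width - len(candidate))
                right = bound + (0,) * (width - len(bound))
                if not _OPS[symbol](left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Hypothetical candidate versions checked against the range from the error.
print(satisfies("1.7.1", ">=1.7.1,<2.0"))
print(satisfies("2.0.0", ">=1.7.1,<2.0"))
```

Even a version inside this range cannot be installed if no configured channel provides the package at all, which is the situation the error message above describes.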

aws-samples / awsome-distributed-training, issue #457