I noticed that you are using G4dn instances. Unfortunately, the plugin doesn't support the g4dn platform; we only support the p3dn and p4d platforms.
Looking at the libfabric error,
libfabric:465:1673041637:efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 21, prov_err: Unknown error -21 (21)
@wzamazon Can you help reason out the error code?
@rashikakheria Thanks for the response.
After reaching out to AWS support, I found out that the security group must have an ingress and an egress rule that allow all traffic when the source/destination is the same security group (i.e., a self-referencing rule). This rule must be added even if a rule such as
| IP version | Type | Protocol | Port range | Destination |
| --- | --- | --- | --- | --- |
| IPv4 | All traffic | All | All | 0.0.0.0/0 |
is present. After adding the rule, the instances are able to communicate with each other using EFA.
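For anyone hitting the same issue, here is a minimal sketch of adding those self-referencing rules with boto3 (an assumption on my part; the console or CLI works just as well). The security group ID and region are placeholders, and this assumes both instances are attached to the same security group, as in my setup:

```python
# Sketch: add self-referencing "all traffic" ingress and egress rules to the
# security group shared by both training instances. Placeholder values only.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")  # placeholder region
sg_id = "sg-0123456789abcdef0"                      # placeholder security group ID

# Allow all inbound traffic from members of the same security group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",  # all protocols, all ports
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)

# Allow all outbound traffic to members of the same security group.
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)
```

This matches the EFA requirement that the security group allow all inbound and outbound traffic to and from itself; the broad 0.0.0.0/0 egress rule alone is not sufficient.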
Hello aws_ofi_nccl maintainers,
Please let me know if this is not the best location to post the issue and I will close this issue.
I am unable to figure out why the process is hanging after the error message is shown.
My training setup: two ml.g4dn.12xlarge instances on AWS SageMaker running distributed training with the PyTorch base image `763104351884.dkr.ecr.us-west-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker`. The two instances run inside a private subnet with a NAT gateway attached to the subnet. All outputs below are from host-1.

Output of `lspci -i efa`:
Output of `cat /opt/amazon/efa_installed_packages`:
Output of `/opt/amazon/efa/bin/fi_info -p efa`:
Output of training job:

Distributed training is initialized with the nccl backend in PyTorch using the mmaction2 training library. I set FI_EFA_USE_DEVICE_RDMA=0 because the T4 GPUs do not support GPUDirect RDMA. The command is run via os.system() in the entrypoint passed to SageMaker:
cmd=
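The actual cmd is elided above. For context, here is a rough, hypothetical sketch of the launch pattern I just described (environment variable plus os.system()); the torch.distributed.launch arguments, master address, and config path are placeholders, not my real command:

```python
# Hypothetical sketch of the entrypoint launch pattern described above.
# All values below (node counts, addresses, config path) are placeholders,
# not the actual command used in my training job.
import os

# T4 GPUs on g4dn do not support GPUDirect RDMA, so disable it for EFA.
os.environ["FI_EFA_USE_DEVICE_RDMA"] = "0"

# mmaction2's tools/train.py initializes torch.distributed with the nccl
# backend when invoked with --launcher pytorch.
cmd = (
    "python -m torch.distributed.launch "
    "--nproc_per_node=4 --nnodes=2 --node_rank=0 "
    "--master_addr=algo-1 --master_port=29500 "
    "tools/train.py configs/my_config.py --launcher pytorch"
)
exit_code = os.system(cmd)
raise SystemExit(exit_code >> 8)  # propagate the child's exit status
```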
I see the same error on the algo-2 instance as well.
PyTorch version and helper output from mmaction2: