Looks like a PyTorch error, which I'm not familiar with. I don't see any NCCL WARN, so it doesn't look like NCCL failed. Note you're using ndv4-topo.xml, which is not the right topology file for the Azure instance you are using, so you might be following a recipe that was not intended for this type of instance.
@sjeaugey Good point, Sylvain. I am new to this. Is there a resource you could suggest if my topology doesn't match ndv4-topo.xml, or is a topology file necessary at all? Can NCCL not figure out the topology automatically?
I have 4 nodes in a cluster, and each node has 4 K80 GPUs.
ndv4-topo.xml is for the "NDv4" instance type. I don't think there is a file for your kind of instance, so the best is to not set NCCL_TOPO_FILE at all.
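For example (a minimal sketch, assuming the variable might be inherited from the job environment rather than set in your own code), you can clear it before anything initializes NCCL:

import os

# Clear any inherited topology file before the process group (and thus NCCL)
# initializes; with NCCL_TOPO_FILE unset, NCCL detects the node topology itself.
os.environ.pop("NCCL_TOPO_FILE", None)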
@sjeaugey Thanks for checking. Yes, I am not using it; it is commented out (the original template says to uncomment it if using A100):
(base) mona@ard-gpu-01:~/ARD-ML-CIFAR10$ rg NCCL_TOPO_FILE
data-science/environment/Dockerfile
32:# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"
mlops/azureml/train/pipeline.yaml
52: # NCCL_TOPO_FILE: "/opt/microsoft/ndv4-topo.xml" # Use specific topology file for A100
OK. Still, back to my original point: it seems PyTorch is failing somehow in network communication. I don't know how to fix that, though.
From the log you posted, looks like c10d's out-of-band exchange of the ncclUniqueId over TCP is timing out. I don't think this has to do with NCCL. I am not familiar with Azure, but I would check things like Security Group rules to make sure the right ports are open. Are you able to run this with a different ProcessGroup in PyTorch (Gloo or MPI, perhaps)?
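For example, a minimal sketch of such a check (assuming the script is launched with torchrun, so MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are already in the environment):

import torch
import torch.distributed as dist

# Same TCP rendezvous as the NCCL run, different backend: if this also
# times out, the problem is network/port connectivity, not NCCL itself.
dist.init_process_group(backend="gloo")
t = torch.ones(1)
dist.all_reduce(t)  # a trivial collective to confirm the ranks can talk
print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")
dist.destroy_process_group()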
Hey @monajalal, did you manage to fix the issue? I'm getting the same error on an Azure compute node with 8x A100s with FSDP distributed training.
I tried increasing the timeout and the problem was solved:
import datetime
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(days=2))
The cause in my case was that the codebase used the old args.local_rank syntax. Once we converted to local_rank = int(os.environ["LOCAL_RANK"]), the issue disappeared. The timeout change didn't work for us.
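In other words, a short sketch of the two patterns (--local_rank is the argument the deprecated torch.distributed.launch injected; LOCAL_RANK is the environment variable torchrun exports):

import os

# Old style: torch.distributed.launch passed a --local_rank argument:
#   parser.add_argument("--local_rank", type=int)
#   local_rank = args.local_rank
# New style: torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE):
local_rank = int(os.environ["LOCAL_RANK"])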
I am using the SFTTrainer to do LoRA fine-tuning of LLMs, and this error occurs while the dataset is being mapped in the SFTTrainer object. SFTTrainer automatically maps your data when you create the trainer object, and if I use a somewhat larger dataset the code fails at that step with this error. Note: it was working on a small dataset; the problem appears as the data size grows, so in my case none of the environment variables solved the problem. I solved it by moving SFTTrainer's dataset-map function out to another part of the script, doing the mapping there, feeding the mapped dataset to the trainer object, and also passing dataset_kwargs={"skip_prepare_dataset": True} to the SFTTrainer when creating the object. Examples below.

---- Creating the SFTTrainer Object ----
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_mapped,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    dataset_kwargs={"skip_prepare_dataset": True},
)
---- Map Function Taken Out of SFTTrainer ----
import warnings  # used by the remove_unused_columns warning below


def _prepare_non_packed_dataloader(
    tokenizer,
    dataset,
    dataset_text_field,
    max_seq_length,
    formatting_func=None,
    add_special_tokens=True,
    remove_unused_columns=True,
):
    use_formatting_func = formatting_func is not None and dataset_text_field is None
    _dataset_sanity_checked = False

    # Inspired from: https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt
    def tokenize(element):
        # nonlocal is needed so the flag set below persists across calls
        nonlocal _dataset_sanity_checked
        outputs = tokenizer(
            element[dataset_text_field] if not use_formatting_func else formatting_func(element),
            add_special_tokens=add_special_tokens,
            truncation=True,
            padding=False,
            max_length=max_seq_length,
            return_overflowing_tokens=False,
            return_length=False,
        )

        if use_formatting_func and not _dataset_sanity_checked:
            if not isinstance(formatting_func(element), list):
                raise ValueError(
                    "The `formatting_func` should return a list of processed strings since it can lead to silent bugs."
                )
            else:
                _dataset_sanity_checked = True

        return {"input_ids": outputs["input_ids"], "attention_mask": outputs["attention_mask"]}

    signature_columns = ["input_ids", "labels", "attention_mask"]
    extra_columns = list(set(dataset.column_names) - set(signature_columns))

    if not remove_unused_columns and len(extra_columns) > 0:
        warnings.warn(
            "You passed `remove_unused_columns=False` on a non-packed dataset. This might create some issues with "
            "the default collator and yield to errors. If you want to inspect dataset other columns (in this case "
            f"{extra_columns}), you can subclass `DataCollatorForLanguageModeling` in case you used the default "
            "collator and create your own data collator in order to inspect the unused dataset columns."
        )

    tokenized_dataset = dataset.map(
        tokenize,
        batched=True,
        remove_columns=dataset.column_names if remove_unused_columns else None,
        num_proc=None,
        batch_size=1000,
    )

    return tokenized_dataset
dataset_mapped = _prepare_non_packed_dataloader(tokenizer, dataset, "text", max_seq_length)
I am using an Azure GPU cluster with 4 nodes, each with 4 K80 GPUs (16 GPUs total).
train-env.yaml:

# for local testing (cpu)
torchvision==0.12.0
torch==1.11.0
transformers==4.18.0

# for metrics reporting/plotting
mlflow==2.3.2
azureml-mlflow==1.50.0
matplotlib==3.5.2
tqdm==4.64.0
psutil==5.9.0

# for unit testing
pytest==7.1.2
and Dockerfile:

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
FROM nvcr.io/nvidia/pytorch:22.04-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
        MPI=1 MPI_HOME=/opt/hpcx/ompi \
        NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
        CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of InfiniBand networks in virtualized environments.
ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"