Open Satwato opened 1 year ago
@Satwato Hello. Can you give us more info about your dataset size? How many parquet files do you have, and are you training your model with the sliding-window approach used in our examples, meaning you split the data by day?
Code is pretty much the same as the examples.
Are you referencing the multi-GPU example?
Are you able to run your model on a single GPU, and does it work fine? You can test with a small custom dataset first.
Looks like there is a timeout argument; the default is 30 minutes (https://discuss.pytorch.org/t/multi-gpu-training-timeout-error-worknccl-optype-allgather-timeout/169435):
torch.distributed.init_process_group(backend=None, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None)
Can you increase that and test?
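A minimal sketch of raising the timeout, assuming the process group is initialized manually before the trainer is built. The `nccl` backend, the 2-hour value, and both helper names are illustrative assumptions. When the Hugging Face `Trainer` creates the process group itself, its `ddp_timeout` training argument (in seconds) may be the analogous knob, if your version exposes it.

```python
import datetime


def make_timeout(hours: float) -> datetime.timedelta:
    """Build the timedelta that init_process_group expects for `timeout`."""
    return datetime.timedelta(hours=hours)


def init_with_longer_timeout(hours: float = 2.0) -> None:
    """Initialize the default process group with a longer collective timeout.

    Assumes the script is launched with torchrun, which sets RANK,
    WORLD_SIZE, MASTER_ADDR and MASTER_PORT for the default env:// init.
    """
    import torch.distributed as dist  # lazy import; needs a torch install

    dist.init_process_group(
        backend="nccl",  # assumption: GPU training with NCCL
        timeout=make_timeout(hours),
    )


# The default of 1800 seconds corresponds to 0.5 hours.
assert make_timeout(0.5) == datetime.timedelta(seconds=1800)
```

A longer timeout only helps if the ranks are slow; it will not fix ranks that are waiting on different collectives.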
@rnyak Thanks for the reply. So I tried on various dataset sizes but get the same result. I am setting pretrained embeddings of size 128. And I am training on only a single day of data.
Yes I am referencing the multi gpu example.
On the single-GPU example, GPU memory keeps slowly increasing until it goes OOM. I tried a 16 GB GPU (Tesla T4) and a 24 GB one (I guess an A5000) but got the OOM error on both; decreasing the batch size just delayed the error.
But if I run the multi-GPU code with --nproc_per_node 1, everything works fine.
The timeout could be a factor. Will try. But is it possible that one of the GPUs is not returning anything or is going into a deadlock?
Also, when using a single parquet file, it works with the multi-GPU setup. But when using 4 such parquet files, it gets stuck again. A single parquet file is around 70 MB on disk with roughly 10000 rows. When loaded to the GPU using cudf, memory_usage shows it takes 1003 MB, but nvidia-smi shows 2.2 GB.
@Satwato Thanks for the additional info. Are you able to train on your dataset with only a single 24 GB GPU? Is your intention to speed up training time?
but on checking from nvidia-smi it shows 2.2 gb
Please note that importing torch itself also occupies GPU memory. You can visit this page for more details about memory consumption.
One thing we also noticed is that if the number of batches is not evenly distributed over workers, the process can freeze. So can you set max_steps in tr.trainer.T4RecTrainingArguments(), remove num_epochs, and test multi-GPU training again?
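A minimal sketch of picking a max_steps value that every worker can reach, assuming the total row count is known, rows are split evenly across workers, and incomplete batches are dropped. The function name and the example numbers are hypothetical; with unevenly sized partitions the safe value may be lower.

```python
def even_max_steps(num_rows: int, world_size: int, per_device_batch_size: int,
                   num_epochs: int = 1) -> int:
    """Number of optimizer steps every rank can complete without running dry.

    Assumes rows are split evenly across ranks (any remainder is dropped)
    and incomplete batches are dropped, so no rank waits on a collective
    that another rank never issues.
    """
    rows_per_rank = num_rows // world_size
    steps_per_epoch = rows_per_rank // per_device_batch_size
    return steps_per_epoch * num_epochs


# Example: 10_000 rows, 4 GPUs, batch size 16 -> 2500 rows per rank -> 156 steps.
print(even_max_steps(10_000, 4, 16))
```

The result would be passed as max_steps in place of an epoch count.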
Besides, if single parquet works on multi-gpu, can you for now proceed with that while we investigate that further?
thanks.
@rnyak With 1 file, it only works for a small file (70 MB), not for my complete data, which would be on the order of hundreds of GB. It doesn't even work with a 500 MB file (OOM).
As for getting stuck in training (using epochs instead of max_steps): after rewriting the whole dataset using nvtabular, it somehow completes all the steps (which it was not able to do earlier), but it gets stuck again as soon as the training loop ends and a function called store_flos kicks in, which calls dist.all_gather internally.
Using max_steps resolves the freezing issue, but training reaches OOM much faster.
@Satwato Looks like it is stuck in the evaluation step? Can you share the error stack you get when you use max_steps?
after rewriting the whole data using nvtabular,
Not sure why this was required. Can you elaborate on that?
The OOM issue is something else; we can talk more once you share your full error message. What docker image are you using? Also, can you please share your training script here? Thanks.
Hi @rnyak colleague of @Satwato here,
looks like it is stuck in the evaluation step?
It didn't get stuck in evaluation, as we had set the flags so that there would be no evaluation. Also, using a debugger we were able to see that the script gets stuck in the inner_training_loop function.
can you share the error stack what you are getting when you use max_steps?
I wouldn't call it an error; it looks like a memory leak that lets the model run for a while (~30 mins in our case) before it suddenly goes OOM.
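A rough sketch of how such growth could be quantified, assuming device-wide memory is sampled every few steps. `growth_per_step` and `used_device_bytes` are hypothetical helpers, and the readings below are synthetic, not measurements from this run. `torch.cuda.mem_get_info()` reports memory for the whole device, so allocations made outside PyTorch's allocator are counted as well.

```python
from typing import List, Tuple


def growth_per_step(samples: List[Tuple[int, int]]) -> float:
    """Least-squares slope of used bytes versus training step."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_step = sum(s for s, _ in samples) / n
    mean_used = sum(u for _, u in samples) / n
    cov = sum((s - mean_step) * (u - mean_used) for s, u in samples)
    var = sum((s - mean_step) ** 2 for s, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one step")
    return cov / var


def used_device_bytes() -> int:
    """Device-wide used memory, so non-PyTorch allocations are counted too."""
    import torch  # lazy import; requires a CUDA-enabled torch install

    free, total = torch.cuda.mem_get_info()
    return total - free


# Synthetic readings: 1 GiB at step 0, growing by 1 MiB every 100 steps.
readings = [(step, 2**30 + (step // 100) * 2**20) for step in range(0, 1001, 100)]
print(f"{growth_per_step(readings):.1f} bytes/step")
```

A slope that stays positive over thousands of steps points to a leak; a slope that flattens out suggests caches are simply filling up.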
As for:
after rewriting the whole data using nvtabular,
not sure why this was required?
We are creating the data using pyspark, which stores it in equally sized parquet partitions. Using that dataset, we were getting stuck in the training loop itself. Only after rewriting the data using nvtabular.Dataset(<read_path>).to_parquet(<write_path>) and using the new data were we able to exit the train loop of the Hugging Face Transformers Trainer class, i.e. out of this loop #1881.
Later the script halts at #2070.
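One possible reason the partition layout matters under multi-GPU training is that ranks can end up with different numbers of batches. The sketch below only illustrates that effect: the round-robin assignment and the `batches_per_rank` helper are assumptions for illustration, not the Merlin dataloader's actual logic.

```python
from typing import List


def batches_per_rank(partition_rows: List[int], world_size: int,
                     batch_size: int) -> List[int]:
    """Full batches each rank would see if partitions are dealt round-robin.

    The round-robin assignment is an assumption made for illustration; it is
    not taken from the Merlin dataloader's implementation.
    """
    rows = [0] * world_size
    for i, n in enumerate(partition_rows):
        rows[i % world_size] += n
    return [r // batch_size for r in rows]


# Five partitions over four ranks: rank 0 receives two partitions.
counts = batches_per_rank([1000, 1000, 1000, 1000, 400], world_size=4, batch_size=16)
print(counts, "balanced" if len(set(counts)) == 1 else "unbalanced")
```

If the counts differ, the rank with fewer batches leaves the loop first while the others still wait on a gradient all-reduce.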
what docker image are you using?
Not using a docker image, we are using pip packages:
Hugging Face Transformer: 4.30.2
transformers4rec: 23.6.0
merlin-core: 23.6.0
merlin-dataloader: 23.6.0
nvtabular: 23.5.0
torch: 2.0.1
Cuda Version: 11.8
GPU: 4 * Tesla-T4 (16GiB each)
can you please share your training script here?
This is the Script we are using:
import os
import sys
os.environ["NCCL_DEBUG"]="INFO"
os.environ["NCCL_DEBUG_SUBSYS"]="ALL"
os.environ["TORCH_DISTRIBUTED_DEBUG"]="INFO"
os.environ["TRANSFORMERS_VERBOSITY"] = "debug"
import glob
import pickle
from typing import List, Optional, Union
import torch
import numpy as np
import pandas as pd
import cupy
from merlin.schema import Schema, Tags
from merlin.io import Dataset
from transformers4rec import torch as tr
from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt
from transformers4rec.torch.utils.examples_utils import wipe_memory
from transformers4rec.torch.utils.data_utils import MerlinDataLoader
rank = int(os.environ["LOCAL_RANK"])
cupy.cuda.Device(rank).use()
print(f"Using Rank {rank}")
train = Dataset("<INPUT_DATA_PATH>")
schema = train.schema
del train
def function_to_add_meta_info_in_schema(schema: Schema) -> Schema:
    # set tags
    # set properties (min, max, value_count, cardinality)
    return schema  # placeholder: return the updated schema
my_schema = function_to_add_meta_info_in_schema(schema)
max_seq_len = 30
cont_proj = 128
d_out = 128
aggr = "concat"
masking_type = "rtd"
emb_dims = {"item_id": 128}
infer_embedding_sizes = True
train_data_dir = "<TRAIN_DATA_PATH>"
eval_data_dir = "<EVAL_DATA_PATH>"
inputs = tr.TabularSequenceFeatures.from_schema(
    my_schema,
    max_sequence_length=max_seq_len,
    continuous_projection=cont_proj,
    aggregation=aggr,
    masking=masking_type,
    d_output=d_out,
    infer_embedding_sizes=infer_embedding_sizes,
    embedding_dims=emb_dims,
)
def set_pretrained_embeds(inputs):
    X = np.load("<EMBED_READ_PATH>")
    weight_dtype = inputs.categorical_module.embedding_tables[
        "item_id"
    ].weight.dtype
    pretrained_embeds = torch.tensor(X, device="cpu", dtype=weight_dtype)
    del X
    assert pretrained_embeds.shape == inputs.categorical_module.embedding_tables["item_id"].weight.shape
    with torch.no_grad():
        inputs.categorical_module.embedding_tables["item_id"].weight.copy_(
            pretrained_embeds
        )
    inputs.categorical_module.embedding_tables["item_id"].requires_grad = False
    inputs.categorical_module.embedding_tables[
        "item_id"
    ].weight.requires_grad = False
    return inputs
print("setting pre trained embedding")
inputs = set_pretrained_embeds(inputs)
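try:
    batch_size = int(sys.argv[1])
except (IndexError, ValueError):
    # fall back to a default when no valid batch size is passed on the command line
    batch_size = 16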
# model config
num_transformer_heads = 4
num_transformer_layers = 2
mlp_units = [d_out]
# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_out,
    n_head=num_transformer_heads,
    n_layer=num_transformer_layers,
    total_seq_length=max_seq_len,
)
body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock(mlp_units),
    tr.TransformerBlock(transformer_config, masking=inputs.masking),
)
metrics = [
    NDCGAt(top_ks=[10, 20], labels_onehot=True),
    RecallAt(top_ks=[10, 20], labels_onehot=True),
]
# metrics = []
prediction_task = tr.NextItemPredictionTask(weight_tying=True, metrics=metrics)
head = tr.Head(
    body,
    prediction_task,
    inputs=inputs,
)
# Get the end-to-end model
# model = transformer_config.to_torch_model(inputs, prediction_task)
model = tr.Model(head)
training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./ckpt_path",
    max_sequence_length=max_seq_len,
    data_loader_engine="merlin",
    dataloader_pin_memory=False,
    dataloader_num_workers=30,
    logging_strategy="no",
    eval_accumulation_steps=None,
    max_steps=58900*4,
    dataloader_drop_last=True,
    per_device_train_batch_size=batch_size,
    weight_decay=1e-4,
    learning_rate=5e-4,
    fp16=True,
    report_to=[],
    no_cuda=False,
    local_rank=rank,
    evaluation_strategy="no",
    save_strategy="no",
    eval_steps_on_train_set=0,
)
train_paths = glob.glob(f"{train_data_dir}/*.parquet")
# eval_paths = glob.glob(f"{eval_data_dir}/*.parquet")
print("Num Training paths", len(train_paths))
# print("Num Eval paths", len(eval_data_dir))
if training_args.local_rank != -1:
    device = local_rank = training_args.local_rank
    global_size = training_args.world_size
else:
    device = local_rank = None
    global_size = None
train_loader = MerlinDataLoader.from_schema(
    my_schema,
    max_sequence_length=max_seq_len,
    paths_or_dataset=train_paths,
    batch_size=training_args.train_batch_size,
    drop_last=True,
    shuffle=True,
    reader_kwargs={"part_size": "300MB"},
    buffer_size=4,
    parts_per_chunk=2,
    row_groups_per_part=4,
    global_rank=local_rank,
    global_size=global_size,
    device=device,
)
# eval_loader = MerlinDataLoader.from_schema(
# my_schema,
# # cpu=True,
# max_sequence_length=max_seq_len,
# paths_or_dataset=eval_paths,
# batch_size=training_args.eval_batch_size,
# drop_last=True,
# shuffle=False,
# reader_kwargs={"part_size": "300MB"},
# buffer_size=4,
# parts_per_chunk=2,
# row_groups_per_part=None,
# global_rank=local_rank,
# global_size=global_size,
# device = device,
# )
trainer = tr.Trainer(
    model=model,
    train_dataloader=train_loader,
    # eval_dataloader=eval_loader,
    args=training_args,
    schema=my_schema,
    compute_metrics=False,
)
print("Starting Training")
trainer.reset_lr_scheduler()
trainer.train()
trainer.state.global_step += 1
print("finished")
# trainer.evaluate()
# Save Model
model_path = "<MODEL_SAVE_PATH>"
#
# trainer.save_model(model_path)
model.save(model_path)
# train_loader.dataset.stop()
wipe_memory()
In Short:
1) We get OOM with batch_size > 4 on longer training runs, i.e. full-data training (the threshold is 4 on multi-GPU and 12 on a single GPU, for both epoch and max_steps training).
2) The script gets stuck in the training loop when using pyspark-generated data (only on multi-GPU epoch training).
3) The script gets stuck in store_flos when using nvtabular-rewritten data (only on multi-GPU epoch training).
traceback using epoch, multi-gpu using re-written dataset:
100%|█████████▉| 300/301 [01:02<00:00, 7.63it/s]<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 388 sendbuff 0x7f5e0eebec00 recvbuff 0x7f5e0eebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 388 sendbuff 0x7f158eebec00 recvbuff 0x7f158eebec00 count 497280 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255199:255540 [0] NCCL INFO AllReduce: opCount 388 sendbuff 0x7fe806ebec00 recvbuff 0x7fe806ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 388 sendbuff 0x7f9b38ebec00 recvbuff 0x7f9b38ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 389 sendbuff 0x7f5e0ea4ae00 recvbuff 0x7f5e0ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 389 sendbuff 0x7f158ea4ae00 recvbuff 0x7f158ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38a sendbuff 0x7f6087dff800 recvbuff 0x7f6087dff800 count 61 datatype 2 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38a sendbuff 0x7f180bdff800 recvbuff 0x7f180bdff800 count 61 datatype 2 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255199:255540 [0] NCCL INFO AllReduce: opCount 389 sendbuff 0x7fe806a4ae00 recvbuff 0x7fe806a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255199:255540 [0] NCCL INFO AllReduce: opCount 38a sendbuff 0x7fea835ff800 recvbuff 0x7fea835ff800 count 61 datatype 2 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 389 sendbuff 0x7f9b38a4ae00 recvbuff 0x7f9b38a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38a sendbuff 0x7f9db3dff800 recvbuff 0x7f9db3dff800 count 61 datatype 2 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
100%|██████████| 301/301 [01:02<00:00, 7.59it/s]
Training completed. Do not forget to share your model on huggingface.co/models =)
<merlin-training>:255199:255199 [0] NCCL INFO AllGather: opCount 38b sendbuff 0x7fe6058d9800 recvbuff 0x7fe6641ff600 count 4 datatype 0 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38b sendbuff 0x7f5e0eebec00 recvbuff 0x7f5e0eebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38b sendbuff 0x7f158eebec00 recvbuff 0x7f158eebec00 count 497280 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38b sendbuff 0x7f9b38ebec00 recvbuff 0x7f9b38ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38c sendbuff 0x7f5e0ea4ae00 recvbuff 0x7f5e0ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38c sendbuff 0x7f158ea4ae00 recvbuff 0x7f158ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38d sendbuff 0x7f6087dff800 recvbuff 0x7f6087dff800 count 61 datatype 2 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38d sendbuff 0x7f180bdff800 recvbuff 0x7f180bdff800 count 61 datatype 2 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38c sendbuff 0x7f9b38a4ae00 recvbuff 0x7f9b38a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38d sendbuff 0x7f9db3dff800 recvbuff 0x7f9db3dff800 count 61 datatype 2 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255199:255476 [0] transport/net_socket.cc:505 NCCL WARN NET/Socket : peer 10.10.38.252<47552> message truncated : receiving 124928 bytes instead of 65536. If you believe your socket network is in healthy state, there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
<merlin-training>:255199:255476 [0] NCCL INFO include/net.h:35 -> 5
<merlin-training>:255199:255476 [0] NCCL INFO transport/net.cc:1034 -> 5
<merlin-training>:255199:255476 [0] NCCL INFO proxy.cc:520 -> 5
<merlin-training>:255199:255476 [0] NCCL INFO proxy.cc:684 -> 5 [Proxy Thread]
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
<merlin-training>:255200:255473 [1] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer <merlin-training>.ap-south-1.compute.internal<40048>
<merlin-training>:255200:255473 [1] NCCL INFO transport/net_socket.cc:493 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO include/net.h:35 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO transport/net.cc:1034 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO proxy.cc:520 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO proxy.cc:684 -> 6 [Proxy Thread]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255201 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255202 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 255199) of binary: /home/ubuntu/miniconda3/envs/merlin_env_2/bin/python
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/merlin_env_2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========================================================
all_feat_training_multi_row_part.py FAILED
---------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-27_20:36:57
host : <merlin-training>.ap-south-1.compute.internal
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 255199)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 255199
=========================================================
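In the log above, once training completes, rank 0 issues an AllGather at opCount 38b while ranks 1, 2 and 3 issue an AllReduce at the same opCount, which matches the NCCL warning about a mismatch in collective sizes. The sketch below is one way to spot this in a captured log; the regular expression is an assumption based on the line format shown here, and `mismatched_collectives` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Matches e.g. "[0] NCCL INFO AllGather: opCount 38b sendbuff ..."
LINE_RE = re.compile(r"\[(\d+)\] NCCL INFO (\w+): opCount ([0-9a-f]+) ")


def mismatched_collectives(lines: Iterable[str]) -> Dict[str, Dict[int, str]]:
    """Return {opCount: {rank: collective}} for opCounts where ranks disagree."""
    ops: Dict[str, Dict[int, str]] = defaultdict(dict)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            rank, collective, op_count = int(m.group(1)), m.group(2), m.group(3)
            ops[op_count][rank] = collective
    return {k: v for k, v in ops.items() if len(set(v.values())) > 1}


# Abbreviated lines from the log above (buffer addresses omitted).
sample = [
    "<merlin-training>:255199:255199 [0] NCCL INFO AllGather: opCount 38b sendbuff ...",
    "<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38b sendbuff ...",
    "<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38b sendbuff ...",
    "<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38b sendbuff ...",
]
print(mismatched_collectives(sample))
```

An opCount reported here means the ranks have diverged: some have left the training loop while others started another step.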
traceback for OOM, on max_steps training:
51%|█████ | 60300/117800 [1:25:43<1:14:17, 12.90it/s]<merlin-traiining>:313089:313394 [1] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7fa422ebec00 recvbuff 0x7fa422ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6bc094e0 [nranks=4] stream 0x6e78af10
<merlin-traiining>:313089:313394 [1] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7fa422a4ae00 recvbuff 0x7fa422a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6bc094e0 [nranks=4] stream 0x6e78af10
<merlin-traiining>:313089:313394 [1] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7fa6b7dff800 recvbuff 0x7fa6b7dff800 count 61 datatype 2 op 0 root 0 comm 0x6bc094e0 [nranks=4] stream 0x6e78af10
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7fa3c4ebec00 recvbuff 0x7fa3c4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7f4ef4ebec00 recvbuff 0x7f4ef4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7fb278ebec00 recvbuff 0x7fb278ebec00 count 497280 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7fa3c4a4ae00 recvbuff 0x7fa3c4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7fa661dff800 recvbuff 0x7fa661dff800 count 61 datatype 2 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7f4ef4a4ae00 recvbuff 0x7f4ef4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7f51935ff800 recvbuff 0x7f51935ff800 count 61 datatype 2 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7fb278a4ae00 recvbuff 0x7fb278a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7fb50fdff800 recvbuff 0x7fb50fdff800 count 61 datatype 2 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2ab sendbuff 0x7fa3c4ebec00 recvbuff 0x7fa3c4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2ab sendbuff 0x7f4ef4ebec00 recvbuff 0x7f4ef4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2ab sendbuff 0x7fb278ebec00 recvbuff 0x7fb278ebec00 count 497280 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2ac sendbuff 0x7fa3c4a4ae00 recvbuff 0x7fa3c4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2ad sendbuff 0x7fa661dff800 recvbuff 0x7fa661dff800 count 61 datatype 2 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2ac sendbuff 0x7f4ef4a4ae00 recvbuff 0x7f4ef4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2ad sendbuff 0x7f51935ff800 recvbuff 0x7f51935ff800 count 61 datatype 2 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2ac sendbuff 0x7fb278a4ae00 recvbuff 0x7fb278a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2ad sendbuff 0x7fb50fdff800 recvbuff 0x7fb50fdff800 count 61 datatype 2 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
batch = next(self._batch_itr)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data_ext/user-files/shivam.batra/merlin/all_feat_training_multi_row_part.py", line 275, in <module>
trainer.train()
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/torch.py", line 64, in __next__
converted_batch = self.convert_batch(super().__next__())
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 261, in __next__
return self._get_next_batch()
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 343, in _get_next_batch
self._fetch_chunk()
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 277, in _fetch_chunk
raise chunks
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 791, in load_chunks
self.chunk_logic(itr)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 770, in chunk_logic
chunks = shuffle_df(chunks, keep_index=True)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/io/shuffle.py", line 75, in shuffle_df
return df.sample(n=size, ignore_index=not keep_index)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 3285, in sample
return self._sample_axis_0(
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 3315, in _sample_axis_0
return self._gather(
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 1748, in _gather
libcudf.copying.gather(
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "copying.pyx", line 187, in cudf._lib.copying.gather
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
<merlin-traiining>:313089:313348 [1] NCCL INFO [Service thread] Connection closed by localRank 1
<merlin-traiining>:313089:313089 [1] NCCL INFO comm 0x6bc094e0 rank 1 nranks 4 cudaDev 1 busId 1c0 - Abort COMPLETE
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 313088 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 313090 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 313091 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 313089) of binary: /home/ubuntu/miniconda3/envs/merlin_env_2/bin/python
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/merlin_env_2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
all_feat_training_multi_row_part.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-28_02:24:03
host : <merlin-traiining>.ap-south-1.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 313089)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@shivamsbatra
script stuck in training loop if using pyspark generated data
We are currently looking into the issue of using multiple parquet files with multi-GPU.
Are you still using NVTabular to transform your data? If not, how did you create your schema file?
Yes, I am using NVTabular's re-written data for training now. Earlier I was creating the schema using the following logic:
import pickle
from typing import List
from merlin.schema import Schema, Tags
from merlin.io import Dataset
def get_schema(train_path: str, meta_path: str, selected_columns: List[str] = None) -> Schema:
    train_schema = Dataset(train_path).schema  # sample train file path
    # meta info for columns; pickle files must be opened in binary mode
    with open(meta_path, "rb") as f:
        col_meta = pickle.load(f)
    seq_len = col_meta["default"]["sequence_length"]  # sequence length of list columns
    if not selected_columns:
        SELECTED_COLS = []  # list of column names for training
    else:
        SELECTED_COLS = selected_columns
    col_schema_list = []
    for col, col_schema in train_schema.select_by_name(SELECTED_COLS).column_schemas.items():
        tag_list = []
        new_properties = {}
        if col_meta[col]["is_list"]:
            tag_list.append(Tags.LIST)
            new_properties["value_count"] = {
                "min": seq_len,
                "max": seq_len,
            }
        if col_meta[col]["is_categorical"]:
            tag_list.append(Tags.CATEGORICAL)
            new_properties["start_index"] = 1.0
            new_properties["domain"] = {
                "min": 0.0,
                "max": col_meta[col]["cardinality"],
                "name": col,
            }
        else:
            tag_list.append(Tags.CONTINUOUS)
        if col == "item_id":
            tag_list.extend([Tags.ITEM, Tags.ID, Tags.ITEM_ID])
        elif col == "session_id":
            tag_list.extend([Tags.SESSION, Tags.ID])
        col_schema_list.append(
            col_schema.with_properties(new_properties).with_tags(tag_list)
        )
    return Schema(col_schema_list)
❓ Questions & Help
I am trying to train transformers4rec on my own data, but it gets stuck near the end and then times out. I am running on 4 Tesla T4 GPUs. The code is pretty much the same as the examples; I just changed the data.
Details
Facing the following issue