NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[QST] Training model on custom data gets stuck near the end #726

Open Satwato opened 1 year ago

Satwato commented 1 year ago

❓ Questions & Help

I am trying to train Transformers4Rec on my own data, but training gets stuck near the end and then times out. I am running on 4 Tesla T4 GPUs. Code is pretty much the same as the examples; I just changed the data.

Details

I am facing the following issue:


[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1803, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807780 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1803, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807788 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1803, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807825 milliseconds before timing out.
finished
ip-:462258:462571 [0] NCCL INFO comm 0x6e95d6c0 rank 2 nranks 4 cudaDev 2 busId 1d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ip-:462256:462574 [0] NCCL INFO comm 0x6eb6e2d0 rank 0 nranks 4 cudaDev 0 busId 1b0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ip-:462259:462577 [0] NCCL INFO comm 0x6f74aa40 rank 3 nranks 4 cudaDev 3 busId 1e0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ip-:462257:462568 [0] NCCL INFO comm 0x6c1f0e50 rank 1 nranks 4 cudaDev 1 busId 1c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 462256) of binary: /home/ubuntu/miniconda3/envs/merlin_env_2/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=========================================================
all_feat_training_multi_row_part.py FAILED
---------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-26_13:50:18
  host      : ip.ap-south-1.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 462257)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462257
[2]:
  time      : 2023-06-26_13:50:18
  host      : ip.ap-south-1.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 462258)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462258
[3]:
  time      : 2023-06-26_13:50:18
  host      : ip-ap-south-1.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 462259)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462259
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-26_13:50:18
  host      : ip-.ap-south-1.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 462256)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 462256
=========================================================
rnyak commented 1 year ago

@Satwato hello. Can you give us more info about your dataset size? How many parquet files do you have, and are you training your model with a sliding-window approach as in our examples, meaning you split the data by day?

Code is pretty much the same as the examples.

Are you referencing the multi-GPU example?

Are you able to run your model on a single GPU without issues? You can test with a small custom dataset first.

It looks like there is a timeout arg; the default is 30 minutes (https://discuss.pytorch.org/t/multi-gpu-training-timeout-error-worknccl-optype-allgather-timeout/169435):

torch.distributed.init_process_group(backend=None, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None)

Can you increase that and test?
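For reference, a minimal sketch of raising that timeout, assuming the training script calls init_process_group itself rather than relying on the launcher's defaults (the two-hour value is just an example):

```python
import datetime

import torch.distributed as dist

# Raise the NCCL collective timeout from the 1800 s default to 2 hours.
# This must run before any other distributed setup in the script.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```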

Satwato commented 1 year ago

@rnyak Thanks for the reply. I tried various dataset sizes but get the same result. I am setting pretrained embeddings of size 128, and I am training on only a single day of data.

Yes, I am referencing the multi-GPU example. With the single-GPU example, the GPU memory keeps slowly increasing and then it goes OOM. I tried a 16 GB GPU (Tesla T4) and a 24 GB one (an A5000, I believe) but got the OOM error on both; decreasing the batch size just delayed the error. However, if I run the multi-GPU code with --nproc_per_node 1, everything works fine.

The timeout could be a factor; I will try that. But could it be that one of the GPUs is not returning anything or is going into a deadlock?

Satwato commented 1 year ago

Also, when using a single parquet file, training works with the multi-GPU setup, but when using 4 such parquet files it gets stuck again. A single parquet file is around 70 MB on disk with roughly 10,000 rows; when it is loaded to the GPU using cuDF, memory_usage shows it takes 1003 MB, but nvidia-smi shows 2.2 GB.
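For context, this is roughly how such a comparison can be made (the file name is a placeholder); nvidia-smi is expected to report more, since it counts everything the process holds on the device, not just the DataFrame:

```python
import cudf

# cuDF's own accounting of the DataFrame's device memory.
df = cudf.read_parquet("part-0.parquet")
print(df.memory_usage(deep=True).sum() / 1e6, "MB held by the DataFrame")

# nvidia-smi reports total GPU memory in use by the process, which also
# includes the CUDA context and any caching/pool allocator overhead.
```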

rnyak commented 1 year ago

@Satwato thanks for the additional info. Are you able to train your dataset with only a single 24 GB GPU? Is your intention to speed up training time?

but nvidia-smi shows 2.2 GB

Please note that importing torch itself also occupies GPU memory. You can visit this page for more details about memory consumption.

One thing we also noticed is that if the number of batches is not evenly distributed over the workers, the process can freeze. So can you set max_steps in tr.trainer.T4RecTrainingArguments(), remove num_epochs, and test multi-GPU training again?
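For illustration, a minimal sketch of that change, using the `tr` alias for `transformers4rec.torch` (the argument values are placeholders, not recommendations, and the rest of the existing arguments stay as they are):

```python
training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./ckpt_path",
    max_steps=10_000,                # cap training by steps instead of epochs
    dataloader_drop_last=True,       # avoid a ragged final batch across workers
    per_device_train_batch_size=16,
)
```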

Besides, if a single parquet file works on multi-GPU, can you proceed with that for now while we investigate this further?

thanks.

Satwato commented 1 year ago

@rnyak For a single file, it only works with a small file (70 MB), not my complete data, which would be on the order of hundreds of GB; it doesn't even work with a 500 MB file (OOM).

As for getting stuck in training (using epochs instead of max_steps): after rewriting the whole data using NVTabular, it somehow completes all the steps (which it was not able to do earlier), but it again gets stuck as soon as the training loop ends and a function called store_flos kicks in, which calls dist.all_gather internally.
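For illustration (a generic sketch, not the Trainer's actual code, assuming an already initialized NCCL process group): a collective such as dist.all_gather only completes when every rank calls it, so if one rank has already left the training loop, the remaining ranks block until the NCCL watchdog times out.

```python
import torch
import torch.distributed as dist

# Every rank must execute this; if any rank skips it (e.g. it exited the
# training loop early), the others block here until the watchdog fires.
flos = torch.tensor([1.0], device="cuda")
gathered = [torch.zeros_like(flos) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, flos)
```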

Using max_steps resolves the freezing issue, but training reaches OOM much faster.

rnyak commented 1 year ago

@Satwato Looks like it is stuck in the evaluation step? Can you share the error stack you are getting when you use max_steps?

after rewriting the whole data using nvtabular,

I am not sure why this was required; can you elaborate on that?

The OOM issue is something else; we can talk more once you share your full error message. What docker image are you using? Also, can you please share your training script here? Thanks.

shivamsbatra commented 1 year ago

Hi @rnyak, colleague of @Satwato here.

looks like it is stuck in the evaluation step?

It didn't get stuck in evaluation, as we had set the flags so that there would be no evaluation. Plus, using a debugger, we were able to see that the script gets stuck in the inner_training_loop function.

can you share the error stack you are getting when you use max_steps?

I wouldn't call it an error, but rather a memory leak that allows the model to run for a while (~30 minutes in our case) before it suddenly goes OOM.
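For what it's worth, a simple way to watch for that kind of creep between steps (purely illustrative; it only reflects memory owned by the PyTorch allocator, not cuDF/RMM):

```python
import torch

# Print the PyTorch caching allocator's view of GPU memory; calling this
# periodically (e.g. every N steps) shows whether usage keeps growing.
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved")
```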

As for:

after rewriting the whole data using nvtabular,

not sure why this was required?

We are creating the data using PySpark, which stores it in equally sized parquet partitions. Using that dataset, we were getting stuck in the training loop itself; only after rewriting the data using nvtabular.Dataset(<read_path>).to_parquet(<write_path>) and using the new data were we able to exit the train loop of the Hugging Face Transformers Trainer class, i.e. out of this loop #1881

Later, the script halts at #2070.
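For clarity, the rewrite step mentioned above amounts to something like this (paths are placeholders):

```python
from merlin.io import Dataset

# Read the PySpark-written parquet files and let Merlin rewrite them with
# its own partitioning and metadata.
Dataset("<read_path>", engine="parquet").to_parquet("<write_path>")
```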

what docker image are you using?

We are not using a docker image; we are using pip packages:

Hugging Face Transformers: 4.30.2
transformers4rec: 23.6.0
merlin-core: 23.6.0
merlin-dataloader: 23.6.0
nvtabular: 23.5.0
torch: 2.0.1

CUDA version: 11.8

GPU: 4 * Tesla-T4 (16GiB each)

can you please share your training script here?

This is the script we are using:

import os
import sys

os.environ["NCCL_DEBUG"]="INFO"
os.environ["NCCL_DEBUG_SUBSYS"]="ALL"
os.environ["TORCH_DISTRIBUTED_DEBUG"]="INFO"
os.environ["TRANSFORMERS_VERBOSITY"] = "debug"

import glob
import pickle
from typing import List, Optional, Union

import torch
import numpy as np
import pandas as pd

import cupy
from merlin.schema import Schema, Tags
from merlin.io import Dataset

from transformers4rec import torch as tr
from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt
from transformers4rec.torch.utils.examples_utils import wipe_memory
from transformers4rec.torch.utils.data_utils import MerlinDataLoader

rank = int(os.environ["LOCAL_RANK"])
cupy.cuda.Device(rank).use()
print(f"Using Rank {rank}")

train = Dataset("<INPUT_DATA_PATH>")
schema = train.schema
del train

def function_to_add_meta_info_in_schema(schema: Schema) -> Schema:
    # set tags
    # set properties (min, max, value_count, cardinality)
    pass

my_schema = function_to_add_meta_info_in_schema(schema)

max_seq_len = 30
cont_proj = 128
d_out = 128
aggr = "concat"
masking_type = "rtd"
emb_dims = {"item_id": 128}
infer_embedding_sizes = True

train_data_dir = "<TRAIN_DATA_PATH>"
eval_data_dir = "<EVAL_DATA_PATH>"

inputs = tr.TabularSequenceFeatures.from_schema(
    my_schema,
    max_sequence_length=max_seq_len,
    continuous_projection=cont_proj,
    aggregation=aggr,
    masking=masking_type,
    d_output=d_out,
    infer_embedding_sizes=infer_embedding_sizes,
    embedding_dims=emb_dims,
)

def set_pretrained_embeds(inputs):
    X = np.load("<EMBED_READ_PATH>")
    weight_dtype = inputs.categorical_module.embedding_tables[
        "item_id"
    ].weight.dtype
    pretrained_embeds = torch.tensor(X, device="cpu", dtype=weight_dtype)
    del X
    assert pretrained_embeds.shape == inputs.categorical_module.embedding_tables["item_id"].weight.shape
    with torch.no_grad():
        inputs.categorical_module.embedding_tables["item_id"].weight.copy_(
            pretrained_embeds
        )

    inputs.categorical_module.embedding_tables["item_id"].requires_grad = False
    inputs.categorical_module.embedding_tables[
        "item_id"
    ].weight.requires_grad = False

    return inputs

print("setting pre trained embedding")
inputs = set_pretrained_embeds(inputs)

try:
    batch_size = int(sys.argv[1])
except (IndexError, ValueError):
    batch_size = 16

# model config
num_transformer_heads = 4
num_transformer_layers = 2
mlp_units = [d_out]

# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_out,
    n_head=num_transformer_heads,
    n_layer=num_transformer_layers,
    total_seq_length=max_seq_len,
)

body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock(mlp_units),
    tr.TransformerBlock(transformer_config, masking=inputs.masking),
)

metrics = [
    NDCGAt(top_ks=[10, 20], labels_onehot=True),
    RecallAt(top_ks=[10, 20], labels_onehot=True),
]

# metrics = []

prediction_task = tr.NextItemPredictionTask(weight_tying=True, metrics=metrics)

head = tr.Head(
    body,
    prediction_task,
    inputs=inputs,
)

# Get the end-to-end model
# model = transformer_config.to_torch_model(inputs, prediction_task)
model = tr.Model(head)

training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./ckpt_path",
    max_sequence_length=max_seq_len,
    data_loader_engine="merlin",
    dataloader_pin_memory=False,
    dataloader_num_workers=30,
    logging_strategy="no",
    eval_accumulation_steps=None,
    max_steps=58900*4,
    dataloader_drop_last=True,
    per_device_train_batch_size=batch_size,
    weight_decay=1e-4,
    learning_rate=5e-4,
    fp16=True,
    report_to=[],
    no_cuda=False,
    local_rank=rank,
    evaluation_strategy="no",
    save_strategy="no",
    eval_steps_on_train_set=0,
)

train_paths = glob.glob(f"{train_data_dir}/*.parquet")
# eval_paths = glob.glob(f"{eval_data_dir}/*.parquet")
print("Num Training paths", len(train_paths))
# print("Num Eval paths", len(eval_data_dir))

if training_args.local_rank != -1:
    device = local_rank = training_args.local_rank
    global_size = training_args.world_size
else:
    device = local_rank = None
    global_size = None

train_loader = MerlinDataLoader.from_schema(
    my_schema,
    max_sequence_length=max_seq_len,
    paths_or_dataset=train_paths,
    batch_size=training_args.train_batch_size,
    drop_last=True,
    shuffle=True,
    reader_kwargs={"part_size": "300MB"},
    buffer_size=4,
    parts_per_chunk=2,
    row_groups_per_part=4,
    global_rank=local_rank,
    global_size=global_size,
    device=device,
)

# eval_loader = MerlinDataLoader.from_schema(
#     my_schema,
#     #         cpu=True,
#     max_sequence_length=max_seq_len,
#     paths_or_dataset=eval_paths,
#     batch_size=training_args.eval_batch_size,
#     drop_last=True,
#     shuffle=False,
#     reader_kwargs={"part_size": "300MB"},
#     buffer_size=4,
#     parts_per_chunk=2,
#     row_groups_per_part=None,
#     global_rank=local_rank,
#     global_size=global_size,
#     device = device,
# )

trainer = tr.Trainer(
    model=model,
    train_dataloader=train_loader,
    # eval_dataloader=eval_loader,
    args=training_args,
    schema=my_schema,
    compute_metrics=False,
)
print("Starting Training")
trainer.reset_lr_scheduler()
trainer.train()
trainer.state.global_step += 1
print("finished")
# trainer.evaluate()
# Save Model
model_path = "<MODEL_SAVE_PATH>"
# 
# trainer.save_model(model_path)
model.save(model_path)
# train_loader.dataset.stop()
wipe_memory()
shivamsbatra commented 1 year ago

In short:

1) We get OOM when setting batch_size > 4 for longer training durations, i.e. full-data training (4 when using multi-GPU, 12 when using a single GPU, for both epoch and max_steps training).
2) The script gets stuck in the training loop if using the PySpark-generated data (only on multi-GPU epoch training).
3) The script gets stuck in store_flos if using the NVTabular re-written data (only on multi-GPU epoch training).

Traceback using epochs, multi-GPU, with the re-written dataset:

100%|█████████▉| 300/301 [01:02<00:00,  7.63it/s]<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 388 sendbuff 0x7f5e0eebec00 recvbuff 0x7f5e0eebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 388 sendbuff 0x7f158eebec00 recvbuff 0x7f158eebec00 count 497280 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255199:255540 [0] NCCL INFO AllReduce: opCount 388 sendbuff 0x7fe806ebec00 recvbuff 0x7fe806ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 388 sendbuff 0x7f9b38ebec00 recvbuff 0x7f9b38ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 389 sendbuff 0x7f5e0ea4ae00 recvbuff 0x7f5e0ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 389 sendbuff 0x7f158ea4ae00 recvbuff 0x7f158ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38a sendbuff 0x7f6087dff800 recvbuff 0x7f6087dff800 count 61 datatype 2 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38a sendbuff 0x7f180bdff800 recvbuff 0x7f180bdff800 count 61 datatype 2 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255199:255540 [0] NCCL INFO AllReduce: opCount 389 sendbuff 0x7fe806a4ae00 recvbuff 0x7fe806a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255199:255540 [0] NCCL INFO AllReduce: opCount 38a sendbuff 0x7fea835ff800 recvbuff 0x7fea835ff800 count 61 datatype 2 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 389 sendbuff 0x7f9b38a4ae00 recvbuff 0x7f9b38a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38a sendbuff 0x7f9db3dff800 recvbuff 0x7f9db3dff800 count 61 datatype 2 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0

100%|██████████| 301/301 [01:02<00:00,  7.59it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)

<merlin-training>:255199:255199 [0] NCCL INFO AllGather: opCount 38b sendbuff 0x7fe6058d9800 recvbuff 0x7fe6641ff600 count 4 datatype 0 op 0 root 0 comm 0x6e33aa50 [nranks=4] stream 0x6dfaf5b0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38b sendbuff 0x7f5e0eebec00 recvbuff 0x7f5e0eebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38b sendbuff 0x7f158eebec00 recvbuff 0x7f158eebec00 count 497280 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38b sendbuff 0x7f9b38ebec00 recvbuff 0x7f9b38ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38c sendbuff 0x7f5e0ea4ae00 recvbuff 0x7f5e0ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38c sendbuff 0x7f158ea4ae00 recvbuff 0x7f158ea4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255202:255555 [3] NCCL INFO AllReduce: opCount 38d sendbuff 0x7f6087dff800 recvbuff 0x7f6087dff800 count 61 datatype 2 op 0 root 0 comm 0x6e665910 [nranks=4] stream 0x6e2ae1f0
<merlin-training>:255201:255550 [2] NCCL INFO AllReduce: opCount 38d sendbuff 0x7f180bdff800 recvbuff 0x7f180bdff800 count 61 datatype 2 op 0 root 0 comm 0x70134010 [nranks=4] stream 0x6fd7c7b0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38c sendbuff 0x7f9b38a4ae00 recvbuff 0x7f9b38a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0
<merlin-training>:255200:255546 [1] NCCL INFO AllReduce: opCount 38d sendbuff 0x7f9db3dff800 recvbuff 0x7f9db3dff800 count 61 datatype 2 op 0 root 0 comm 0x6de994f0 [nranks=4] stream 0x6dae1ca0

<merlin-training>:255199:255476 [0] transport/net_socket.cc:505 NCCL WARN NET/Socket : peer 10.10.38.252<47552> message truncated : receiving 124928 bytes instead of 65536. If you believe your socket network is in healthy state,           there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
<merlin-training>:255199:255476 [0] NCCL INFO include/net.h:35 -> 5
<merlin-training>:255199:255476 [0] NCCL INFO transport/net.cc:1034 -> 5
<merlin-training>:255199:255476 [0] NCCL INFO proxy.cc:520 -> 5
<merlin-training>:255199:255476 [0] NCCL INFO proxy.cc:684 -> 5 [Proxy Thread]
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

<merlin-training>:255200:255473 [1] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer <merlin-training>.ap-south-1.compute.internal<40048>
<merlin-training>:255200:255473 [1] NCCL INFO transport/net_socket.cc:493 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO include/net.h:35 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO transport/net.cc:1034 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO proxy.cc:520 -> 6
<merlin-training>:255200:255473 [1] NCCL INFO proxy.cc:684 -> 6 [Proxy Thread]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255201 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255202 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 255199) of binary: /home/ubuntu/miniconda3/envs/merlin_env_2/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=========================================================
all_feat_training_multi_row_part.py FAILED
---------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-27_20:36:57
  host      : <merlin-training>.ap-south-1.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 255199)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 255199
=========================================================
shivamsbatra commented 1 year ago

Traceback for the OOM, with max_steps training:


 51%|█████     | 60300/117800 [1:25:43<1:14:17, 12.90it/s]<merlin-traiining>:313089:313394 [1] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7fa422ebec00 recvbuff 0x7fa422ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6bc094e0 [nranks=4] stream 0x6e78af10
<merlin-traiining>:313089:313394 [1] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7fa422a4ae00 recvbuff 0x7fa422a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6bc094e0 [nranks=4] stream 0x6e78af10
<merlin-traiining>:313089:313394 [1] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7fa6b7dff800 recvbuff 0x7fa6b7dff800 count 61 datatype 2 op 0 root 0 comm 0x6bc094e0 [nranks=4] stream 0x6e78af10
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7fa3c4ebec00 recvbuff 0x7fa3c4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7f4ef4ebec00 recvbuff 0x7f4ef4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2a8 sendbuff 0x7fb278ebec00 recvbuff 0x7fb278ebec00 count 497280 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7fa3c4a4ae00 recvbuff 0x7fa3c4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7fa661dff800 recvbuff 0x7fa661dff800 count 61 datatype 2 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7f4ef4a4ae00 recvbuff 0x7f4ef4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7f51935ff800 recvbuff 0x7f51935ff800 count 61 datatype 2 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2a9 sendbuff 0x7fb278a4ae00 recvbuff 0x7fb278a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2aa sendbuff 0x7fb50fdff800 recvbuff 0x7fb50fdff800 count 61 datatype 2 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2ab sendbuff 0x7fa3c4ebec00 recvbuff 0x7fa3c4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2ab sendbuff 0x7f4ef4ebec00 recvbuff 0x7f4ef4ebec00 count 497280 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2ab sendbuff 0x7fb278ebec00 recvbuff 0x7fb278ebec00 count 497280 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2ac sendbuff 0x7fa3c4a4ae00 recvbuff 0x7fa3c4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313090:313398 [2] NCCL INFO AllReduce: opCount 2c2ad sendbuff 0x7fa661dff800 recvbuff 0x7fa661dff800 count 61 datatype 2 op 0 root 0 comm 0x6e322200 [nranks=4] stream 0x6bd06590
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2ac sendbuff 0x7f4ef4a4ae00 recvbuff 0x7f4ef4a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313088:313392 [0] NCCL INFO AllReduce: opCount 2c2ad sendbuff 0x7f51935ff800 recvbuff 0x7f51935ff800 count 61 datatype 2 op 0 root 0 comm 0x6ad521a0 [nranks=4] stream 0x6e6a1730
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2ac sendbuff 0x7fb278a4ae00 recvbuff 0x7fb278a4ae00 count 270839 datatype 7 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
<merlin-traiining>:313091:313403 [3] NCCL INFO AllReduce: opCount 2c2ad sendbuff 0x7fb50fdff800 recvbuff 0x7fb50fdff800 count 61 datatype 2 op 0 root 0 comm 0x70c33e90 [nranks=4] stream 0x718e34a0
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
    batch = next(self._batch_itr)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data_ext/user-files/shivam.batra/merlin/all_feat_training_multi_row_part.py", line 275, in <module>
    trainer.train()
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/torch.py", line 64, in __next__
    converted_batch = self.convert_batch(super().__next__())
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 343, in _get_next_batch
    self._fetch_chunk()
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 277, in _fetch_chunk
    raise chunks
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 791, in load_chunks
    self.chunk_logic(itr)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/dataloader/loader_base.py", line 770, in chunk_logic
    chunks = shuffle_df(chunks, keep_index=True)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/merlin/io/shuffle.py", line 75, in shuffle_df
    return df.sample(n=size, ignore_index=not keep_index)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 3285, in sample
    return self._sample_axis_0(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 3315, in _sample_axis_0
    return self._gather(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 1748, in _gather
    libcudf.copying.gather(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "copying.pyx", line 187, in cudf._lib.copying.gather
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
<merlin-traiining>:313089:313348 [1] NCCL INFO [Service thread] Connection closed by localRank 1
<merlin-traiining>:313089:313089 [1] NCCL INFO comm 0x6bc094e0 rank 1 nranks 4 cudaDev 1 busId 1c0 - Abort COMPLETE
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 313088 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 313090 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 313091 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 313089) of binary: /home/ubuntu/miniconda3/envs/merlin_env_2/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/merlin_env_2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
all_feat_training_multi_row_part.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-28_02:24:03
  host      : <merlin-traiining>.ap-south-1.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 313089)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
rnyak commented 1 year ago

@shivamsbatra

script stuck in the training loop if using PySpark-generated data

We are currently looking into the issue of using multiple parquet files with multi-GPU.

shivamsbatra commented 1 year ago

are you still using NVTabular to transform your data? If NO, how did you create your schema file?

Yes, I am using NVTabular's re-written data for training now.

Earlier, I was creating the schema using the following logic:


import pickle
from typing import List, Optional
from merlin.schema import Schema, Tags
from merlin.io import Dataset

def get_schema(train_path: str, meta_path: str, selected_columns: Optional[List[str]] = None) -> Schema:

    train_schema = Dataset(train_path).schema # sample train file path
    col_meta = pickle.load(open(meta_path, "rb"))  # meta info for columns
    seq_len = col_meta["default"]["sequence_length"] # sequence length of list columns
    if not selected_columns:
        SELECTED_COLS = [] # list of column names for training
    else:
        SELECTED_COLS = selected_columns
    col_schema_list = []
    for col, col_schema in train_schema.select_by_name(SELECTED_COLS).column_schemas.items():
        tag_list = []
        new_properties = {}
        if col_meta[col]["is_list"]:
            tag_list.append(Tags.LIST)
            new_properties["value_count"] = {
                "min": seq_len,
                "max": seq_len,
            }

        if col_meta[col]["is_categorical"]:
            tag_list.append(Tags.CATEGORICAL)

            new_properties["start_index"] = 1.0
            new_properties["domain"] = {
                "min": 0.0,
                "max": col_meta[col]["cardinality"],
                "name": col,
            }
        else:
            tag_list.append(Tags.CONTINUOUS)

        if col == "item_id":
            tag_list.extend([Tags.ITEM, Tags.ID, Tags.ITEM_ID])
        elif col == "session_id":
            tag_list.extend([Tags.SESSION, Tags.ID])

        col_schema_list.append(
            col_schema.with_properties(new_properties).with_tags(tag_list)
        )

    return Schema(col_schema_list)
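And a hypothetical call to the helper above, with placeholder paths and column names:

```python
my_schema = get_schema(
    train_path="<TRAIN_DATA_PATH>",   # any sample training parquet path
    meta_path="<META_PATH>",          # pickled per-column metadata
    selected_columns=["item_id", "session_id"],
)
```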