OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Loss is nan, stopping training, while trying to reproduce alpaca_finetuning_v1 results. #144

NavaneethNidadavolu commented 9 months ago

I'm using 2 NVIDIA GeForce RTX 3090 GPUs (24576 MiB of memory each).

I'm trying to fine-tune the model with the alpaca_data provided to replicate the results of this paper. I've set the batch size to 2 because 4 gives me this error:

torch.cuda.OutOfMemoryError: CUDA out of memory.

Command:

!OMP_NUM_THREADS=8 torchrun --nproc_per_node 2 finetuning.py \
    --model Llama7B_adapter \
    --llama_model_path /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/ \
    --data_path ./alpaca_data.json \
    --adapter_layer 30 \
    --adapter_len 10 \
    --max_seq_len 512 \
    --batch_size 2 \
    --epochs 5 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./checkpoint/
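
As a side note: if the OOM at batch size 4 was the only blocker, the accum_iter option visible in the Namespace dump below suggests gradient accumulation is available. Rerunning with --batch_size 2 --accum_iter 2 (all other flags unchanged) should restore the effective batch size of 8, assuming accum_iter behaves like the usual MAE-style accumulation flag; I haven't verified that.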

Logs:

| distributed init (rank 0): env://, gpu 0
| distributed init (rank 1): env://, gpu 1
[17:10:32.992362] job dir: /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/alpaca_finetuning_v1
[17:10:32.992448] Namespace(batch_size=2,
epochs=5,
accum_iter=1,
llama_model_path='/home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/',
model='Llama7B_adapter',
adapter_layer=30,
adapter_len=10,
max_seq_len=512,
weight_decay=0.02,
lr=None,
blr=0.009,
min_lr=0.0,
warmup_epochs=2,
data_path='./alpaca_data.json',
output_dir='./checkpoint/',
log_dir='./output_dir',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=10,
pin_mem=True,
world_size=2,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
rank=0,
gpu=0,
distributed=True,
dist_backend='nccl')
[17:10:33.198872] <__main__.InstructionDataset object at 0x7fdcb9bd85e0>
[17:10:33.198914] <__main__.InstructionDataset object at 0x7fdc2dab6da0>
[17:10:33.198952] Sampler_train = <torch.utils.data.distributed.DistributedSampler object at 0x7fdc2dab6d70>
[17:10:39.061337] /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/7B/consolidated.00.pth
/home/navaneeth/.local/lib/python3.10/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/home/navaneeth/.local/lib/python3.10/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
[17:10:44.010087] Model = Transformer(
  (tok_embeddings): Embedding(32000, 4096)
  (adapter_query): Embedding(300, 4096)
  (criterion): CrossEntropyLoss()
  (layers): ModuleList(
    (0-31): 32 x TransformerBlock(
      (attention): Attention(
        (wq): Linear(in_features=4096, out_features=4096, bias=False)
        (wk): Linear(in_features=4096, out_features=4096, bias=False)
        (wv): Linear(in_features=4096, out_features=4096, bias=False)
        (wo): Linear(in_features=4096, out_features=4096, bias=False)
      )
      (feed_forward): FeedForward(
        (w1): Linear(in_features=4096, out_features=11008, bias=False)
        (w2): Linear(in_features=11008, out_features=4096, bias=False)
        (w3): Linear(in_features=4096, out_features=11008, bias=False)
      )
      (attention_norm): RMSNorm()
      (ffn_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): Linear(in_features=4096, out_features=32000, bias=False)
)
[17:10:44.010150] base lr: 9.00e-03
[17:10:44.010164] actual lr: 1.41e-04
[17:10:44.010175] accumulate grad iterations: 1
[17:10:44.010185] effective batch size: 4
[17:10:46.081588] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.000140625
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.000140625
    maximize: False
    weight_decay: 0.02
)
[17:10:46.081681] Start training for 5 epochs
[17:10:46.082621] log_dir: ./output_dir
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[17:10:47.481772] Epoch: [0]  [    0/13000]  eta: 5:03:00  lr: 0.000000  closs: 1.5186 (1.5186)  time: 1.3985  data: 0.3370  max mem: 20596
[17:10:53.607369] Epoch: [0]  [   10/13000]  eta: 2:28:04  lr: 0.000000  closs: 1.3838 (1.5683)  time: 0.6840  data: 0.0307  max mem: 20618
[17:10:59.724177] Epoch: [0]  [   20/13000]  eta: 2:20:30  lr: 0.000000  closs: 1.3838 (1.6151)  time: 0.6121  data: 0.0001  max mem: 20618
[17:11:05.859398] Epoch: [0]  [   30/13000]  eta: 2:17:53  lr: 0.000000  closs: 1.5078 (1.6011)  time: 0.6125  data: 0.0001  max mem: 20618
[17:11:12.016458] Epoch: [0]  [   40/13000]  eta: 2:16:36  lr: 0.000000  closs: 1.5508 (1.6249)  time: 0.6146  data: 0.0001  max mem: 20618
[17:11:18.185472] Epoch: [0]  [   50/13000]  eta: 2:15:50  lr: 0.000000  closs: 1.5430 (1.6123)  time: 0.6163  data: 0.0001  max mem: 20618
[17:11:24.369595] Epoch: [0]  [   60/13000]  eta: 2:15:21  lr: 0.000000  closs: 1.4756 (1.6185)  time: 0.6176  data: 0.0001  max mem: 20618
[17:11:30.565958] Epoch: [0]  [   70/13000]  eta: 2:15:00  lr: 0.000000  closs: 1.4609 (1.5902)  time: 0.6190  data: 0.0001  max mem: 20618
[17:11:36.784317] Epoch: [0]  [   80/13000]  eta: 2:14:46  lr: 0.000000  closs: 1.4434 (1.5902)  time: 0.6207  data: 0.0001  max mem: 20618
[17:11:43.015473] Epoch: [0]  [   90/13000]  eta: 2:14:36  lr: 0.000000  closs: 1.5264 (1.5991)  time: 0.6224  data: 0.0001  max mem: 20618
[17:11:49.250349] Epoch: [0]  [  100/13000]  eta: 2:14:27  lr: 0.000001  closs: 1.5889 (1.6096)  time: 0.6232  data: 0.0001  max mem: 20618
[17:11:55.489798] Epoch: [0]  [  110/13000]  eta: 2:14:19  lr: 0.000001  closs: 1.6016 (1.6066)  time: 0.6237  data: 0.0001  max mem: 20618
[17:12:01.741387] Epoch: [0]  [  120/13000]  eta: 2:14:12  lr: 0.000001  closs: 1.4668 (1.6057)  time: 0.6245  data: 0.0001  max mem: 20618
[17:12:07.992549] Epoch: [0]  [  130/13000]  eta: 2:14:06  lr: 0.000001  closs: 1.5322 (1.6070)  time: 0.6251  data: 0.0001  max mem: 20618
[17:12:14.251213] Epoch: [0]  [  140/13000]  eta: 2:14:00  lr: 0.000001  closs: 1.5322 (1.6222)  time: 0.6254  data: 0.0001  max mem: 20618
[17:12:20.515632] Epoch: [0]  [  150/13000]  eta: 2:13:55  lr: 0.000001  closs: 1.5615 (1.6169)  time: 0.6261  data: 0.0001  max mem: 20618
[17:12:26.784347] Epoch: [0]  [  160/13000]  eta: 2:13:50  lr: 0.000001  closs: 1.3867 (1.6352)  time: 0.6266  data: 0.0001  max mem: 20618
[17:12:33.056327] Epoch: [0]  [  170/13000]  eta: 2:13:45  lr: 0.000001  closs: 1.3955 (1.6270)  time: 0.6270  data: 0.0001  max mem: 20618
[17:12:39.323410] Epoch: [0]  [  180/13000]  eta: 2:13:39  lr: 0.000001  closs: 1.4873 (1.6272)  time: 0.6269  data: 0.0001  max mem: 20618
[17:12:45.595449] Epoch: [0]  [  190/13000]  eta: 2:13:34  lr: 0.000001  closs: 1.3467 (1.6120)  time: 0.6269  data: 0.0001  max mem: 20618
[17:12:51.868360] Epoch: [0]  [  200/13000]  eta: 2:13:29  lr: 0.000001  closs: 1.3467 (1.6073)  time: 0.6272  data: 0.0001  max mem: 20618
[17:12:58.145625] Epoch: [0]  [  210/13000]  eta: 2:13:24  lr: 0.000001  closs: 1.3662 (1.6042)  time: 0.6275  data: 0.0001  max mem: 20618
[17:13:04.422560] Epoch: [0]  [  220/13000]  eta: 2:13:19  lr: 0.000001  closs: 1.4150 (1.5964)  time: 0.6277  data: 0.0001  max mem: 20618
[17:13:10.699249] Epoch: [0]  [  230/13000]  eta: 2:13:13  lr: 0.000001  closs: 1.4043 (1.6040)  time: 0.6276  data: 0.0001  max mem: 20618
[17:13:16.974919] Epoch: [0]  [  240/13000]  eta: 2:13:08  lr: 0.000001  closs: 1.4766 (1.6143)  time: 0.6276  data: 0.0001  max mem: 20618
[17:13:23.254382] Epoch: [0]  [  250/13000]  eta: 2:13:03  lr: 0.000001  closs: 1.6289 (1.6176)  time: 0.6277  data: 0.0001  max mem: 20618
[17:13:29.533807] Epoch: [0]  [  260/13000]  eta: 2:12:57  lr: 0.000001  closs: 1.4199 (1.6184)  time: 0.6279  data: 0.0001  max mem: 20618
[17:13:35.815718] Epoch: [0]  [  270/13000]  eta: 2:12:52  lr: 0.000001  closs: 1.3975 (1.6182)  time: 0.6280  data: 0.0001  max mem: 20618
[17:13:42.094205] Epoch: [0]  [  280/13000]  eta: 2:12:46  lr: 0.000002  closs: 1.5469 (1.6180)  time: 0.6280  data: 0.0001  max mem: 20618
[17:13:48.380554] Epoch: [0]  [  290/13000]  eta: 2:12:41  lr: 0.000002  closs: 1.5586 (1.6233)  time: 0.6282  data: 0.0001  max mem: 20618
[17:13:54.670131] Epoch: [0]  [  300/13000]  eta: 2:12:36  lr: 0.000002  closs: 1.3896 (1.6176)  time: 0.6287  data: 0.0001  max mem: 20618
[17:13:59.988396] Loss is nan, stopping training -> train_one_epoch
[2024-02-07 17:14:05,032] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 130839 closing signal SIGTERM
[2024-02-07 17:14:05,146] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 130838) of binary: /home/navaneeth/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/navaneeth/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetuning.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-07_17:14:05
  host      : sjsu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 130838)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
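
For reference, the "base lr" / "actual lr" lines in the log are consistent with the linear lr scaling rule used in MAE-style training code, which this script appears to follow (a sketch of the arithmetic only, not copied from the repo):

    # Linear lr scaling rule (assumed convention; the values match the log).
    batch_size, accum_iter, world_size = 2, 1, 2
    eff_batch_size = batch_size * accum_iter * world_size  # 4 -> "effective batch size: 4"
    blr = 9e-3
    actual_lr = blr * eff_batch_size / 256                 # 0.000140625 -> "actual lr: 1.41e-04"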

What can I do to solve this?
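
In case it helps narrow this down: the set_default_tensor_type warnings above suggest the model runs in half precision, and fp16 overflow is a common source of nan. A forward-hook check like the following should flag the first module that produces a non-finite value, instead of only seeing "Loss is nan" at the end (a sketch; the helper is mine, not from the repo's code):

    import torch

    def add_nan_hooks(model: torch.nn.Module):
        # Raise as soon as any module's output contains inf/nan,
        # naming the offending layer (e.g. layers.30.attention).
        def make_hook(name):
            def hook(module, inputs, output):
                if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                    raise RuntimeError(f"non-finite output in {name}")
            return hook
        for name, module in model.named_modules():
            module.register_forward_hook(make_hook(name))

    # PyTorch's built-in anomaly mode is an alternative (slow; debugging only):
    # torch.autograd.set_detect_anomaly(True)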

EmilyGirl commented 7 months ago

Did you solve it? If so, how?