NavaneethNidadavolu opened 9 months ago
I'm using 2 NVIDIA GeForce RTX 3090 GPUs. (Memory: 24576MiB each)
I'm trying to fine-tune the model with the provided alpaca_data to replicate the results of this paper. I've set the batch size to 2 because 4 gives me this error:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Command:
!OMP_NUM_THREADS=8 torchrun --nproc_per_node 2 finetuning.py \
    --model Llama7B_adapter \
    --llama_model_path /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/ \
    --data_path ./alpaca_data.json \
    --adapter_layer 30 \
    --adapter_len 10 \
    --max_seq_len 512 \
    --batch_size 2 \
    --epochs 5 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./checkpoint/
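For reference, the "effective batch size: 4" reported in the logs below follows directly from the per-GPU batch size, the gradient-accumulation steps, and the world size. A minimal sketch of that arithmetic (the accum_iter value is taken from the Namespace dump below; whether a larger effective batch is needed to match the paper's setting is an assumption, but raising --accum_iter does not increase per-GPU memory):

def effective_batch_size(per_gpu_batch_size: int, accum_iter: int, world_size: int) -> int:
    # Global batch size contributing to each optimizer update.
    return per_gpu_batch_size * accum_iter * world_size

print(effective_batch_size(2, 1, 2))  # 4 -> matches "effective batch size: 4" in the log below
print(effective_batch_size(2, 2, 2))  # 8 -> same per-GPU memory, larger effective batch via --accum_iter 2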
Logs:
| distributed init (rank 0): env://, gpu 0
| distributed init (rank 1): env://, gpu 1
[17:10:32.992362] job dir: /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/alpaca_finetuning_v1
[17:10:32.992448] Namespace(batch_size=2, epochs=5, accum_iter=1, llama_model_path='/home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/', model='Llama7B_adapter', adapter_layer=30, adapter_len=10, max_seq_len=512, weight_decay=0.02, lr=None, blr=0.009, min_lr=0.0, warmup_epochs=2, data_path='./alpaca_data.json', output_dir='./checkpoint/', log_dir='./output_dir', device='cuda', seed=0, resume='', start_epoch=0, num_workers=10, pin_mem=True, world_size=2, local_rank=-1, dist_on_itp=False, dist_url='env://', rank=0, gpu=0, distributed=True, dist_backend='nccl')
[17:10:33.198872] <__main__.InstructionDataset object at 0x7fdcb9bd85e0>
[17:10:33.198914] <__main__.InstructionDataset object at 0x7fdc2dab6da0>
[17:10:33.198952] Sampler_train = <torch.utils.data.distributed.DistributedSampler object at 0x7fdc2dab6d70>
[17:10:39.061337] /home/navaneeth/workspace/jupyter-workspace/LLaMA-Adapter/TARGET/7B/consolidated.00.pth
/home/navaneeth/.local/lib/python3.10/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/home/navaneeth/.local/lib/python3.10/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
[17:10:44.010087] Model = Transformer(
  (tok_embeddings): Embedding(32000, 4096)
  (adapter_query): Embedding(300, 4096)
  (criterion): CrossEntropyLoss()
  (layers): ModuleList(
    (0-31): 32 x TransformerBlock(
      (attention): Attention(
        (wq): Linear(in_features=4096, out_features=4096, bias=False)
        (wk): Linear(in_features=4096, out_features=4096, bias=False)
        (wv): Linear(in_features=4096, out_features=4096, bias=False)
        (wo): Linear(in_features=4096, out_features=4096, bias=False)
      )
      (feed_forward): FeedForward(
        (w1): Linear(in_features=4096, out_features=11008, bias=False)
        (w2): Linear(in_features=11008, out_features=4096, bias=False)
        (w3): Linear(in_features=4096, out_features=11008, bias=False)
      )
      (attention_norm): RMSNorm()
      (ffn_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): Linear(in_features=4096, out_features=32000, bias=False)
)
[17:10:44.010150] base lr: 9.00e-03
[17:10:44.010164] actual lr: 1.41e-04
[17:10:44.010175] accumulate grad iterations: 1
[17:10:44.010185] effective batch size: 4
[17:10:46.081588] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.000140625
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.000140625
    maximize: False
    weight_decay: 0.02
)
[17:10:46.081681] Start training for 5 epochs
[17:10:46.082621] log_dir: ./output_dir
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[17:10:47.481772] Epoch: [0] [ 0/13000] eta: 5:03:00 lr: 0.000000 closs: 1.5186 (1.5186) time: 1.3985 data: 0.3370 max mem: 20596
[17:10:53.607369] Epoch: [0] [ 10/13000] eta: 2:28:04 lr: 0.000000 closs: 1.3838 (1.5683) time: 0.6840 data: 0.0307 max mem: 20618
[17:10:59.724177] Epoch: [0] [ 20/13000] eta: 2:20:30 lr: 0.000000 closs: 1.3838 (1.6151) time: 0.6121 data: 0.0001 max mem: 20618
[17:11:05.859398] Epoch: [0] [ 30/13000] eta: 2:17:53 lr: 0.000000 closs: 1.5078 (1.6011) time: 0.6125 data: 0.0001 max mem: 20618
[17:11:12.016458] Epoch: [0] [ 40/13000] eta: 2:16:36 lr: 0.000000 closs: 1.5508 (1.6249) time: 0.6146 data: 0.0001 max mem: 20618
[17:11:18.185472] Epoch: [0] [ 50/13000] eta: 2:15:50 lr: 0.000000 closs: 1.5430 (1.6123) time: 0.6163 data: 0.0001 max mem: 20618
[17:11:24.369595] Epoch: [0] [ 60/13000] eta: 2:15:21 lr: 0.000000 closs: 1.4756 (1.6185) time: 0.6176 data: 0.0001 max mem: 20618
[17:11:30.565958] Epoch: [0] [ 70/13000] eta: 2:15:00 lr: 0.000000 closs: 1.4609 (1.5902) time: 0.6190 data: 0.0001 max mem: 20618
[17:11:36.784317] Epoch: [0] [ 80/13000] eta: 2:14:46 lr: 0.000000 closs: 1.4434 (1.5902) time: 0.6207 data: 0.0001 max mem: 20618
[17:11:43.015473] Epoch: [0] [ 90/13000] eta: 2:14:36 lr: 0.000000 closs: 1.5264 (1.5991) time: 0.6224 data: 0.0001 max mem: 20618
[17:11:49.250349] Epoch: [0] [ 100/13000] eta: 2:14:27 lr: 0.000001 closs: 1.5889 (1.6096) time: 0.6232 data: 0.0001 max mem: 20618
[17:11:55.489798] Epoch: [0] [ 110/13000] eta: 2:14:19 lr: 0.000001 closs: 1.6016 (1.6066) time: 0.6237 data: 0.0001 max mem: 20618
[17:12:01.741387] Epoch: [0] [ 120/13000] eta: 2:14:12 lr: 0.000001 closs: 1.4668 (1.6057) time: 0.6245 data: 0.0001 max mem: 20618
[17:12:07.992549] Epoch: [0] [ 130/13000] eta: 2:14:06 lr: 0.000001 closs: 1.5322 (1.6070) time: 0.6251 data: 0.0001 max mem: 20618
[17:12:14.251213] Epoch: [0] [ 140/13000] eta: 2:14:00 lr: 0.000001 closs: 1.5322 (1.6222) time: 0.6254 data: 0.0001 max mem: 20618
[17:12:20.515632] Epoch: [0] [ 150/13000] eta: 2:13:55 lr: 0.000001 closs: 1.5615 (1.6169) time: 0.6261 data: 0.0001 max mem: 20618
[17:12:26.784347] Epoch: [0] [ 160/13000] eta: 2:13:50 lr: 0.000001 closs: 1.3867 (1.6352) time: 0.6266 data: 0.0001 max mem: 20618
[17:12:33.056327] Epoch: [0] [ 170/13000] eta: 2:13:45 lr: 0.000001 closs: 1.3955 (1.6270) time: 0.6270 data: 0.0001 max mem: 20618
[17:12:39.323410] Epoch: [0] [ 180/13000] eta: 2:13:39 lr: 0.000001 closs: 1.4873 (1.6272) time: 0.6269 data: 0.0001 max mem: 20618
[17:12:45.595449] Epoch: [0] [ 190/13000] eta: 2:13:34 lr: 0.000001 closs: 1.3467 (1.6120) time: 0.6269 data: 0.0001 max mem: 20618
[17:12:51.868360] Epoch: [0] [ 200/13000] eta: 2:13:29 lr: 0.000001 closs: 1.3467 (1.6073) time: 0.6272 data: 0.0001 max mem: 20618
[17:12:58.145625] Epoch: [0] [ 210/13000] eta: 2:13:24 lr: 0.000001 closs: 1.3662 (1.6042) time: 0.6275 data: 0.0001 max mem: 20618
[17:13:04.422560] Epoch: [0] [ 220/13000] eta: 2:13:19 lr: 0.000001 closs: 1.4150 (1.5964) time: 0.6277 data: 0.0001 max mem: 20618
[17:13:10.699249] Epoch: [0] [ 230/13000] eta: 2:13:13 lr: 0.000001 closs: 1.4043 (1.6040) time: 0.6276 data: 0.0001 max mem: 20618
[17:13:16.974919] Epoch: [0] [ 240/13000] eta: 2:13:08 lr: 0.000001 closs: 1.4766 (1.6143) time: 0.6276 data: 0.0001 max mem: 20618
[17:13:23.254382] Epoch: [0] [ 250/13000] eta: 2:13:03 lr: 0.000001 closs: 1.6289 (1.6176) time: 0.6277 data: 0.0001 max mem: 20618
[17:13:29.533807] Epoch: [0] [ 260/13000] eta: 2:12:57 lr: 0.000001 closs: 1.4199 (1.6184) time: 0.6279 data: 0.0001 max mem: 20618
[17:13:35.815718] Epoch: [0] [ 270/13000] eta: 2:12:52 lr: 0.000001 closs: 1.3975 (1.6182) time: 0.6280 data: 0.0001 max mem: 20618
[17:13:42.094205] Epoch: [0] [ 280/13000] eta: 2:12:46 lr: 0.000002 closs: 1.5469 (1.6180) time: 0.6280 data: 0.0001 max mem: 20618
[17:13:48.380554] Epoch: [0] [ 290/13000] eta: 2:12:41 lr: 0.000002 closs: 1.5586 (1.6233) time: 0.6282 data: 0.0001 max mem: 20618
[17:13:54.670131] Epoch: [0] [ 300/13000] eta: 2:12:36 lr: 0.000002 closs: 1.3896 (1.6176) time: 0.6287 data: 0.0001 max mem: 20618
[17:13:59.988396] Loss is nan, stopping training -> train_one_epoch
[2024-02-07 17:14:05,032] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 130839 closing signal SIGTERM
[2024-02-07 17:14:05,146] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 130838) of binary: /home/navaneeth/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/navaneeth/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/navaneeth/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetuning.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-07_17:14:05
  host      : sjsu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 130838)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
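The run stops at the "Loss is nan, stopping training" guard rather than at a real crash, so the first step is usually to find where the NaN appears. A generic debugging sketch, not code from finetuning.py (model, loader, and optimizer are placeholders, and the forward signature is assumed from the fact that the criterion lives inside the printed Transformer module):

import math
import torch

# Report the autograd op that first produces NaN/Inf gradients (slow; debugging only).
torch.autograd.set_detect_anomaly(True)

def train_debug(model, loader, optimizer, device="cuda"):
    for step, (examples, labels) in enumerate(loader):
        examples, labels = examples.to(device), labels.to(device)
        loss = model(examples, labels)      # assumed: the model returns its internal CrossEntropyLoss
        loss_value = loss.item()
        if not math.isfinite(loss_value):   # mirrors the check behind "Loss is nan, stopping training"
            print(f"non-finite loss {loss_value} at step {step}")
            break                           # inspect this batch / the activations instead of exiting
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()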
What can I do to solve this?
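One direction worth checking, though it is not verified against this repo: the deprecation warnings in the log point at a torch.set_default_tensor_type() call, i.e. the frozen LLaMA weights are most likely being run in fp16, and fp16 overflow on long sequences is a common source of NaN losses. The RTX 3090 supports bfloat16, which keeps the fp32 exponent range. A minimal sketch of the idea (model, examples, and labels are placeholders):

import torch

# fp16 overflows just above 65504, while bf16 trades mantissa precision for fp32's exponent range.
x = torch.tensor([70000.0])
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)

# Applied to this run (sketch only): either cast the loaded model,
#     model = model.to(torch.bfloat16)
# or run the forward/backward under bf16 autocast instead of fp16,
#     with torch.cuda.amp.autocast(dtype=torch.bfloat16):
#         loss = model(examples, labels)
# Lowering --blr is another commonly tried knob if the instability persists.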
Did you solve it? How did you solve it?