Update: I believe I fixed the above issue. I installed PyTorch 1.9 with CUDA 11.1 via pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
from this link.
Even though it has hit OOMs a couple of times so far, it seems to recover from them and keep training.
Perhaps update the README to list PyTorch 1.9 as a requirement?
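For reference, a quick sanity check that the installed wheel actually carries the CUDA 11.1 build (a minimal sketch; the expected values in the comments are assumptions based on the install command above):

```python
import torch

print(torch.__version__)              # expect "1.9.0+cu111" if the cu111 wheel was picked up
print(torch.version.cuda)             # expect "11.1"
print(torch.cuda.is_available())      # should be True on this Colab GPU
print(torch.cuda.get_device_name(0))  # e.g. "Tesla P100-PCIE-16GB"
```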
2021-08-02 18:47:18 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:47:18 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:47:18 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:47:18 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:47:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:47:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:47:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:47:23 | INFO | fairseq.utils | rank 0: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB
2021-08-02 18:47:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:47:23 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:47:23 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:47:23 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x561bcc7ea000 @ 0x7f068e361b6b 0x7f068e381379 0x7f05c59c026e 0x7f05c59c19e2 0x7f05c7beecb9 0x7f066a6f7759 0x561af2daea65 0x561af2d6f7b2 0x561af2de2e65 0x561af2dde235 0x561af2d7034b 0x561af2d6fe59 0x561af2eb725d 0x561af2e26c3b 0x561af2d6ef01 0x561af2e60c0d 0x561af2de30d8 0x561af2dde235 0x561af2cafe2c 0x561af2de0318 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a
tcmalloc: large alloc 1625169920 bytes == 0x561c2ddcc000 @ 0x7f068e361b6b 0x7f068e381379 0x7f05c59c026e 0x7f05c59c19e2 0x7f05c7beecb9 0x7f066a6f7759 0x561af2daea65 0x561af2d6f7b2 0x561af2de2e65 0x561af2dde235 0x561af2d7034b 0x561af2d6fe59 0x561af2eb725d 0x561af2e26c3b 0x561af2d6ef01 0x561af2e60c0d 0x561af2de30d8 0x561af2dde235 0x561af2cafe2c 0x561af2de0318 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a
2021-08-02 18:47:29 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:47:29 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:47:29 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:47:29 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:47:29 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:47:29 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:47:30 | INFO | fairseq.trainer | begin training epoch 1
/content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-08-02 18:48:27 | INFO | train_inner | epoch 001: 100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=121.6, ups=1.78, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.649, clip=100, train_wall=56, gb_free=4.4, wall=64
2021-08-02 18:49:20 | INFO | train_inner | epoch 001: 200 / 253944 loss=10.226, nll_loss=6.293, ppl=78.39, wps=129.5, ups=1.87, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=35.015, clip=100, train_wall=53, gb_free=6.8, wall=117
2021-08-02 18:50:15 | INFO | train_inner | epoch 001: 300 / 253944 loss=10.418, nll_loss=6.639, ppl=99.69, wps=125.6, ups=1.84, wpb=68.1, bsz=1.1, num_updates=300, lr=1.8e-05, gnorm=31.224, clip=100, train_wall=54, gb_free=3.5, wall=171
2021-08-02 18:51:09 | INFO | train_inner | epoch 001: 400 / 253944 loss=10.046, nll_loss=6.277, ppl=77.55, wps=130.9, ups=1.82, wpb=71.8, bsz=1.2, num_updates=400, lr=2.4e-05, gnorm=30.56, clip=100, train_wall=54, gb_free=3.5, wall=226
2021-08-02 18:52:05 | INFO | train_inner | epoch 001: 500 / 253944 loss=10.067, nll_loss=6.331, ppl=80.51, wps=133.4, ups=1.81, wpb=73.6, bsz=1.2, num_updates=500, lr=3e-05, gnorm=26.353, clip=100, train_wall=55, gb_free=3.9, wall=281
2021-08-02 18:53:00 | INFO | train_inner | epoch 001: 600 / 253944 loss=10.086, nll_loss=6.371, ppl=82.79, wps=127.6, ups=1.81, wpb=70.6, bsz=1.1, num_updates=600, lr=2.98462e-05, gnorm=28.198, clip=100, train_wall=55, gb_free=7, wall=337
2021-08-02 18:53:56 | INFO | train_inner | epoch 001: 700 / 253944 loss=10.208, nll_loss=6.533, ppl=92.61, wps=128.5, ups=1.79, wpb=71.9, bsz=1.1, num_updates=700, lr=2.96923e-05, gnorm=24.811, clip=100, train_wall=55, gb_free=3.5, wall=393
2021-08-02 18:54:21 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 15.90 GiB total capacity; 14.76 GiB already allocated; 55.75 MiB free; 14.88 GiB reserved in total by PyTorch)
2021-08-02 18:54:21 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 1 | cudaMalloc retries: 3 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 15046 MB | 15114 MB | 14505 GB | 14490 GB |
| from large pool | 15015 MB | 15083 MB | 14253 GB | 14239 GB |
| from small pool | 31 MB | 31 MB | 251 GB | 251 GB |
|---------------------------------------------------------------------------|
| Active memory | 15046 MB | 15114 MB | 14505 GB | 14490 GB |
| from large pool | 15015 MB | 15083 MB | 14253 GB | 14239 GB |
| from small pool | 31 MB | 31 MB | 251 GB | 251 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 15236 MB | 15236 MB | 21906 MB | 6670 MB |
| from large pool | 15194 MB | 15194 MB | 21516 MB | 6322 MB |
| from small pool | 42 MB | 208 MB | 390 MB | 348 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 123909 KB | 4558 MB | 10331 GB | 10331 GB |
| from large pool | 112863 KB | 4522 MB | 10040 GB | 10040 GB |
| from small pool | 11046 KB | 36 MB | 290 GB | 290 GB |
|---------------------------------------------------------------------------|
| Allocations | 2217 | 2218 | 3189 K | 3187 K |
| from large pool | 1081 | 1082 | 1074 K | 1073 K |
| from small pool | 1136 | 1237 | 2115 K | 2114 K |
|---------------------------------------------------------------------------|
| Active allocs | 2217 | 2218 | 3189 K | 3187 K |
| from large pool | 1081 | 1082 | 1074 K | 1073 K |
| from small pool | 1136 | 1237 | 2115 K | 2114 K |
|---------------------------------------------------------------------------|
| GPU reserved segments | 420 | 501 | 734 | 314 |
| from large pool | 399 | 399 | 539 | 140 |
| from small pool | 21 | 104 | 195 | 174 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 80 | 184 | 1469 K | 1469 K |
| from large pool | 56 | 116 | 666 K | 666 K |
| from small pool | 24 | 68 | 802 K | 802 K |
|===========================================================================|
2021-08-02 18:54:21 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2021-08-02 18:54:52 | INFO | train_inner | epoch 001: 801 / 253944 loss=10.156, nll_loss=6.487, ppl=89.73, wps=126, ups=1.8, wpb=70.2, bsz=1.2, num_updates=800, lr=2.95385e-05, gnorm=3.871, clip=100, train_wall=55, gb_free=6, wall=504
2021-08-02 18:56:33 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 170.00 MiB (GPU 0; 15.90 GiB total capacity; 14.32 GiB already allocated; 59.75 MiB free; 14.88 GiB reserved in total by PyTorch)
2021-08-02 18:56:33 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 2 | cudaMalloc retries: 6 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 14665 MB | 14665 MB | 19211 GB | 19197 GB |
| from large pool | 14659 MB | 14659 MB | 18879 GB | 18865 GB |
| from small pool | 6 MB | 6 MB | 332 GB | 332 GB |
|---------------------------------------------------------------------------|
| Active memory | 14665 MB | 14665 MB | 19211 GB | 19197 GB |
| from large pool | 14659 MB | 14659 MB | 18879 GB | 18865 GB |
| from small pool | 6 MB | 6 MB | 332 GB | 332 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 15232 MB | 15232 MB | 26976 MB | 11744 MB |
| from large pool | 15216 MB | 15216 MB | 26420 MB | 11204 MB |
| from small pool | 16 MB | 208 MB | 556 MB | 540 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 580356 KB | 1288 MB | 13590 GB | 13590 GB |
| from large pool | 570355 KB | 1278 MB | 13206 GB | 13206 GB |
| from small pool | 10001 KB | 43 MB | 383 GB | 383 GB |
|---------------------------------------------------------------------------|
| Allocations | 2217 | 2218 | 4214 K | 4212 K |
| from large pool | 1118 | 1118 | 1415 K | 1414 K |
| from small pool | 1099 | 1237 | 2798 K | 2797 K |
|---------------------------------------------------------------------------|
| Active allocs | 2217 | 2218 | 4214 K | 4212 K |
| from large pool | 1118 | 1118 | 1415 K | 1414 K |
| from small pool | 1099 | 1237 | 2798 K | 2797 K |
|---------------------------------------------------------------------------|
| GPU reserved segments | 415 | 510 | 900 | 485 |
| from large pool | 407 | 407 | 622 | 215 |
| from small pool | 8 | 104 | 278 | 270 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 138 | 159 | 1950 K | 1950 K |
| from large pool | 124 | 127 | 887 K | 887 K |
| from small pool | 14 | 87 | 1063 K | 1063 K |
|===========================================================================|
2021-08-02 18:56:33 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2021-08-02 18:56:42 | INFO | train_inner | epoch 001: 1002 / 253944 loss=10.157, nll_loss=6.531, ppl=92.46, wps=111.8, ups=1.82, wpb=61.5, bsz=1.1, num_updates=1000, lr=2.92308e-05, gnorm=24.907, clip=100, train_wall=54, gb_free=4.2, wall=559
2021-08-02 18:57:38 | INFO | train_inner | epoch 001: 1102 / 253944 loss=10.207, nll_loss=6.586, ppl=96.06, wps=122, ups=1.8, wpb=68, bsz=1, num_updates=1100, lr=2.90769e-05, gnorm=24.622, clip=100, train_wall=55, gb_free=7.5, wall=615
2021-08-02 18:58:34 | INFO | train_inner | epoch 001: 1202 / 253944 loss=10.424, nll_loss=6.821, ppl=113.07, wps=127.4, ups=1.79, wpb=71.1, bsz=1.2, num_updates=1200, lr=2.89231e-05, gnorm=23.413, clip=100, train_wall=55, gb_free=5.1, wall=671
2021-08-02 18:59:30 | INFO | train_inner | epoch 001: 1302 / 253944 loss=9.583, nll_loss=5.907, ppl=60.01, wps=140.5, ups=1.78, wpb=79.1, bsz=1.2, num_updates=1300, lr=2.87692e-05, gnorm=21.703, clip=100, train_wall=56, gb_free=3.5, wall=727
Thanks for your attention. I agree with you that "CUDA error: CUBLAS_STATUS_EXECUTION_FAILED" is more likely a version issue. Our environment is: GPU GeForce RTX 3090 (24 GB), NVIDIA driver version 460.67, CUDA version 11.2, torch version 1.8.1. I'm not sure whether it works in other environments.
Hi, after following the instructions here to make the code run for abstractive text summarization, I am running into the following issue:
I am using CUDA 11.4 (I tried 11.0 before), PyTorch 1.8.1, and Python 3.7. I have preprocessed the CNN/Daily Mail data as instructed, am using bart.large, and script/run_train.sh is in its default configuration.
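In case it helps with reproducing, here is a minimal sketch of how the checkpoint could be loaded outside the training script to rule out a corrupted download (assuming the standard fairseq BART API and the /content/bart.large path from the logs above):

```python
# Illustrative only: load the downloaded bart.large checkpoint with the
# standard fairseq BART API, independently of run_train.sh.
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained('/content/bart.large', checkpoint_file='model.pt')
bart.eval()
print(sum(p.numel() for p in bart.parameters()))  # roughly 406M parameters for bart.large
```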
If I run without the --fp16 option, the code instead fails in the following way:
I have tried using the bart.base model, thinking the problem could be due to memory requirements since my GPU only has 16 GB, but then I run into dictionary size issues as described here.
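For that dictionary mismatch, here is a sketch of how the vocabulary sizes could be compared directly (the bart.base path and the dict filename are placeholders for my local layout, not taken from the repo):

```python
# Hypothetical diagnostic: compare the embedding size stored in the checkpoint
# with the dictionary produced by preprocessing the CNN/DM data.
import torch
from fairseq.data import Dictionary

ckpt = torch.load('/content/bart.base/model.pt', map_location='cpu')
emb = ckpt['model']['encoder.embed_tokens.weight']
print('checkpoint vocab size:', emb.shape[0])

src_dict = Dictionary.load('/content/cnn-dailymail/cnn_dm-bin/dict.source.txt')
print('binarized data vocab size:', len(src_dict))
```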
Any advice on the above?