Closed Othergreengrasses closed 2 months ago
Happy to take a look. Can you provide the hyperparameter YAML file and the command you used? (I didn't see that in the Gist, maybe I missed it though.)
Your device ran out of memory. You need to lower the max batch size and accumulate gradients to simulate the requested batch size.
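To illustrate why accumulating gradients can stand in for the larger batch, here is a minimal pure-Python sketch. The toy one-parameter model, the data, and the function names are all made up for illustration; none of this is yoyodyne code. With equal-sized mini-batches, averaging the mini-batch gradients gives the same value as the full-batch gradient.

```python
def mean_gradient(w, xs, ys):
    """Gradient of the mean of 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, mini_batch_size):
    """Average the gradients of equal-sized mini-batches, one at a time."""
    if len(xs) % mini_batch_size != 0:
        raise ValueError("batch must split evenly into mini-batches")
    steps = len(xs) // mini_batch_size
    total = 0.0
    for i in range(steps):
        lo = i * mini_batch_size
        hi = lo + mini_batch_size
        # Scale each mini-batch gradient by 1 / steps before accumulating.
        total += mean_gradient(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


# Hypothetical toy data: a "requested" batch of 8 examples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

full = mean_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, mini_batch_size=2)
```

Only one mini-batch needs to be held in memory at a time, which is what lowers peak GPU usage. In PyTorch Lightning the corresponding knob is the `accumulate_grad_batches` argument of `Trainer`.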
On Thu, Apr 11, 2024 at 5:39 PM Kyle Gorman wrote:
> Happy to take a look. Can you provide the hyperparameter YAML file and the command you used? (I didn't see that in the Gist, maybe I missed it though.)
@kylebgorman GitHub doesn't support attaching YAML files, so I am just pasting the contents of the file (btw, I got the YAML file from you):
method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  # Constants.
  arch:
    value: transformer
  max_epochs:
    value: 200
  patience:
    value: 40
  reduceonplateau_mode:
    value: accuracy
  gradient_clip_val:
    value: 3
  # Hyperparameters.
  attention_heads:
    values: [4, 6, 8]
  encoder_layers:
    values: [4, 6, 8]
  decoder_layers:
    values: [4, 6, 8]
  embedding_size:
    distribution: q_uniform
    q: 16
    min: 16
    max: 512
  hidden_size:
    distribution: q_uniform
    q: 64
    min: 64
    max: 1024
  dropout:
    distribution: uniform
    min: 0
    max: 0.5
  label_smoothing:
    distribution: uniform
    min: 0.0
    max: 0.2
  batch_size:
    distribution: q_uniform
    q: 128
    min: 128
    max: 2048
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.01
  scheduler:
    values: [null, reduceonplateau, warmupinvsqrt]
  reduceonplateau_factor:
    distribution: uniform
    min: 0.1
    max: 0.9
  reduceonplateau_patience:
    distribution: q_uniform
    q: 1
    min: 1
    max: 5
  min_lr:
    distribution: log_uniform_values
    min: 0.000001
    max: 0.001
  warmup_samples:
    distribution: q_uniform
    q: 100
    min: 100
    max: 5000000
Commands that I used:
wandb sweep --entity ENTITY --project Google_ben transformer_broader_config.yaml
./train_wandb_sweep.py --entity ENTITY --project Google_ben --sweep_id SWEEPID --model_dir models --experiment Google_transformer --train g2p-Google-train.tsv --val g2p-Google-dev.tsv --arch transformer --patience 10 --max_time 00:06:00:00 --count 200 --accelerator gpu --seed 1818 --source_sep ' ' --target_sep ' '
@Adamits this is a pretty straightforward grid for a transformer, I think, so it's a surprise to me this would OOM on the vast majority of runs on a... I believe @Othergreengrasses is on a 4th generation Nvidia card. (That said I don't have one at home so I can only replicate with a 1st gen card.)
Yes, Kyle you are right. I am on a 4th generation Nvidia card.
Also, the error says that I have 8.19 MB free, so I don't understand why it is running out of memory.
I'm running the experiment on a fresh environment (python 3.10) installing yoyodyne from source.
You could try the suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
(prepend it to whatever command you're running) if you haven't. I haven't tried this yet.
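If prepending the variable on the command line is inconvenient, one alternative is to set it from Python at the very top of the script. This is a sketch under an assumption not tested in this thread: as far as I know the CUDA caching allocator reads the variable when CUDA is first initialized, so setting it before importing torch is the conservative choice.

```python
import os

# Assumption: this runs before `import torch` and before any CUDA work,
# so the allocator sees the setting when it initializes.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# ... the rest of the training script (imports of torch, etc.) would follow.
alloc_conf = os.environ["PYTORCH_CUDA_ALLOC_CONF"]
```

This only affects how PyTorch's allocator handles fragmentation; it does not make a model fit that genuinely needs more memory than the card has.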
There's a panel in W&B for "GPU Memory Allocated (%)" (under "System"). If there is a memory leak across the multiple runs of a sweep, you'd expect that this would creep up as the number of runs increased. I just checked an old sweep (from before the supposed fix we put in place for this) and I don't see this pattern at all. Not sure what to make of that.
> You could try the suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (prepend it to whatever command you're running) if you haven't. I haven't tried this yet.
This didn't work either.
After replicating some of these things on our lab machines (with GTX 1080s, which have 8 GB of VRAM), I think that the transformer models we are working with are just too large for the combination of batch sizes, number of layers, and hidden layer dimensionalities. I am not seeing evidence of a leak. Automagical batch sizing (à la #148, but let's also stipulate that having found the max batch size it then picks the right size for mini-batches and the right number of mini-batches per batch) ought to handle this for good.
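As a rough sketch of the second half of that idea (picking a mini-batch size and a number of mini-batches once a maximum batch size is known), something like the following could work. The function name and the example maximum of 600 are hypothetical, and this is not the design proposed in #148; it simply finds the smallest number of equal-sized mini-batches that each fit under the maximum.

```python
import math


def split_batch(requested_batch_size, max_batch_size):
    """Returns (mini_batch_size, num_mini_batches) for a requested batch.

    The mini-batches are equal-sized, each at most max_batch_size, and
    together they cover exactly requested_batch_size examples.
    """
    if requested_batch_size < 1 or max_batch_size < 1:
        raise ValueError("batch sizes must be positive")
    # Fewest mini-batches that could possibly fit under the maximum.
    steps = math.ceil(requested_batch_size / max_batch_size)
    # Increase until the requested size divides evenly; this terminates
    # because steps == requested_batch_size always divides evenly.
    while requested_batch_size % steps != 0:
        steps += 1
    return requested_batch_size // steps, steps


# Hypothetical: the sweep requests 2048 but only 600 examples fit at once.
mini_batch_size, num_mini_batches = split_batch(2048, 600)
```

With these made-up numbers the result is 4 mini-batches of 512. A requested size that already fits is returned unchanged with a single mini-batch.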
I'm going to close this for now. We can return to it later. #148 is the path forward...
I'm trying to run W&B sweeps for the transformer. It finished 5 runs successfully and failed on the rest of the 150 runs. Here is the error that I got after the first 5 runs:
Run x7n6tsz8 errored:
Traceback (most recent call last):
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/wandb/agents/pyagent.py", line 308, in _run_job
    self._function()
  File "/home/aru-sarthak/yoyodyne_041123/yoyodyne/examples/wandb_sweeps/./train_wandb_sweep.py", line 43, in train_sweep
    best_checkpoint = train.train(trainer, model, datamodule, args.train_from)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/yoyodyne/train.py", line 253, in train
    trainer.fit(model, datamodule, ckpt_path=train_from)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self.strategy.setup(self)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/strategies/single_device.py", line 73, in setup
    self.model_to_device()
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/strategies/single_device.py", line 70, in model_to_device
    self.model.to(self.root_device)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 54, in to
    return super().to(*args, **kwargs)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 15.69 GiB of which 4.38 MiB is free. Process 2063 has 211.51 MiB memory in use. Including non-PyTorch memory, this process has 15.23 GiB memory in use. Of the allocated memory 14.92 GiB is allocated by PyTorch, and 60.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Gist to reproduce the error - https://gist.github.com/Othergreengrasses/4b950d336a112dd799b4120bcbeb60e7