CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

W&B sweeps giving OOM #178

Closed: Othergreengrasses closed this issue 2 months ago

Othergreengrasses commented 2 months ago

I'm trying to run a W&B sweep for the transformer. It was able to finish 5 runs successfully and then failed on the rest of the 150 runs. Here is the error that I got after those 5 runs:

Run x7n6tsz8 errored:
Traceback (most recent call last):
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/wandb/agents/pyagent.py", line 308, in _run_job
    self._function()
  File "/home/aru-sarthak/yoyodyne_041123/yoyodyne/examples/wandb_sweeps/./train_wandb_sweep.py", line 43, in train_sweep
    best_checkpoint = train.train(trainer, model, datamodule, args.train_from)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/yoyodyne/train.py", line 253, in train
    trainer.fit(model, datamodule, ckpt_path=train_from)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self.strategy.setup(self)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/strategies/single_device.py", line 73, in setup
    self.model_to_device()
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/strategies/single_device.py", line 70, in model_to_device
    self.model.to(self.root_device)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 54, in to
    return super().to(*args, **kwargs)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 15.69 GiB of which 4.38 MiB is free. Process 2063 has 211.51 MiB memory in use. Including non-PyTorch memory, this process has 15.23 GiB memory in use. Of the allocated memory 14.92 GiB is allocated by PyTorch, and 60.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Gist to reproduce the error - https://gist.github.com/Othergreengrasses/4b950d336a112dd799b4120bcbeb60e7

kylebgorman commented 2 months ago

Happy to take a look. Can you provide the hyperparameter YAML file and the command you used? (I didn't see that in the Gist, maybe I missed it though.)

Adamits commented 2 months ago

Your device ran out of memory. You need to lower the max batch size and accumulate gradients to simulate the requested batch size.

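Concretely, the accumulation trick looks roughly like this in plain PyTorch Lightning (a minimal sketch with illustrative numbers; how yoyodyne surfaces these knobs on its own CLI is an assumption here, not something taken from this thread):

# Minimal sketch (illustrative, not yoyodyne's CLI): simulate a large
# effective batch size with a small per-step batch plus gradient accumulation.
import math

import pytorch_lightning as pl

requested_batch_size = 2048  # what the sweep sampled
max_batch_size = 256         # largest batch that actually fits on the GPU

# Number of small batches to accumulate before each optimizer step.
accumulate = math.ceil(requested_batch_size / max_batch_size)  # -> 8

trainer = pl.Trainer(
    accelerator="gpu",
    accumulate_grad_batches=accumulate,  # 8 x 256 approximates a batch of 2048
)
# trainer.fit(model, datamodule)  # with the datamodule built at batch_size=max_batch_size

The gradients are averaged over the accumulated mini-batches, so the update is close to (though not bit-for-bit identical to) what a single 2048-item batch would give.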

Othergreengrasses commented 2 months ago

@kylebgorman GitHub doesn't let me attach a YAML file, so I'm pasting its contents here (by the way, I got the YAML file from you):

method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  # Constants.
  arch:
    value: transformer
  max_epochs:
    value: 200
  patience:
    value: 40
  reduceonplateau_mode:
    value: accuracy
  gradient_clip_val:
    value: 3
  # Hyperparameters.
  attention_heads:
    values: [4, 6, 8]
  encoder_layers:
    values: [4, 6, 8]
  decoder_layers:
    values: [4, 6, 8]
  embedding_size:
    distribution: q_uniform
    q: 16
    min: 16
    max: 512
  hidden_size:
    distribution: q_uniform
    q: 64
    min: 64
    max: 1024
  dropout:
    distribution: uniform
    min: 0
    max: 0.5
  label_smoothing:
    distribution: uniform
    min: 0.0
    max: 0.2
  batch_size:
    distribution: q_uniform
    q: 128
    min: 128
    max: 2048
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.01
  scheduler:
    values: [null, reduceonplateau, warmupinvsqrt]
  reduceonplateau_factor:
    distribution: uniform
    min: 0.1
    max: 0.9
  reduceonplateau_patience:
    distribution: q_uniform
    q: 1
    min: 1
    max: 5
  min_lr:
    distribution: log_uniform_values
    min: 0.000001
    max: 0.001
  warmup_samples:
    distribution: q_uniform
    q: 100
    min: 100
    max: 5000000

Commands that I used:

wandb sweep --entity ENTITY --project Google_ben transformer_broader_config.yaml

./train_wandb_sweep.py --entity ENTITY --project Google_ben --sweep_id SWEEPID --model_dir models --experiment Google_transformer --train g2p-Google-train.tsv --val g2p-Google-dev.tsv --arch transformer --patience 10 --max_time 00:06:00:00 --count 200 --accelerator gpu --seed 1818 --source_sep ' ' --target_sep ' '
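
For anyone following along, the second command above is an instance of the standard W&B agent pattern: each run asks the sweep controller for a sampled configuration and reads it back from wandb.config. A minimal sketch of that pattern (illustrative only; this is not the actual contents of train_wandb_sweep.py):

# Generic W&B agent loop (illustrative; see examples/wandb_sweeps for the real script).
import wandb

def train_sweep():
    # Inside an agent, wandb.init() picks up the hyperparameters that the
    # sweep controller sampled for this particular run.
    wandb.init()
    config = wandb.config
    print(config.batch_size, config.hidden_size, config.encoder_layers)
    # ... build the model and datamodule from `config`, then call trainer.fit() ...

# The agent calls train_sweep once per run, up to `count` times.
wandb.agent("SWEEPID", function=train_sweep, count=200,
            entity="ENTITY", project="Google_ben")

Since every run re-samples batch_size, hidden_size, and the layer counts independently, individual runs can land on combinations that are much heavier than the ones that succeeded.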

kylebgorman commented 2 months ago

@Adamits this is a pretty straightforward grid for a transformer, I think, so it's a surprise to me that this would OOM on the vast majority of runs. I believe @Othergreengrasses is on a 4th-generation Nvidia card. (That said, I don't have one at home, so I can only replicate with a 1st-gen card.)

Othergreengrasses commented 2 months ago

Yes, Kyle, you are right. I am on a 4th-generation Nvidia card.

Also, the error says there are 8.19 MB free, so I don't understand why it is throwing an OOM.

I'm running the experiment in a fresh environment (Python 3.10) with yoyodyne installed from source.
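
A quick way to see what the allocator actually has available at that point (plain PyTorch, not something yoyodyne exposes; run it in the same process that is training):

# Diagnostic only: free vs. total device memory, and PyTorch's own usage.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # current device
print(f"free: {free_bytes / 2**20:.1f} MiB of {total_bytes / 2**30:.2f} GiB")
print(f"allocated by tensors: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved by the caching allocator: {torch.cuda.memory_reserved() / 2**30:.2f} GiB")

The few MiB reported as "free" are what remains after PyTorch's ~15 GiB of allocations, so even a 2 MiB request can fail.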

kylebgorman commented 2 months ago

You could try the suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (prepend it to whatever command you're running) if you haven't. I haven't tried this yet.
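
For example, with the agent command above that would be something like (rest of the arguments unchanged):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ./train_wandb_sweep.py --entity ENTITY --project Google_ben --sweep_id SWEEPID ...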

kylebgorman commented 2 months ago

There's a panel in W&B for "GPU Memory Allocated (%)" (under "System"). If there were a memory leak across the runs of a sweep, you'd expect this to creep up as the number of runs increases. I just checked an old sweep (from before the supposed fix we put in place for this) and I don't see that pattern at all. Not sure what to make of that.
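
For a finer-grained check than the W&B panel, one could log peak memory at the end of each run and reset the statistics. A sketch in plain PyTorch, with a hypothetical helper name (not part of yoyodyne):

# Illustrative leak check between sweep runs.
import gc

import torch

def report_and_reset(tag: str) -> None:
    # Peak memory allocated by tensors since the last reset.
    print(f"[{tag}] peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
    gc.collect()                  # drop dangling references
    torch.cuda.empty_cache()      # return cached blocks to the driver
    torch.cuda.reset_peak_memory_stats()

# Call report_and_reset(run_id) at the end of every run; if the numbers keep
# climbing run over run, that points to a leak rather than to a single
# oversized configuration.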

Othergreengrasses commented 2 months ago

> You could try the suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (prepend it to whatever command you're running) if you haven't. I haven't tried this yet.

This didn't work either.

kylebgorman commented 2 months ago

After replicating some of these runs on our lab machines (GTX 1080s, which have 8 GB of VRAM), I think the transformer models we are working with are simply too large for some combinations of batch size, number of layers, and hidden layer dimensionality. I am not seeing evidence of a leak. Automagical batch sizing (à la #148, with the added stipulation that, having found the max batch size, it then picks the right mini-batch size and the right number of mini-batches per effective batch) ought to handle this for good.
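
One plausible shape for that automagic sizing, sketched in plain PyTorch with hypothetical helper names (#148 may well end up doing this differently):

# Hypothetical sketch: probe for the largest batch that fits, then derive the
# accumulation factor needed to simulate the requested batch size.
import math

import torch

def max_fitting_batch_size(try_step, requested: int, start: int = 8) -> int:
    """Double the batch size until a training step OOMs; return the last size that fit."""
    size = start
    while size * 2 <= requested:
        try:
            try_step(size * 2)  # try_step: closure running one forward/backward pass
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return size
        size *= 2
    return size

def accumulation_factor(requested: int, fitting: int) -> int:
    """Mini-batches of size `fitting` needed to approximate one batch of size `requested`."""
    return math.ceil(requested / fitting)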

kylebgorman commented 2 months ago

I'm going to close this for now. We can return to it later. #148 is the path forward...