agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

PT5_LoRA_Finetuning_per_prot.ipynb - memory accumulation during validation #153

Open Fredditeddy opened 4 months ago

Fredditeddy commented 4 months ago

Hi all,

I am currently experimenting with your provided code. Your plot of memory usage for different batch sizes and max_length values matches our training setup well. However, when monitoring memory usage, two things stand out:

  1. Memory does not seem to be freed after training.
  2. Memory seems to accumulate during validation.

I could not find a solution for 1.

For 2., setting eval_accumulation_steps seems to work, since it transfers the model outputs to the CPU.
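
For reference, a minimal sketch of how eval_accumulation_steps is set in the Hugging Face TrainingArguments; all other values below are placeholders, not the notebook's actual configuration:

```python
from transformers import TrainingArguments

# Sketch: eval_accumulation_steps moves the accumulated prediction tensors
# from GPU to CPU every N evaluation steps. output_dir and batch size are
# placeholder values.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=4,
    eval_accumulation_steps=1,  # offload outputs to CPU after every eval step
)
```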

Do you have an idea?

Keep up the great work.

Best wishes, Frederik

Fredditeddy commented 4 months ago

Update:

eval_accumulation_steps does not work either, since it accumulates all output tensors in CPU RAM.

What works so far is not returning hidden_states and attentions.
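
As a rough sketch, assuming the ProtT5 model is loaded as `model` and the extra outputs are controlled through the standard transformers config flags (not necessarily how the notebook wires it up):

```python
# Sketch: disable the memory-heavy extra outputs so the evaluation loop
# does not keep every layer's hidden states and attention maps around.
model.config.output_hidden_states = False
model.config.output_attentions = False
```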

However, I do not understand why this is not an issue for the training loop.

I additionally added a callback that calls torch.cuda.empty_cache() after each epoch, which seems to free the memory after the training loop.
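
For context, such a callback can be sketched with the standard transformers TrainerCallback API (the class name here is made up for illustration):

```python
import torch
from transformers import TrainerCallback

class EmptyCacheCallback(TrainerCallback):
    """Release cached, unused GPU memory at the end of each epoch."""
    def on_epoch_end(self, args, state, control, **kwargs):
        torch.cuda.empty_cache()

# Usage sketch: pass the callback when constructing the Trainer, e.g.
# trainer = Trainer(model=model, args=training_args, callbacks=[EmptyCacheCallback()])
```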