LUMIA-Group / rasat

The official implementation of the paper "RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL" (EMNLP 2022)
https://arxiv.org/abs/2205.06983
Apache License 2.0

Out of memory with default configs/train.json on 4*24GB GPU #10

Closed shenyang0111ucf closed 1 year ago

shenyang0111ucf commented 1 year ago

Hi @JiexingQi, I found you asked a similar question here: https://github.com/ServiceNow/picard/issues/29. I tried to train t5-3b using

```
CUDA_VISIBLE_DEVICES="0,1,2,3" python3 -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 seq2seq/run_seq2seq.py configs/train.json
```

even with a config like this:

```json
"per_device_train_batch_size": 1,
"per_device_eval_batch_size": 1,
"gradient_accumulation_steps": 1,
"gradient_checkpointing": true,
```

But I still got an out-of-memory error, and all four GPUs' memory was used up (about 22 GB on each GPU). I think you must have had similar experience when using the PICARD code. Could you show me how you solved this annoying out-of-memory problem? Thank you!

JiexingQi commented 1 year ago

Hi @shenyang0111ucf, which type of GPU do you use?

shenyang0111ucf commented 1 year ago

> Hi @shenyang0111ucf, which type of GPU do you use?

@JiexingQi 4 NVIDIA TITAN RTX 24 GB cards.

JiexingQi commented 1 year ago

A 24 GB GPU does not seem able to train the T5-3B model; we used a 40 GB A100 to train it (the same as PICARD). By the way, evaluation can be run on a 24 GB 3090 GPU.
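For a rough sense of scale (my own back-of-envelope estimate, not from the repo): assuming T5-3B has about 3 billion parameters and is trained in fp32 with Adam, the per-replica memory for weights, gradients, and optimizer states alone already exceeds a 24 GB card, before counting activations:

```python
# Back-of-envelope memory estimate (an assumption for illustration, not measured).
# fp32 Adam training keeps, per parameter:
#   4 bytes weights + 4 bytes gradients + 8 bytes optimizer states (m and v)
num_params = 3_000_000_000  # approximate parameter count of T5-3B
bytes_per_param = 4 + 4 + 8
total_gib = num_params * bytes_per_param / 1024**3
print(f"{total_gib:.1f} GiB before activations")  # -> 44.7 GiB before activations
```

Note that plain data parallelism (what `torch.distributed.launch` sets up) replicates this full footprint on every GPU, so four 24 GB cards do not combine into one 96 GB pool.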

shenyang0111ucf commented 1 year ago

> A 24 GB GPU does not seem able to train the T5-3B model; we used a 40 GB A100 to train it (the same as PICARD). By the way, evaluation can be run on a 24 GB 3090 GPU.

I tried to use four 24 GB graphics cards instead of one 40 GB A100 to train the model; do you have any experience with `torch.distributed.launch` that would make this work?

JiexingQi commented 1 year ago

Maybe you could try model parallelism in this situation, but I have not tried it myself.

shenyang0111ucf commented 1 year ago

> Maybe you could try model parallelism in this situation, but I have not tried it myself.

OK, I will try to find out how to fix this problem with model parallelism. Thank you for your time!
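For anyone trying the same route, here is a minimal sketch of naive model parallelism, assuming a transformers version that still supports the (since-deprecated) `parallelize()` API on T5; the device-map helper below is my own illustration, not something from this repo:

```python
# Sketch: spread T5-3B's transformer blocks across 4 GPUs (illustrative, untuned).
def make_device_map(num_layers: int, num_gpus: int) -> dict:
    """Assign transformer block indices to GPUs as evenly as possible."""
    per_gpu, rem = divmod(num_layers, num_gpus)
    device_map, start = {}, 0
    for gpu in range(num_gpus):
        count = per_gpu + (1 if gpu < rem else 0)  # early GPUs take the remainder
        device_map[gpu] = list(range(start, start + count))
        start += count
    return device_map

device_map = make_device_map(num_layers=24, num_gpus=4)  # T5-3B has 24 blocks
# from transformers import T5ForConditionalGeneration
# model = T5ForConditionalGeneration.from_pretrained("t5-3b")
# model.parallelize(device_map)  # moves the listed blocks onto each GPU
```

With model parallelism, the script would run as a single process that owns all four GPUs, rather than one process per GPU as `torch.distributed.launch` does.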

JiexingQi commented 1 year ago

You are welcome!

kanseaveg commented 1 year ago

Excuse me, regarding the question raised by @shenyang0111ucf, I would like to ask whether T5-3B can run on 4 NVIDIA GeForce RTX 3090 graphics cards, each also with 24 GB. Thank you. @JiexingQi

JiexingQi commented 1 year ago

> A 24 GB GPU does not seem able to train the T5-3B model; we used a 40 GB A100 to train it (the same as PICARD). By the way, evaluation can be run on a 24 GB 3090 GPU.

I think it is not enough for training, but it works for evaluation.