IBM / transition-amr-parser

SoTA Abstract Meaning Representation (AMR) parsing with word-node alignments in PyTorch. Includes checkpoints and other tools such as statistical significance Smatch.
Apache License 2.0
246 stars, 48 forks

Problems with CUDA out of memory #23

Open YIKMAT opened 2 years ago

YIKMAT commented 2 years ago

I attempted to train the model using bash run/run_experiment.sh configs/amr2.0-structured-bart-large-sep-voc.sh, and it looks like my 12GB 2080Ti GPU doesn't have enough memory. In fact, I have two 12GB 2080Ti GPUs on the server, but only one of them was used during training. Does the code support multiple GPUs? Is there anything else I need to modify?

YIKMAT commented 2 years ago

Hi, here are the details: [screenshots: gpu0, gpu1]

ramon-astudillo commented 2 years ago

The code is single-GPU. You can configure gradient accumulation (see $update_freq in the configs).
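For readers unfamiliar with gradient accumulation, here is a minimal sketch of the general idea in plain Python. It is not the parser's training code: the toy model, the data, and names such as `accumulated_step` and `micro_batch_size` are made up for illustration. It only shows that summing gradients over small micro-batches and updating once gives the same step as one large batch, while only one micro-batch needs to be in memory at a time.

```python
# Toy 1-D least-squares model y ≈ w * x. All names here are illustrative
# assumptions, not part of transition-amr-parser.

def grad_sum(w, batch):
    """Sum over the batch of d/dw (w*x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def full_batch_step(w, data, lr):
    """One SGD step using the mean gradient over the whole batch."""
    return w - lr * grad_sum(w, data) / len(data)

def accumulated_step(w, data, lr, micro_batch_size):
    """Same update, but only one micro-batch is processed at a time."""
    accumulated = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        accumulated += grad_sum(w, micro_batch)  # accumulate, no update yet
    return w - lr * accumulated / len(data)  # single update at the end

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w_full = full_batch_step(0.0, data, lr=0.01)
w_accum = accumulated_step(0.0, data, lr=0.01, micro_batch_size=1)
print(w_full, w_accum)
```

In this sketch the number of micro-batches per update plays the role that an accumulation setting such as $update_freq would play in a real training config: a smaller per-step batch with more accumulation steps trades speed for lower peak memory.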

AngledLuffa commented 2 years ago

I tried reimplementing a general Stack Transformer, and I found that on long sequences the memory cost of the stack got quite high. The softmax used to compute attention in particular leads to quadratic growth, as each softmax gets progressively longer and needs to be kept until the optimizer step. Did you find a way to solve that, or is that related to the memory problems in this issue?

ramon-astudillo commented 2 years ago

Sorry for the delay. I do not understand the question. The stack-Transformer only masks the attention of a normal Transformer, so it does not have any additional cost beyond computing the mask.

AngledLuffa commented 2 years ago

I meant, the backprop in a long sequence can get prohibitively expensive. When keeping the entire sequence, the softmax terms get longer and longer, and the early ones are kept until the end of the sequence unless you backprop each time step, so the total memory cost winds up being quadratic.
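As a rough illustration of that quadratic growth, here is a small counting sketch. It rests on a simplifying assumption: at step t the attention softmax covers t positions, and nothing is freed until the backward pass at the end of the sequence. The function name is hypothetical, and the count is per attention head and per layer, in entries rather than bytes.

```python
def stored_attention_scores(num_steps):
    """Count softmax entries kept for backprop when step t attends over
    t positions and all steps are retained until the end of the sequence."""
    return sum(t for t in range(1, num_steps + 1))  # equals T * (T + 1) / 2

for num_steps in (100, 200, 400):
    print(num_steps, stored_attention_scores(num_steps))
```

Doubling the sequence length roughly quadruples the count, which is why long transition sequences hurt much more than their length alone suggests.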

ramon-astudillo commented 1 year ago

But that is a property of the Transformer, not of the stack-Transformer. That was my point: in that regard they are equal.

AngledLuffa commented 1 year ago

It is true that in the use case I tried it for (a transition-based constituency parser), the transition sequences wound up being substantially longer than the sentence itself, and therefore the memory usage might be much higher for the transformer at the word input level.