[Open] YIKMAT opened this issue 2 years ago
Hi, here is the detail:
The code is single-GPU; you can configure gradient accumulation (see `$update_freq` in the configs).
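For intuition, here is a minimal, generic sketch of what gradient accumulation does (plain PyTorch with a placeholder model and data, not this repo's fairseq code; `update_freq = 4` is just an example value): gradients from several small batches are summed before a single optimizer step, so the effective batch size grows without increasing peak GPU memory.

```python
import torch

# Placeholder model, optimizer and data; the repo handles this inside fairseq via $update_freq.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

update_freq = 4  # accumulate gradients over 4 small batches before each optimizer step

optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / update_freq).backward()   # scale so the accumulated gradient matches one large batch
    if (i + 1) % update_freq == 0:
        optimizer.step()              # effective batch size here: 8 * update_freq = 32
        optimizer.zero_grad()
```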
I tried reimplementing a general stack-Transformer, and I found that on long sequences the memory cost of the stack gets quite expensive. The attention softmax in particular leads to quadratic growth, since each step's softmax gets progressively longer and has to be kept until the optimizer step. Did you find a way to solve that, or is it related to the memory problems in this issue?
Sorry for the delay. I do not understand the question: the stack-Transformer just masks the attention of a normal Transformer, and as such it does not have any additional cost beyond computing the mask.
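To illustrate that point, a minimal sketch (generic PyTorch, not the actual stack-Transformer code; the mask contents below are random placeholders, whereas the real mask is derived from the parser's stack and buffer state): the structure only enters as a boolean mask applied to the scores of standard scaled dot-product attention, so the shapes and cost match unmasked attention.

```python
import torch
import torch.nn.functional as F

T, d = 6, 16                          # toy sequence length and head dimension
q, k, v = (torch.randn(T, d) for _ in range(3))

# Placeholder mask: True where attention is blocked. In the real model this would encode
# which positions are on the stack / in the buffer at each step.
stack_mask = torch.rand(T, T) > 0.5
stack_mask.fill_diagonal_(False)      # make sure every position can attend somewhere

scores = (q @ k.T) / d ** 0.5                          # standard attention scores
scores = scores.masked_fill(stack_mask, float("-inf"))
attn = F.softmax(scores, dim=-1) @ v                   # same memory/compute as unmasked attention
```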
I meant that backprop over a long sequence can get prohibitively expensive. When keeping the entire sequence, the softmax terms get longer and longer, and the early ones are kept around until the end of the sequence unless you backprop at each time step, so the total memory cost winds up being quadratic.
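For concreteness, a back-of-the-envelope sketch of that growth (the sequence length is a made-up example): if the softmax at step t spans t positions and every step's activations are kept until the optimizer step at the end of the sequence, the total kept is 1 + 2 + ... + T = T(T+1)/2 entries, i.e. quadratic in T.

```python
T = 1024                              # hypothetical transition-sequence length
kept = [t for t in range(1, T + 1)]   # step t's softmax covers t previous positions
print(sum(kept), T * (T + 1) // 2)    # 524800 524800 -> grows as O(T^2)
```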
But that is a property of the Transformer, not of the stack-Transformer; that was my point. In that regard they are equal.
It is true that in the use case I tried it for (a transition-based constituency parser), the transition sequences wound up being substantially longer than the sentence itself, and therefore the memory usage may be much higher than for a Transformer running at the word-input level.
I attempted to train the model with

```
bash run/run_experiment.sh configs/amr2.0-structured-bart-large-sep-voc.sh
```

and it looks like my 12GB 2080Ti GPU doesn't have enough memory. In fact, I have two 12GB 2080Ti GPUs on the server, but only one of them is used during training. Does the code use multiple GPUs? Is there anything else I need to modify?
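As a quick diagnostic (generic PyTorch, not part of the repo), you can list the GPUs the training process can actually see; since the training code is single-GPU, only one of the two cards will be used either way:

```python
import torch

# List the CUDA devices visible to this process and their total memory.
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}  {p.name}  {p.total_memory / 1024**3:.1f} GiB")
```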