Open fierceX opened 5 years ago
The purpose of Pad is to form rectangular input so that mini-batch acceleration can be achieved; the amount of padding is determined by the maximum length in the batch. The memory cost can be reduced by using bucketing, and you can also consider clipping the training sequences in the preprocessing pipeline.
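A minimal illustration of that behaviour (not code from this PR):

```python
import gluonnlp as nlp

# Pad fills every sample up to the longest sequence in the batch it is given.
pad = nlp.data.batchify.Pad(pad_val=0)
batch = pad([[1, 2, 3], [4, 5], [6]])
print(batch.shape)   # (3, 3): three samples, each padded to the batch maximum of 3
```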
Hi @fierceX, I took a quick look at the code in 7219400. The dataloader uses batchify directly, so when batches are formed randomly the samples within a batch can differ greatly in length. This leads to a lot of unnecessary padding, which may be the reason for the out-of-memory error. You can use nlp.data.sampler.FixedBucketSampler to improve this, or use gluonnlp.data.ClipSequence to clip the maximum sentence length during data preprocessing.
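A rough sketch of the FixedBucketSampler suggestion (the toy dataset and the (token_ids, label) sample layout are assumptions for illustration, not code from the PR):

```python
import gluonnlp as nlp
from mxnet.gluon.data import DataLoader, SimpleDataset

# Toy stand-in for the real preprocessed features: (token_ids, label) pairs.
train_data = SimpleDataset([([1, 2, 3, 4], 0), ([5, 6], 1), ([7, 8, 9], 0), ([1, 2], 1)])
lengths = [len(tokens) for tokens, _ in train_data]

# Pad token ids only up to the longest sample within each bucketed batch.
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(pad_val=0),
    nlp.data.batchify.Stack())

sampler = nlp.data.sampler.FixedBucketSampler(
    lengths, batch_size=2, num_buckets=2, ratio=0, shuffle=True)
print(sampler.stats())   # bucket keys, sample counts, batch sizes

loader = DataLoader(train_data, batch_sampler=sampler, batchify_fn=batchify_fn)
for token_ids, labels in loader:
    print(token_ids.shape, labels.shape)
```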
@kenjewu In commit https://github.com/dmlc/gluon-nlp/pull/493/commits/b6be61cd6d88d1e559c0360b5be35256bba7dc93 I used FixedBucketSampler, but GPU memory still grows and gradually reaches the same size as before.
Can you show sampler.stats() here? This may help to troubleshoot the problem.
batch_size=6, num_buckets=10, ratio=0, shuffle=True
FixedBucketSampler:
sample_num=88641, batch_num=14777
key=[78, 112, 146, 180, 214, 248, 282, 316, 350, 384]
cnt=[3698, 7202, 24221, 21614, 13458, 7671, 4806, 2772, 1499, 1700]
batch_size=[6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
From the key you can see the sequence length of each bucket. Is this very different from the maximum length you set before? If the difference is too large, you may need to clip the long sequences.
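For reference, a minimal sketch of clipping in preprocessing (the maximum length of 180 is just an illustrative value):

```python
import gluonnlp as nlp

clip = nlp.data.ClipSequence(180)   # keep at most the first 180 tokens
tokens = list(range(300))           # a sequence that is longer than the limit
print(len(clip(tokens)))            # 180
```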
gluonnlp.data.batchify.Pad should accept a maximum-length parameter. Variable-length input increases memory usage, which is especially noticeable when fine-tuning the BERT model on the SQuAD dataset. In commit https://github.com/dmlc/gluon-nlp/pull/493/commits/7219400545f15a7597d1bde98efd608ef96bcbd0 I used gluonnlp.data.batchify.Pad to pad each batch dynamically to its own length, but this causes a significant increase in memory usage. Specifically, when running the script with python finetune_squad.py --optimizer adam --batch_size 12 --lr 3e-5 --epochs 2 --gpu, the run previously used no more than 13 GB of GPU memory; after this change, a 16 GB V100 runs out of memory.
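Since Pad currently has no maximum-length argument, one workaround is to enforce a fixed length during preprocessing; the helper below is my own illustration of that idea, not an existing gluonnlp API.

```python
# Hypothetical helper: truncate or pad every sequence to the same fixed length,
# so per-batch memory no longer depends on the longest sample in the batch.
def pad_to_fixed(seqs, max_len, pad_val=0):
    return [list(s)[:max_len] + [pad_val] * max(0, max_len - len(s)) for s in seqs]

print(pad_to_fixed([[1, 2, 3], [4, 5, 6, 7, 8]], max_len=4))
# [[1, 2, 3, 0], [4, 5, 6, 7]]
```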