ZHAOTING / dialog-processing

NLG and NLU for dialogue processing
Apache License 2.0

GPU/hardware specs for joint segmentation & dialog act prediction experiments #6

Closed · trangham283 closed this issue 3 years ago

trangham283 commented 3 years ago

Hello!

Thanks again for making the code available and for the clear documentation! A quick question about your new implementation of the paper "Joint dialog act segmentation and recognition in human conversations using attention to dialog context" (CSL 2019): what were your hardware/GPU specs, in particular GPU memory? I'm getting OOM during training even with a smaller batch size (20):

RuntimeError: CUDA out of memory. Tried to allocate 8.48 GiB (GPU 0; 11.91 GiB total capacity; 8.66 GiB already allocated; 2.31 GiB free; 8.85 GiB reserved in total by PyTorch)

For reference, I'm using PyTorch 1.7 and Python 3.6.5, and this is on a Titan Xp with 12 GB of memory. I'm wondering whether this is a hardware limitation or whether there's an implementation issue causing memory inefficiency under the newer PyTorch 1.7.

Thanks a lot!

ZHAOTING commented 3 years ago

Hi @trangham283, I believe I used a 12GB card (something like a 1080 or a Titan X). If you are using the same model hyperparameters, memory usage should depend mostly on your input. Since CtxED uses word-level attention, the number of tokens in the context has a big impact on its memory usage.
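
As a toy illustration (a sketch with made-up shapes and names, not code from this repo), with word-level attention every context token contributes a key/value, so the attention tensors grow with history length times utterance length:

```python
import torch

# Toy shapes for intuition only -- these names and numbers are not from the repo.
batch, hidden = 20, 512
history_len, tokens_per_utt = 3, 60

# With word-level attention, every token of every context utterance becomes a
# key/value, so the key tensor grows with history_len * tokens_per_utt.
ctx_keys = torch.randn(batch, history_len * tokens_per_utt, hidden)
query = torch.randn(batch, 1, hidden)                 # a single decoding step
scores = torch.bmm(query, ctx_keys.transpose(1, 2))   # (batch, 1, history_len * tokens_per_utt)
print(ctx_keys.shape, scores.shape)
```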

A few points you may want to check:

  1. Are you using the SwDA corpus? If not, what is the rough average sentence length?
  2. What is the history length?
  3. If memory usage is okay during training but becomes a problem during inference, it may be due to beam search, so you could try setting a smaller beam size.

I have tested the code on PyTorch 1.4 and it was fine. If things still don't work, could you please try an earlier PyTorch version?

trangham283 commented 3 years ago

Hi @ZHAOTING

Thanks for the quick reply!

  1. I am using the SwDA corpus.
  2. I am currently testing history length = 3.
  3. I am running into this problem during training.

I also just tried reverting to PyTorch 1.3 and am having the same problem.

I think my problem is that I am using GloVe embeddings of size 300 trained on the Gigaword corpus (400K vocab), whereas your default Twitter GloVe embeddings have a much larger vocab but a smaller dimension of 200.

I can try reducing the GloVe embedding size, but since my plan was to use BERT embeddings, that won't help with the memory issue. I saw that in your other tasks (generation) you are using RoBERTa, so I thought the codebase could in general handle larger pretrained embeddings.

Thanks again for all the help and information!

ZHAOTING commented 3 years ago

@trangham283 Oh, then the vocab size is absolutely the problem. Instead of reducing the embedding size, reducing the vocab size should work well.

Directly using BERT/RoBERTa embeddings (around 30,000+ vocab size and 768 embedding size) should be okay. If you want to use the current GloVe embeddings, just keep the most frequent words (e.g. the top 30,000) in the vocab and map the others to UNK.
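
For illustration, a minimal sketch of that kind of truncation (hypothetical variable names and helper functions, not the repo's preprocessing code): keep only the top-k most frequent words, build the embedding matrix for those, and map everything else to UNK.

```python
from collections import Counter

import numpy as np

# Hypothetical inputs: `corpus_tokens` is a flat list of training tokens and
# `glove` maps word -> vector of length emb_dim; neither name comes from the repo.
def build_vocab_and_embeddings(corpus_tokens, glove, top_k=30000, emb_dim=300):
    counts = Counter(corpus_tokens)
    kept = [w for w, _ in counts.most_common(top_k)]

    word2id = {"<unk>": 0}
    for w in kept:
        word2id[w] = len(word2id)

    # The embedding matrix only covers kept words; all OOV words share the <unk> row.
    emb = np.random.uniform(-0.1, 0.1, (len(word2id), emb_dim)).astype(np.float32)
    for w, i in word2id.items():
        if w in glove:
            emb[i] = glove[w]
    return word2id, emb

def tokens_to_ids(tokens, word2id):
    unk = word2id["<unk>"]
    return [word2id.get(t, unk) for t in tokens]
```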

Good luck.

trangham283 commented 3 years ago

I am actually just using the top 10k vocab; the 400k vocab is GloVe's default when collecting embeddings, as in this line, unless I'm misunderstanding.

ZHAOTING commented 3 years ago

I see, I will test the code again tomorrow. Thanks for the information.

ZHAOTING commented 3 years ago

Hi @trangham283, I have located the problem. The original code used sentence-level encodings of previous utterances as the contextual attention keys/values, but in this implementation I mistakenly used word-level encodings instead. The config can be changed by passing the argument --attn_type sent. I have also made a new commit that makes sentence-level the default attention type, with which GPU memory usage should be no more than 2GB.
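
To make the difference concrete (a sketch with assumed shapes, not the actual model code): sentence-level attention keeps one key/value per context utterance rather than one per context token, so the attention tensors shrink by roughly a factor of the average utterance length.

```python
import torch

# Assumed sizes for illustration only; names are not from the repo.
batch, query_len, hidden = 20, 60, 512
history_len, tokens_per_utt = 3, 60

query = torch.randn(batch, query_len, hidden)

# Word-level attention: one key per context token.
word_keys = torch.randn(batch, history_len * tokens_per_utt, hidden)
word_scores = torch.bmm(query, word_keys.transpose(1, 2))   # (20, 60, 180)

# Sentence-level attention (--attn_type sent): one key per context utterance.
sent_keys = torch.randn(batch, history_len, hidden)
sent_scores = torch.bmm(query, sent_keys.transpose(1, 2))   # (20, 60, 3)

print(word_scores.shape, sent_scores.shape)  # roughly tokens_per_utt times fewer score entries
```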

trangham283 commented 3 years ago

Awesome, thank you so much for looking into this! I'll close this issue.