AndriyMulyar / bert_document_classification

architectures and pre-trained models for long document classification.

Backpropagation on chunks? #14

Open vr25 opened 4 years ago

vr25 commented 4 years ago

Hi,

When the document chunks are fed to the data-parallel model, how is the loss backpropagated? Is it backpropagated separately for every chunk?

Also, do you unfreeze BERT and fine-tune it for the classification task?

Thank you!

AndriyMulyar commented 4 years ago
  1. Yes, separate for every chunk.
  2. In our datasets we found it sufficient to fine-tune only the final transformer layer.
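
For reference, freezing everything except the last encoder layer looks roughly like the sketch below. It assumes a HuggingFace-style `BertModel`; the class and attribute names are illustrative assumptions, not the exact code in this repo.

```python
import torch
from transformers import BertModel  # assumption: HuggingFace-style BERT, not this repo's own classes

bert = BertModel.from_pretrained("bert-base-uncased")

# Freeze every BERT parameter first...
for param in bert.parameters():
    param.requires_grad = False

# ...then unfreeze only the final transformer (encoder) layer.
for param in bert.encoder.layer[-1].parameters():
    param.requires_grad = True

# Only the unfrozen parameters (plus whatever document-level head sits on
# top of the chunk representations) are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in bert.parameters() if p.requires_grad), lr=2e-5
)
```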


vr25 commented 4 years ago

Could you explain in more detail how the loss is calculated separately for every chunk? The entire document has a single target label, so as far as I understand, the loss would be calculated against that document-level target, right? Please let me know if I am missing something.

Also, what is the maximum number of chunks per document across the entire dataset?

The default config has bert_batch_size=7, but some of my documents have as many as 125 chunks. In such cases, if I set bert_batch_size to 125, I run into a CUDA OOM error.
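
As a possible workaround, I was thinking of keeping bert_batch_size small and looping over each document's chunks in sub-batches, something like the hypothetical helper below (this assumes a HuggingFace-style BERT interface and is not code from this repo):

```python
import torch

def encode_chunks(bert, input_ids, attention_mask, bert_batch_size=7):
    """Encode one document's chunks in sub-batches to bound peak GPU memory.

    input_ids / attention_mask: (num_chunks, seq_len) tensors for a single
    document (num_chunks can be ~125 here), already on the same device as bert.
    """
    cls_vectors = []
    for start in range(0, input_ids.size(0), bert_batch_size):
        ids = input_ids[start:start + bert_batch_size]
        mask = attention_mask[start:start + bert_batch_size]
        # [CLS] vector for each chunk in this sub-batch.
        out = bert(input_ids=ids, attention_mask=mask).last_hidden_state[:, 0, :]
        cls_vectors.append(out)
    # (num_chunks, hidden_size) chunk representations for the document-level classifier.
    return torch.cat(cls_vectors, dim=0)
```

I realize this still keeps every sub-batch's activations around for backprop, so it may not be enough on its own unless most chunks are run under torch.no_grad().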

Any suggestions for this?

Thanks!