allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

longformer speed compared to bert model #175

Open · gkim89 opened this issue 3 years ago

gkim89 commented 3 years ago

We are trying to use Longformer and BERT models for multi-label classification of different documents.

When we use the BERT model (BertForSequenceClassification) with max length 512 and batch size 8, each epoch takes approximately 30 minutes.

When we use Longformer (LongformerForSequenceClassification with 'allenai/longformer-base-4096' and gradient_checkpointing=True) with max length 4096, batch size 1, and gradient accumulation over 8 steps, each epoch takes approximately 12 hours.
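Roughly, our Longformer setup looks like the sketch below (simplified; `texts`, `NUM_LABELS`, and `train_dataset` are placeholders for our data, and the exact Trainer arguments are only an approximation of what we run):

```python
from transformers import (
    LongformerForSequenceClassification,
    LongformerTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Multi-label head: problem_type switches the loss to BCEWithLogitsLoss.
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=NUM_LABELS,  # placeholder for our label count
    problem_type="multi_label_classification",
)
model.gradient_checkpointing_enable()  # the gradient_checkpointing=True part

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=4096)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # batch size 1
    gradient_accumulation_steps=8,   # gradient accumulation over 8 steps
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```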

Is this reasonable, or are we missing something? Is there anything we can try to make training faster?

krstp commented 1 month ago

It is already a few years down the line, but I am seeing exactly the same behavior. Small datasets yield an impressive speedup, but much larger inputs slow training to a crawl.

If I can ask:


In my case I am trimming the input to 4096 tokens, using FP16, and feeding the data in managed batches... I/O for really large inputs seems to be choking the data loading on every batch. I can track it in the NVIDIA GPU utilization: whenever a new batch is being prepared, GPU utilization drops to almost nothing (1%), and when training resumes it jumps back to nearly full load (100%).
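What I mean by managed batches is roughly the following: everything pre-tokenized up front and served through a DataLoader with worker processes and pinned memory, so tokenization and I/O do not stall the GPU (a sketch; `encodings` and `labels` are placeholders for my actual pipeline):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class EncodedDataset(Dataset):
    """Wraps pre-tokenized tensors so no tokenization happens inside the training loop."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of tensors: input_ids, attention_mask
        self.labels = labels        # float tensor for multi-label targets

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

loader = DataLoader(
    EncodedDataset(encodings, labels),
    batch_size=2,
    shuffle=True,
    num_workers=4,      # workers prepare the next batch while the GPU is busy
    pin_memory=True,    # allows fast asynchronous host-to-device copies
    prefetch_factor=2,  # batches each worker keeps ready in advance
)
```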

The last idea I have is to develop a custom batch feeder 🥵, which has worked well for me on an unrelated case with large inputs using concurrent.futures, although in theory a functools.partial on top of the tokenizer feed should suffice... in theory. In practice, at the moment it only helps on a small dataset: there I get a ~2.6x speed-up (batch size 16 vs. 1), but the gain does not seem to carry over to the much larger dataset, which is roughly 3500x bigger.
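The functools.partial-over-tokenizer idea amounts to something like this (a rough sketch; `texts` is a placeholder for the raw documents, and it assumes the fast tokenizer pickles cleanly into the worker processes):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Freeze the tokenizer arguments once; the partial is then mapped over text chunks.
encode = partial(tokenizer, truncation=True, padding="max_length", max_length=4096)

# texts: list[str] -- placeholder for the raw documents.
chunks = [texts[i:i + 256] for i in range(0, len(texts), 256)]

with ProcessPoolExecutor(max_workers=4) as pool:
    encoded_chunks = list(pool.map(encode, chunks))  # one BatchEncoding per chunk
```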

The other thing I tried is feeding the batches to the GPU asynchronously; that helps somewhat, as does FP16.
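By asynchronously I mean pinned-memory batches copied with non_blocking=True plus FP16 autocast, roughly like this (a sketch; `model` and `loader` are the objects from the snippets above):

```python
import torch

device = torch.device("cuda")
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler()  # FP16 loss scaling

model.train()
for batch in loader:
    # With pin_memory=True in the DataLoader, these copies can overlap with compute.
    batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # FP16 forward/backward
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```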

My other suspicion is that Longformer may not be optimized to run on an NVIDIA A100... is it? I also toyed with an NVIDIA T4, but there I am limited to a batch size of 1 (effectively stochastic gradient descent).

BTW: I also tried using an SSD for faster data handling, but that does not seem to be the main choke point at the moment.

I might spend some time with the NVIDIA profiler, which should show what else is stealing processing time. Regardless, I already know my input dataset is in some ways overkill, and combined with the larger context window (4096) it adds meaningful overhead.
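Before going to the NVIDIA profilers, torch.profiler already gives a usable breakdown of where each step spends its time; something along these lines (a sketch over a few training steps, reusing `model` and `loader` from above):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    record_shapes=True,
) as prof:
    for step, batch in enumerate(loader):
        if step >= 5:
            break
        batch = {k: v.to("cuda", non_blocking=True) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        prof.step()  # advance the profiler schedule each iteration

# Sort by GPU time to see whether the kernels or the data pipeline dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```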

To repeat myself, I am still tempted to experiment more, since a small dataset on the A100 yields a good speedup... but what does "small dataset" really mean in the face of reality? :)