longformer speed compared to bert model

It is already a few years down-the-line, but I am seeing exactly the same behavior. Small datasets yield amazing speedup, but much larger inputs choke it impossibly.

If I can ask:

What was the task you were trying to use it on?
How large was your dataset? Number or rows
How large was the core input (text) in the original input you were feeding in?

In my case I trimming input to 4096 tokens, also going with FP16, and feeding data into managed batches... I/O for really large input seems to be choking the data-load with every batch – I can track it in nvidia GPU workload, whenever it batches the GPU workload drops to really low percentage (1%), then when it retrains the workload jumps to almost full load (100%).

Last idea I have is to develop custom batch feeder 🥵, which works well on different large feed unrelated case with concurrent.features but in theory the functools partial on top of tokenizer feed should suffice... in theory... Nonetheless, the practice atm only provides benefit on small dataset, for example on small dataset I get x2.6 speed-up (with batch 16 vs 1)... but it does not seem to apply on much larger dataset that is approx. x3500 larger.

The other thing I tried is to feed the batches to GPU in an async fashion; surely it is somewhat helping including fp16.

My other suspicion could be LongFormer is not optimized to run on Nvidia A100... is it? I did toy on Nvidua T4 as well, but here it limits me to stochastic gradient descent (batch of 1).

BTW: I did also try to use SSD for faster data manipulation, but this does not seem to be the main choke-point atm.

I might use some time to investigate nvidia-profiler, which should provide what else might be stealing the processing time, but regardless of that I already know my input dataset is in some way an overkill and combined with larger context window (4096) it adds some meaningful overhead.

To repeat myself, I am still tempted to experiment more as small dataset with A100 yields some good speedup... but what does the small dataset really mean in face of reality? :)

allenai / longformer

longformer speed compared to bert model #175