I used bert-base as the base model and retrained my long-bert with a maximum sequence length of 1536. Then I compared its inference speed against the original bert-base extended to 1536 positions (bert-base-1536). After extensive testing, I found that long-bert-1536 and bert-base-1536 have essentially the same inference speed. I saw a similar issue in #106, but all of my test sequences are longer than 1000 tokens. I would expect window attention to be faster than full self-attention because it requires less computation (see the rough cost estimate after the settings), so why does this happen? Here are my settings:
Attention window (same for every layer): 512
Global attention: only on the [CLS] token
Inference device: CPU
Task: text classification
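For reference, this is the back-of-the-envelope attention-cost comparison behind my expectation (my own rough estimate with n = 1536 and w = 512, not measured numbers):

$$
\text{full self-attention} \propto n^2 = 1536^2 \approx 2.36\text{M score pairs}, \qquad
\text{sliding window} \propto n \cdot w = 1536 \times 512 \approx 0.79\text{M score pairs}
$$

So the attention-score computation alone should be roughly 3x cheaper per head per layer. I realize the QKV projections and feed-forward layers are identical in both models, so I wouldn't expect a full 3x end-to-end speedup, but I did expect at least some measurable difference.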
By the way, does the size of the attention window affect inference speed? I tested different window sizes, but the speed was basically the same.
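For completeness, this is roughly how I am timing the two models (a minimal sketch of my benchmark; the checkpoint paths and the use of `LongformerForSequenceClassification` for the converted model are placeholders for my actual loading code):

```python
import time

import torch
from transformers import BertForSequenceClassification, LongformerForSequenceClassification

SEQ_LEN = 1536
N_RUNS = 20

# Placeholder paths: substitute the actual bert-base-1536 and long-bert-1536 checkpoints.
bert = BertForSequenceClassification.from_pretrained("path/to/bert-base-1536").eval()
longbert = LongformerForSequenceClassification.from_pretrained("path/to/long-bert-1536").eval()

# Dummy batch with one full-length sequence (my real test data is all > 1000 tokens).
input_ids = torch.randint(1000, 20000, (1, SEQ_LEN))
attention_mask = torch.ones(1, SEQ_LEN, dtype=torch.long)

# Global attention only on the [CLS] token, matching the setting above.
global_attention_mask = torch.zeros(1, SEQ_LEN, dtype=torch.long)
global_attention_mask[:, 0] = 1

def bench(fn):
    fn()  # warm-up run
    start = time.perf_counter()
    for _ in range(N_RUNS):
        fn()
    return (time.perf_counter() - start) / N_RUNS

with torch.no_grad():
    t_bert = bench(lambda: bert(input_ids=input_ids, attention_mask=attention_mask))
    t_long = bench(lambda: longbert(
        input_ids=input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
    ))

print(f"bert-base-1536: {t_bert * 1000:.1f} ms/forward")
print(f"long-bert-1536: {t_long * 1000:.1f} ms/forward")
```

With this loop on CPU, the two averages come out almost identical for me.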