I used bert-base-cased as the base model and retrained it into my long-bert-512.
long-bert settings (rough config sketch below):
attention window (same for every layer): 16
max_pos: 512
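For reference, this is roughly what I mean by those settings, as a minimal sketch assuming a Longformer-style "convert BERT to long" recipe (the attribute names here are only illustrative, not my exact conversion script):

```python
from transformers import BertConfig

# Minimal sketch of the settings above, assuming a Longformer-style
# conversion of bert-base-cased; attribute names are illustrative.
config = BertConfig.from_pretrained("bert-base-cased")
config.attention_window = [16] * config.num_hidden_layers  # same window in every layer
config.max_position_embeddings = 512                       # max_pos
```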
My sequences are long enough to fill the 512 positions. I expected O(512 × 16) sliding-window attention to be faster than O(512 × 512) full self-attention, but I find that the inference time of long-bert-512 is higher than that of bert-base-cased.
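This is a minimal sketch of how I compare inference time (batch size 1, fixed sequence length; "./long-bert-512" is a placeholder for my local converted checkpoint):

```python
import time
import torch
from transformers import AutoModel

# Minimal timing sketch: average forward-pass time over n_runs.
# "./long-bert-512" is a placeholder path for the converted checkpoint.
def mean_inference_time(name_or_path, seq_len=512, n_runs=20):
    model = AutoModel.from_pretrained(name_or_path).eval()
    input_ids = torch.randint(1000, 2000, (1, seq_len))  # dummy token ids
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        model(input_ids=input_ids, attention_mask=attention_mask)  # warm-up
        start = time.perf_counter()
        for _ in range(n_runs):
            model(input_ids=input_ids, attention_mask=attention_mask)
    return (time.perf_counter() - start) / n_runs

print("bert-base-cased:", mean_inference_time("bert-base-cased"))
print("long-bert-512:  ", mean_inference_time("./long-bert-512"))
```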
Did I miss something?