I used bert-base-cased as the base model and retrained it into my long-bert-512.
long-bert settings (rough config sketch below):
attention window (same for every layer): 16
max_pos: 512
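For reference, this is roughly what I mean by those settings, as a minimal sketch assuming a Longformer-style "convert BERT to long" recipe (the attribute names here are only illustrative, not my exact conversion script):

```python
from transformers import BertConfig

# Minimal sketch of the settings above, assuming a Longformer-style
# conversion of bert-base-cased; attribute names are illustrative.
config = BertConfig.from_pretrained("bert-base-cased")
config.attention_window = [16] * config.num_hidden_layers  # same window in every layer
config.max_position_embeddings = 512                       # max_pos
```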
My sequences are long enough to fill the 512 positions. I expected O(512 × 16) sliding-window attention to be faster than O(512 × 512) full self-attention, but I find that the inference time of long-bert-512 is higher than that of bert-base-cased.
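This is a minimal sketch of how I compare inference time (batch size 1, fixed sequence length; "./long-bert-512" is a placeholder for my local converted checkpoint):

```python
import time
import torch
from transformers import AutoModel

# Minimal timing sketch: average forward-pass time over n_runs.
# "./long-bert-512" is a placeholder path for the converted checkpoint.
def mean_inference_time(name_or_path, seq_len=512, n_runs=20):
    model = AutoModel.from_pretrained(name_or_path).eval()
    input_ids = torch.randint(1000, 2000, (1, seq_len))  # dummy token ids
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        model(input_ids=input_ids, attention_mask=attention_mask)  # warm-up
        start = time.perf_counter()
        for _ in range(n_runs):
            model(input_ids=input_ids, attention_mask=attention_mask)
    return (time.perf_counter() - start) / n_runs

print("bert-base-cased:", mean_inference_time("bert-base-cased"))
print("long-bert-512:  ", mean_inference_time("./long-bert-512"))
```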
Did I miss something?