Your code looks correct.
The extrapolation version of Ernie is implemented by removing the absolute position embeddings from the text input and applying RoPE/ALiBi inside ErnieLayoutSelfAttention instead. This might harm performance in certain situations.
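In case it helps, here is a minimal, self-contained sketch of the rotary mechanism being described; this is illustrative code, not the actual ErnieLayoutSelfAttention implementation:

```python
# Minimal RoPE sketch (illustrative only, not the repo's actual code).
import torch

def build_rope_cache(seq_len, head_dim, base=10000.0):
    # One frequency per pair of channels.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)  # (seq_len, head_dim / 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (..., seq_len, head_dim); rotate each channel pair by a
    # position-dependent angle instead of adding a position embedding.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 8, 128, 64)  # (batch, heads, seq, head_dim)
cos, sin = build_rope_cache(seq_len=128, head_dim=64)
q_rot = apply_rope(q, cos, sin)  # rotate queries; keys get the same treatment
```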
I recommend experimenting with other extrapolation configurations. Consider giving DynamicNTKRoPE or ALiBi a try.
```python
def set_config_for_extrapolation(config):
    # RoPE config
    config.use_rope_attention_bias = True
    config.rope_type = "dynamic"  # one of "dynamic", "linear", or "mixed_base"
    # With rope_scaling_factor == 1.0 this is NTK-scaled RoPE; with a
    # scaling factor > 1 it becomes dynamically NTK-scaled RoPE.
    config.rope_scaling_factor = 1.0
    config.fix_base = False  # see https://normxu.github.io/Rethinking-Rotary-Position-Embedding-2/
    config.b = 0.6           # see https://normxu.github.io/Rethinking-Rotary-Position-Embedding-2/

    # ALiBi for the encoder: https://github.com/lucidrains/x-transformers/pull/88
    config.use_alibi = False
    config.learnable_alibi = False

    # Attention scale: https://openreview.net/forum?id=qc9O2EtrMI-
    config.use_entropy_scale = True

    # Others
    config.has_relative_attention_bias = False
    config.consequent_visual_bias = True
```
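For reference, this is roughly how I would expect the function to be wired up; the import path, class names, and checkpoint path below are assumptions, so adapt them to your setup:

```python
# Hypothetical usage sketch; the import path, class names, and checkpoint
# path are assumptions, so substitute whatever your setup actually uses.
from networks import ErnieLayoutConfig, ErnieLayoutForTokenClassification

config = ErnieLayoutConfig.from_pretrained("path/to/ernie-layout")
set_config_for_extrapolation(config)
model = ErnieLayoutForTokenClassification.from_pretrained(
    "path/to/ernie-layout", config=config
)
# Then fine-tune as usual on sequences longer than the original limit.
```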
By the way, what is your downstream task? I tried token classification with the extrapolation Ernie and the results look good.
Alright, thanks, I will do some runs with other options and report back.
My task is simultaneous token classification and relation extraction. I can try fine-tuning on just token classification, but I suspect the results will be similar. Maybe it's just a quirk of my specific data domain.
@rbvh Also, you may want to try removing the visual position embeddings from the embedding layer, just like I removed the text position embeddings to extend the context length.
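Something like the sketch below is what I mean; the module and attribute names here are made up for illustration and are not the repo's actual embedding code:

```python
import torch
import torch.nn as nn

class VisualEmbeddingsNoAbsPos(nn.Module):
    """Sketch: keep the 2-D bounding-box embeddings for visual tokens but
    drop the additive 1-D absolute position table (names are hypothetical)."""

    def __init__(self, hidden_size: int, max_coord: int = 1024):
        super().__init__()
        self.x_pos = nn.Embedding(max_coord, hidden_size)
        self.y_pos = nn.Embedding(max_coord, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, visual_feat: torch.Tensor, bbox: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, seq, 4) integer coordinates [x0, y0, x1, y1]
        emb = (visual_feat
               + self.x_pos(bbox[..., 0]) + self.y_pos(bbox[..., 1])
               + self.x_pos(bbox[..., 2]) + self.y_pos(bbox[..., 3]))
        # Note: no absolute 1-D position embedding term here; relative
        # position comes from RoPE/ALiBi in self-attention instead.
        return self.norm(emb)
```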
I even tried removing all visual components from the model and using only text input for the token classification task. Surprisingly, it still worked well on my dataset.
I will update the code at my earliest convenience.
Just reporting back here: I tried `config.rope_type = "dynamic"` and `config.use_alibi = True`, and the results are similar. It might just be that my use case does not benefit much from a longer context length.
@NormXU Do you have any reason to believe that removing the visual position embeddings could offer better performance? They are completely independent of the context length of the textual part of the model, right?
@rbvh Yes, as you pointed out, the visual input and text input are quite independent of each other, so I think it is safe to remove it. Besides, I want to augment my training dataset by randomly adjusting the coordinates of each bounding box, and this augmentation method cannot work with visual inputs.
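For concreteness, a minimal sketch of that augmentation; the shift range and coordinate bound are arbitrary choices, not values from the repo:

```python
import torch

def jitter_bboxes(bbox: torch.Tensor, max_shift: int = 5,
                  coord_max: int = 1000) -> torch.Tensor:
    """Randomly shift each bounding-box coordinate by up to +/- max_shift,
    clamped to the [0, coord_max] grid. bbox: (..., 4) integer tensor."""
    noise = torch.randint(-max_shift, max_shift + 1, bbox.shape,
                          device=bbox.device)
    return (bbox + noise).clamp(0, coord_max)
```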
First off, thank you very much for making this repo. It has been very helpful for me.
I've been trying your modifications to extend the max sequence length and just want to make sure I'm doing the right thing. I use my own model that looks something like
With the new model, I just replace with
And then I just fine-tune on data with longer sequences than before. Is that the right way to use it? I ask because the extrapolation model performs significantly worse in my use case than the regular one, at both 512 and 1024 sequence lengths.
Thanks