NormXU / ERNIE-Layout-Pytorch

An unofficial PyTorch implementation of ERNIE-Layout, which was originally released through PaddleNLP.
http://arxiv.org/abs/2210.06155
MIT License

Using the new extended seq length #20

Closed · rbvh closed this 9 months ago

rbvh commented 1 year ago

First off, thank you very much for making this repo. It has been very helpful for me.

I've been trying your modifications to extend the max seq length, and I just want to make sure I'm doing the right thing. I use my own model that looks something like:

class ErnieLayoutCustom(ErnieLayoutPretrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.ernie_layout = ErnieLayoutModel(config)
        self.init_weights()

With the new model, I just replace it with:

from ernie.modeling_erine_layout_extrapolation import ErnieLayoutPretrainedModel as ErnieLayoutPretrainedModelExtrapolation
from ernie.modeling_erine_layout_extrapolation import ErnieLayoutModel as ErnieLayoutModelExtrapolation

class ErnieLayoutCustomExtrapolation(ErnieLayoutPretrainedModelExtrapolation):
    def __init__(self, config):
        super().__init__(config)
        set_config_for_extrapolation(config)
        self.ernie_layout = ErnieLayoutModelExtrapolation(config)
        self.init_weights()

And then I just fine-tune on data with longer sequences than before. Is that the right way to use it? I ask because the extrapolation model performs significantly worse in my use-case than the regular one, both with 512 and 1024 sequence lengths.

Thanks

NormXU commented 1 year ago

Your code looks correct.

The extrapolation version of ERNIE-Layout is implemented by removing the absolute position_embeddings of the text input and using RoPE/ALiBi in ErnieLayoutSelfAttention instead. This might harm performance in certain situations.
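
For intuition, here is a minimal, generic sketch of how RoPE rotates the query/key vectors by position-dependent angles instead of adding a position embedding to the input (illustrative only; the actual implementation in ErnieLayoutSelfAttention differs in detail):

import torch

def rotate_half(x):
    # Split the last dimension in two and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, base=10000.0):
    # q, k: (batch, heads, seq_len, head_dim). Position information enters
    # through the rotation angle, so no absolute position embedding is added.
    seq_len, dim = q.shape[-2], q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)         # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, dim)
    cos, sin = emb.cos(), emb.sin()
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot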

I recommend experimenting with other extrapolation configurations. Consider giving DynamicNTKRoPE or ALiBi a try:

def set_config_for_extrapolation(config):
    # RoPE config
    config.use_rope_attention_bias = True
    config.rope_type = "dynamic"  # "dynamic", "linear", or "mixed_base"
    # when rope_scaling_factor == 1.0, RoPE is NTKScaleRoPE; when rope_scaling_factor > 1, it becomes DynamicNTKScaleRoPE
    config.rope_scaling_factor = 1.0
    config.fix_base = False  # please refer to https://normxu.github.io/Rethinking-Rotary-Position-Embedding-2/
    config.b = 0.6  # please refer to https://normxu.github.io/Rethinking-Rotary-Position-Embedding-2/

    # ALiBi for the encoder, see https://github.com/lucidrains/x-transformers/pull/88
    config.use_alibi = False
    config.learnable_alibi = False

    # attention scale
    config.use_entropy_scale = True  # https://openreview.net/forum?id=qc9O2EtrMI-

    # Others
    config.has_relative_attention_bias = False
    config.consequent_visual_bias = True

By the way, what is your downstream task? I tried token classification with the extrapolation ERNIE and the results look good.

rbvh commented 1 year ago

Alright, thanks, I will do some runs with other options and report back.

My task is simultaneous token classification and relation extraction. I can try fine-tuning on just token classification, but I suspect the results will be similar. Maybe just a quirk of my specific data domain.

NormXU commented 1 year ago

@rbvh Also, you may want to try removing the visual position embedding from the embedding layer, just like I removed the text position embedding to extend the context length.

I even tried removing all visual components from the model and using only text input on the token classification task. Surprisingly, it still worked well on my dataset.
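
Schematically, the change mirrors the text branch; a rough sketch with purely illustrative attribute names (not the repo's actual ones):

def visual_embedding(self, image_features, bbox):
    # Project the visual features and add only the 2D spatial (bbox) embedding;
    # the absolute 1D position embedding is dropped, as in the text branch.
    emb = self.visual_proj(image_features) + self.spatial_embedding(bbox)
    # emb = emb + self.visual_position_embedding(position_ids)  # removed
    return self.dropout(self.layer_norm(emb))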

I will update the code at my earliest convenience.

rbvh commented 1 year ago

Just reporting back here: I tried config.rope_type = "dynamic" and config.use_alibi = True and the results are similar. It might just be that my use-case does not benefit much from longer context length.

@NormXU Do you have any reason to believe that removing the visual position embeddings could offer better performance? They are completely independent of the context length of the textual part of the model, right?

NormXU commented 1 year ago

@rbvh Yes, as you pointed out, the visual input and the text input are quite independent of each other, so I think it is safe to remove it. Besides, I want to augment my training dataset by randomly adjusting the coordinates of each bounding box. This augmentation cannot work with visual inputs, since the image features would no longer line up with the shifted boxes.
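
A minimal sketch of that kind of augmentation (the jitter size and the 0-1000 layout coordinate range are assumptions):

import random

def jitter_bboxes(bboxes, max_shift=5, lo=0, hi=1000):
    # Randomly shift each (x0, y0, x1, y1) box by a few layout units, clamped
    # to the coordinate range; only safe when no pixel-level visual input has
    # to stay aligned with the boxes.
    out = []
    for x0, y0, x1, y1 in bboxes:
        dx = random.randint(-max_shift, max_shift)
        dy = random.randint(-max_shift, max_shift)
        clamp = lambda v: max(lo, min(hi, v))
        out.append((clamp(x0 + dx), clamp(y0 + dy), clamp(x1 + dx), clamp(y1 + dy)))
    return out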