Fireblossom opened 1 year ago
Hi,
thx again for another PR. It would be nice to use the official HF version for MAGMA. However, last time we tried to implement this we noticed a slight difference in model outputs which we could not really get to the bottom of.
I'm very careful with these kinds of changes, so it would be great if you could compare the logits for some example inputs before/after your change.
Best,
Constantin
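A minimal sketch of such a before/after logit check (the toy model below is a hypothetical stand-in for the actual old and new MAGMA builds; with the real models one would load both versions and feed the same token IDs):

```python
import torch
import torch.nn as nn

def max_logit_diff(model_a, model_b, input_ids):
    """Max absolute difference between two models' logits on the same input."""
    model_a.eval()
    model_b.eval()
    with torch.no_grad():
        return (model_a(input_ids) - model_b(input_ids)).abs().max().item()

# Toy stand-in for the real before/after models (hypothetical):
torch.manual_seed(0)
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 10))
input_ids = torch.randint(0, 10, (1, 5))
print(max_logit_diff(model, model, input_ids))  # identical weights -> 0.0
```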
Hi Constantin,
Thank you for reviewing my changes. Regarding the slight difference, it is hard for me to explain it from example inputs/outputs right now, but a change in the structure of the model may be the cause.
In the old HF version, GPT-Neo was used to simulate GPT-J, and the printed GPTNeoMLP shows no activation function; in the new HF version, a NewGELUActivation() module has been added.
before:
(0): GPTNeoBlock(
  (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  (attn): GPTNeoAttention(
    (attention): GPTNeoSelfAttention(
      (attn_dropout): Dropout(p=0, inplace=False)
      (resid_dropout): Dropout(p=0, inplace=False)
      (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
    )
  )
  (mlp): Sequential(
    (0): GPTNeoMLP(
      (c_fc): Linear(in_features=4096, out_features=16384, bias=True)
      (c_proj): Linear(in_features=16384, out_features=4096, bias=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (1): Adapter(
      (adapter): Sequential(
        (0): Linear(in_features=4096, out_features=1024, bias=True)
        (1): ReLU()
        (2): Linear(in_features=1024, out_features=4096, bias=True)
      )
    )
  )
)
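For reference, the Adapter printed above is a standard bottleneck adapter. A minimal sketch (the residual connection in `forward` is an assumption based on the MAGMA paper, since the module dump only shows the layers):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter matching the dump above: 4096 -> 1024 -> ReLU -> 4096."""
    def __init__(self, dim=4096, bottleneck=1024):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x):
        # residual add is an assumption (not visible in the printed structure)
        return x + self.adapter(x)

x = torch.randn(2, 8, 4096)
y = Adapter()(x)  # output keeps the hidden size, so it can sit after the MLP
```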
currently (without adapters added), GPT-Neo:
(0): GPTNeoBlock(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPTNeoAttention(
    (attention): GPTNeoSelfAttention(
      (attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_dropout): Dropout(p=0.0, inplace=False)
      (k_proj): Linear(in_features=768, out_features=768, bias=False)
      (v_proj): Linear(in_features=768, out_features=768, bias=False)
      (q_proj): Linear(in_features=768, out_features=768, bias=False)
      (out_proj): Linear(in_features=768, out_features=768, bias=True)
    )
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPTNeoMLP(
    (c_fc): Linear(in_features=768, out_features=3072, bias=True)
    (c_proj): Linear(in_features=3072, out_features=768, bias=True)
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.0, inplace=False)
  )
)
GPT-J:
(0): GPTJBlock(
  (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  (attn): GPTJAttention(
    (attn_dropout): Dropout(p=0.0, inplace=False)
    (resid_dropout): Dropout(p=0.0, inplace=False)
    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
  )
  (mlp): GPTJMLP(
    (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
    (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.0, inplace=False)
  )
)
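One caveat when reading these dumps (my observation, not verified against the old MAGMA code): `print(module)` only lists registered submodules, so an activation applied as a plain function in `forward` would not appear in the "before" printout even if it were actually applied. A minimal sketch of that effect:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPFunctional(nn.Module):
    """Activation called as a plain function: invisible in print(module)."""
    def __init__(self, d=8, h=32):
        super().__init__()
        self.c_fc = nn.Linear(d, h)
        self.c_proj = nn.Linear(h, d)

    def forward(self, x):
        return self.c_proj(F.gelu(self.c_fc(x), approximate="tanh"))

class MLPModule(nn.Module):
    """Same computation, but the activation is a registered submodule."""
    def __init__(self, d=8, h=32):
        super().__init__()
        self.c_fc = nn.Linear(d, h)
        self.act = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(h, d)

    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))

print("GELU" in repr(MLPFunctional()))  # False: no trace in the repr
print("GELU" in repr(MLPModule()))      # True: shows up as (act): GELU(...)
```

So the missing activation in the old dump does not by itself prove the old model skipped the activation; comparing actual outputs, as requested above, is the reliable check.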
Hope this helps you!
Best,
Changxu
Hmm this puzzles me a bit, but in any case, unless consistent behavior with the old version is ensured (e.g. by checking that all the hidden states are the same for a couple of example inputs) I will not merge these changes.
Let me know if you manage to do it and thx for the effort.
Best,
Constantin
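A sketch of such a layer-by-layer hidden-state check using forward hooks (toy models stand in for the two builds here; with HF models one could instead compare the tuples returned with `output_hidden_states=True`):

```python
import torch
import torch.nn as nn

def capture_outputs(model, x):
    """Record each top-level submodule's output via forward hooks."""
    outs = []
    hooks = [m.register_forward_hook(lambda _m, _i, o: outs.append(o.detach()))
             for m in model.children()]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return outs

torch.manual_seed(0)
old = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
new = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
new.load_state_dict(old.state_dict())  # make the toy "new" build match exactly

x = torch.randn(2, 4)
same = all(torch.equal(a, b)
           for a, b in zip(capture_outputs(old, x), capture_outputs(new, x)))
print(same)  # True when every intermediate activation matches
```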
Hello, did you manage to make this work?
Hi, I can confirm that the modified code runs and can be fine-tuned with the same method as in the MAGMA paper. As discussed above, the modified branch may produce outputs inconsistent with the checkpoint provided by this repo, so this PR will not be merged. I don't have much time to dive into this right now; maybe I will do it later.
Hi, thank you for the fast answer! Do you have a working checkpoint? The default one has some dimensionality differences, and I'd rather avoid copying and pasting the tensors by hand. Could you upload it somewhere?
The data I use for fine-tuning is from a completely different domain, so I'm afraid my checkpoint can't meet your needs right now.
ah, ok, thank you anyway
load the GPT-J checkpoint with a newer version of transformers