Fireblossom opened 1 year ago
Hi,
thx again for another PR. It would be nice to use the official HF version for MAGMA. However, last time we tried to implement this we noticed a slight difference in model outputs which we could not really get to the bottom of.
I'm very careful with these kinds of changes, so it would be great if you could compare the logits for some example inputs before/after your change.
Best,
Constantin
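A minimal sketch of such a before/after logit check (the toy model below is a hypothetical stand-in for the actual old and new MAGMA builds; with the real models one would load both versions and feed the same token IDs):

```python
import torch
import torch.nn as nn

def max_logit_diff(model_a, model_b, input_ids):
    """Max absolute difference between two models' logits on the same input."""
    model_a.eval()
    model_b.eval()
    with torch.no_grad():
        return (model_a(input_ids) - model_b(input_ids)).abs().max().item()

# Toy stand-in for the real before/after models (hypothetical):
torch.manual_seed(0)
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 10))
input_ids = torch.randint(0, 10, (1, 5))
print(max_logit_diff(model, model, input_ids))  # identical weights -> 0.0
```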
Hi Constantin,
Thank you for reviewing my changes. Regarding the slight difference, it is hard for me to explain it from example inputs/outputs right now, but a change in the structure of the model may be the cause.
In the old HF version, GPT-Neo was used to simulate GPT-J, and the printed GPTNeoMLP shows no activation function; in the new HF version, a NewGELUActivation() module has been added.
before:
(0): GPTNeoBlock(
  (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  (attn): GPTNeoAttention(
    (attention): GPTNeoSelfAttention(
      (attn_dropout): Dropout(p=0, inplace=False)
      (resid_dropout): Dropout(p=0, inplace=False)
      (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
    )
  )
  (mlp): Sequential(
    (0): GPTNeoMLP(
      (c_fc): Linear(in_features=4096, out_features=16384, bias=True)
      (c_proj): Linear(in_features=16384, out_features=4096, bias=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (1): Adapter(
      (adapter): Sequential(
        (0): Linear(in_features=4096, out_features=1024, bias=True)
        (1): ReLU()
        (2): Linear(in_features=1024, out_features=4096, bias=True)
      )
    )
  )
)
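For reference, the Adapter printed above is a standard bottleneck adapter. A minimal sketch (the residual connection in `forward` is an assumption based on the MAGMA paper, since the module dump only shows the layers):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter matching the dump above: 4096 -> 1024 -> ReLU -> 4096."""
    def __init__(self, dim=4096, bottleneck=1024):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x):
        # residual add is an assumption (not visible in the printed structure)
        return x + self.adapter(x)

x = torch.randn(2, 8, 4096)
y = Adapter()(x)  # output keeps the hidden size, so it can sit after the MLP
```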
currently (without adapters added), GPT-Neo:
(0): GPTNeoBlock(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPTNeoAttention(
    (attention): GPTNeoSelfAttention(
      (attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_dropout): Dropout(p=0.0, inplace=False)
      (k_proj): Linear(in_features=768, out_features=768, bias=False)
      (v_proj): Linear(in_features=768, out_features=768, bias=False)
      (q_proj): Linear(in_features=768, out_features=768, bias=False)
      (out_proj): Linear(in_features=768, out_features=768, bias=True)
    )
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPTNeoMLP(
    (c_fc): Linear(in_features=768, out_features=3072, bias=True)
    (c_proj): Linear(in_features=3072, out_features=768, bias=True)
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.0, inplace=False)
  )
)
GPT-J:
(0): GPTJBlock(
  (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  (attn): GPTJAttention(
    (attn_dropout): Dropout(p=0.0, inplace=False)
    (resid_dropout): Dropout(p=0.0, inplace=False)
    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
  )
  (mlp): GPTJMLP(
    (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
    (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.0, inplace=False)
  )
)
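One caveat when reading these dumps (my observation, not verified against the old MAGMA code): `print(module)` only lists registered submodules, so an activation applied as a plain function in `forward` would not appear in the "before" printout even if it were actually applied. A minimal sketch of that effect:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPFunctional(nn.Module):
    """Activation called as a plain function: invisible in print(module)."""
    def __init__(self, d=8, h=32):
        super().__init__()
        self.c_fc = nn.Linear(d, h)
        self.c_proj = nn.Linear(h, d)

    def forward(self, x):
        return self.c_proj(F.gelu(self.c_fc(x), approximate="tanh"))

class MLPModule(nn.Module):
    """Same computation, but the activation is a registered submodule."""
    def __init__(self, d=8, h=32):
        super().__init__()
        self.c_fc = nn.Linear(d, h)
        self.act = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(h, d)

    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))

print("GELU" in repr(MLPFunctional()))  # False: no trace in the repr
print("GELU" in repr(MLPModule()))      # True: shows up as (act): GELU(...)
```

So the missing activation in the old dump does not by itself prove the old model skipped the activation; comparing actual outputs, as requested above, is the reliable check.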
Hope this helps you!
Best,
Changxu
Hmm this puzzles me a bit, but in any case, unless consistent behavior with the old version is ensured (e.g. by checking that all the hidden states are the same for a couple of example inputs) I will not merge these changes.
Let me know if you manage to do it and thx for the effort.
Best,
Constantin
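A sketch of such a layer-by-layer hidden-state check using forward hooks (toy models stand in for the two builds here; with HF models one could instead compare the tuples returned with `output_hidden_states=True`):

```python
import torch
import torch.nn as nn

def capture_outputs(model, x):
    """Record each top-level submodule's output via forward hooks."""
    outs = []
    hooks = [m.register_forward_hook(lambda _m, _i, o: outs.append(o.detach()))
             for m in model.children()]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return outs

torch.manual_seed(0)
old = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
new = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
new.load_state_dict(old.state_dict())  # make the toy "new" build match exactly

x = torch.randn(2, 4)
same = all(torch.equal(a, b)
           for a, b in zip(capture_outputs(old, x), capture_outputs(new, x)))
print(same)  # True when every intermediate activation matches
```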
Hello, did you manage to make this work?
Hi, I can confirm that the modified code runs and can be fine-tuned with the same method as in the MAGMA paper. As discussed above, the modified branch may produce outputs inconsistent with the checkpoint provided by this repo, so this PR will not be merged. I don't have much time to dive into this right now; maybe I will do it later.
Hi, thank you for the fast answer! Do you have a working checkpoint? The default one has some dimensionality differences, and I'd rather avoid copying and pasting the tensors by hand. Could you upload it somewhere?
The data I use for fine-tuning is from a completely different domain, so I'm afraid my checkpoint can't meet your needs right now.
ah, ok, thank you anyway
load the GPT-J checkpoint with a newer version of transformers