TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/
MIT License

[Bug Report] Qwen model implementation is too inaccurate #683

Open bryce13950 opened 1 month ago

bryce13950 commented 1 month ago

The whole Qwen model family seems to be pretty inaccurate. I have not done complete benchmarks to determine where the issue is yet; that still needs to be done to find the specific area causing the error. This is probably due to einsum usage and a slight inaccuracy relative to the Transformers implementation. To solve this, we need to remove any potentially troublesome einsums in the model and verify that the components used have implementations similar to Transformers, which may result in the creation of more components in TransformerLens.

Describe the bug

The output is currently switching languages in what seems to be all models. I tested three different models and found that when the input is in English, the output is sometimes a bit of nonsense, often with some Chinese mixed in. I then generated a bit in Chinese, which resulted in Japanese kanji being generated. This is particularly interesting, since the characters I was using exist in both Chinese and Japanese, but if the model mistook my input for Japanese, it should still have generated the same writing style.

Code example

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained_no_processing(
    "Qwen/Qwen-1_8B-Chat",
    fp32=True,  # Qwen-specific loading flag (Qwen does not respect torch_dtype; see below)
    dtype=torch.float32,
)
model.generate(
    "hello my name is ",
    verbose=False,
)

System Info

This was found in Colab using various versions of TransformerLens 2.x and 1.x.


mntss commented 1 month ago

I think the problem you're seeing is caused by the prompt formatting and not the implementation differences.

I compared the TL and HF models, and while there are some small differences in activations, the logit differences seem negligible (max 2.62e-06 diff in softmax outputs; originally reported as 0.0009 before the dtype fix noted in a later comment).

This model uses the ChatML template for its inputs; for your example, the prompt should be something like this: '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhello my name is <|im_end|>\n<|im_start|>assistant\n'

The example above, on the other hand, will use: <|endoftext|>hello my name is

I noticed that TL will fall back to prepending the EOS token if a BOS token is absent, which seems incorrect in this case.
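For reference, a minimal sketch of generating with the ChatML-formatted prompt instead of the raw string (this assumes the same model object as in the bug report, and builds the prompt string by hand rather than via any chat-template helper):

# Sketch: generate with the ChatML prompt format described above.
chatml_prompt = (
    "<|im_start|>system\n<|im_end|>\n"
    "<|im_start|>user\nhello my name is <|im_end|>\n"
    "<|im_start|>assistant\n"
)

output = model.generate(
    chatml_prompt,
    prepend_bos=False,  # avoid TL prepending an EOS/BOS token to the formatted prompt
    verbose=False,
)
print(output)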

Code:

import torch
from torch.nn import functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformer_lens import HookedTransformer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    trust_remote_code=True,
    add_bos_token=True,
)
model_hf = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    trust_remote_code=True,
    device_map="cuda",
    fp32=True,  # Qwen-specific flag; the model does not respect torch_dtype
).eval()

model = HookedTransformer.from_pretrained_no_processing(
    "Qwen/Qwen-1_8B-Chat",
    dtype=torch.float32,
    device="cuda",
)

# Run the same tokens through both implementations and compare the outputs.
encoding = tokenizer("hello my name is ", return_tensors="pt")

response_hf = model_hf(encoding.input_ids.cuda())
logits_tl = model(encoding.input_ids.cuda())

# Raw logit difference (plotted below) and difference in softmax probabilities.
diff = logits_tl - response_hf.logits
prob_diff = F.softmax(logits_tl, dim=-1) - F.softmax(response_hf.logits, dim=-1)
print(prob_diff.std().item(), prob_diff.mean().item(), prob_diff.max().item())
bryce13950 commented 1 month ago

@mntss It is completely certain that the issue is implementation inaccuracy. This topic has been discussed a lot over the last few months. If you are curious about the details, I would refer you to issues https://github.com/TransformerLensOrg/TransformerLens/issues/570 and https://github.com/TransformerLensOrg/TransformerLens/issues/685, which was opened today. All three of these issues are related to the same problem, but the problem is systemic. If you are curious about the fix for something like this, I would refer you to PR https://github.com/TransformerLensOrg/TransformerLens/pull/652, which resolved the issue for Mixtral, but the issue remains across most implementations. We are simply in the process of identifying which implementations are most impacted at this point.

The benchmark you ran is many orders of magnitude worse than for other supported models. E.g., Mixtral was off by one hundred-thousandth, yet it was generating French, Spanish, and German on English prompts. 0.0009 is a remarkably bad result for the benchmark you are looking at. If you are curious to help resolve the issue, then let me know, and I can walk you through the resolution process. 95% of the problem is the usage of einsum, which is not used at all in any official implementation within Transformers. Once those einsums are removed, the inaccuracies clear up in almost all cases. The issue about EOS tokens could also be a part of the problem, but it is likely 10-20 different factors, as was the case with Mixtral, with the vast majority of the issue caused by einsum.
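As a rough sketch of the kind of rewrite involved (illustrative shapes only, not actual TransformerLens code), here is the same per-head query projection written with einsum and with a plain broadcasted matmul, the style used by the Transformers implementations:

import torch

# Illustrative shapes; not taken from any real config.
batch, pos, n_heads, d_model, d_head = 2, 5, 16, 2048, 128
resid = torch.randn(batch, pos, d_model)
W_Q = torch.randn(n_heads, d_model, d_head)

# einsum-based per-head query projection
q_einsum = torch.einsum("bpm,imh->bpih", resid, W_Q)

# the same projection as a broadcasted matmul, then reordering the dims
q_matmul = (resid.unsqueeze(1) @ W_Q).transpose(1, 2)

# mathematically identical; any difference printed here is purely numerical
print((q_einsum - q_matmul).abs().max().item())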

I am in the middle of revamping the weight conversions at the moment so that benchmarking tools can be built to automate the comparison you ran, but it is a pretty involved process. Once I am done building the benchmarking tools, I will be analyzing each implementation currently supported by TransformerLens to identify where the inaccuracies are most pronounced. If you are interested in helping with this process, let me know! We are looking for people who are interested in helping resolve this problem across the board.

mntss commented 1 month ago

Thanks for clarifying! From the issue description, I assumed the problem was the generated tokens being in Chinese, and that this behavior is the same for the HF implementation with this prompt, rather than being the result of the inaccuracy.

Also, I found out that the Qwen model does not respect the torch_dtype parameter; the actual max prob difference is 2.62e-06 (mean 3.26e-09, std -1.78e-13). I updated my comment. Here is the logit diff: [image: logit difference plot]

It would be helpful for me to understand the target implementation accuracy for TL. I noticed this test which expects a perfect match for GPT-2: https://github.com/TransformerLensOrg/TransformerLens/blob/main/tests/integration/test_match_huggingface.py
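If a bitwise match is not the target, a tolerance-based check is one possible alternative. A sketch, with purely illustrative atol/rtol values rather than any agreed TransformerLens target:

import torch

def assert_logits_close(logits_tl: torch.Tensor, logits_hf: torch.Tensor,
                        atol: float = 1e-4, rtol: float = 1e-5) -> None:
    # Raises if any element differs by more than atol + rtol * |expected|.
    torch.testing.assert_close(logits_tl, logits_hf, atol=atol, rtol=rtol)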

In the case of the Qwen model, the attention modules seem like the main issue. However, the outputs of the MLP modules also do not match perfectly because the weights are stored in different orientations: e.g. F.linear(inp, W_gate.T.contiguous()) does not exactly equal F.linear(inp, W_gate.T).
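A small sketch of that last point (the dimensions are illustrative, chosen to resemble Qwen-1.8B rather than taken from the configs):

import torch
from torch.nn import functional as F

torch.manual_seed(0)
inp = torch.randn(4, 2048)        # [batch, d_model]
W_gate = torch.randn(2048, 5504)  # stored [d_model, d_mlp], TL-style orientation

out_view = F.linear(inp, W_gate.T)               # non-contiguous transposed view
out_copy = F.linear(inp, W_gate.T.contiguous())  # contiguous copy, HF-style layout

# The two calls compute the same matmul, but the different memory layouts can
# take different kernel paths and give slightly different floating-point results.
print(torch.equal(out_view, out_copy))
print((out_view - out_copy).abs().max().item())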