TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/
MIT License

[Bug Report] The output from HookedTransformer is not identical to the Huggingface model for Llama 3 #615

Open iamsimha opened 1 month ago

iamsimha commented 1 month ago


Describe the bug
The generations from the Hugging Face model (LlamaForCausalLM) and from HookedTransformer are different.

Code example

# relevant imports 
import torch
from transformer_lens import HookedTransformer
from transformers import AutoTokenizer, LlamaForCausalLM

# Generate sample using greedy sampling

def generate_tokens(model, tokenizer, prompt, max_tokens):
    tokenised_prompt = tokenizer(tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False),
                                 padding=False,truncation=False, return_tensors="pt").input_ids
    generated_response = ""
    prompt_length = tokenised_prompt.shape[1]
    all_tokens = torch.zeros(1, prompt_length + max_tokens, dtype=torch.long)
    all_tokens[:, 0:prompt_length] = tokenised_prompt
    with torch.inference_mode():
        for i in range(max_tokens):
            result = model(all_tokens[:, 0:prompt_length + i])
            if isinstance(model, LlamaForCausalLM):
                logits = result.logits
            else:
                logits = result
            next_tokens = logits[:, -1, :].argmax(dim=-1) # greedy sampling (temperature=0)
            all_tokens[:, prompt_length + i] = next_tokens
            generated_response += tokenizer.decode(next_tokens, skip_special_tokens=True)
    return generated_response

MODEL_PATH = 'meta-llama/Meta-Llama-3-8B-Instruct'

# Load the Hugging Face model and tokenizer using LlamaForCausalLM

hf_model = LlamaForCausalLM.from_pretrained(
        MODEL_PATH
)
hf_tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load HookedTransformer model 
hooked_model = HookedTransformer.from_pretrained_no_processing(
    MODEL_PATH,
    device="cuda",
    dtype=torch.float16,
    default_padding_side='left',
    n_devices=1,
)

prompt = """what is the population of Delhi"""

generate_tokens(hooked_model, hf_tokenizer, prompt, 30)

>> '<|start_header_id|>assistant<|end_header_id|>\n\nAs of 2021, the estimated population of Delhi is approximately 31.2 million (31,181,000) people'

generate_tokens(hf_model, hf_tokenizer, prompt, 30)

>> '<|start_header_id|>assistant<|end_header_id|>\n\nAs of 2021, the estimated population of Delhi, the capital city of India, is approximately 29.2 million people'

The result from the hooked model differs from the Hugging Face model. The outputs from the Hugging Face model are identical to results from LLM Arena, so there is likely a bug in the HookedTransformer implementation.
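
A more direct way to pin this down than comparing generations is to compare the next-token logits on the same prompt. Below is a minimal sketch reusing hf_model, hooked_model, hf_tokenizer, and prompt from above; note that the two models are loaded at different precisions here (float32 vs. float16), so some drift is expected, and loading both in the same dtype makes the comparison fairer.

prompt_ids = hf_tokenizer(prompt, return_tensors="pt").input_ids
with torch.inference_mode():
    # hf_model stays on CPU as loaded above; hooked_model is on CUDA
    hf_last = hf_model(prompt_ids).logits[:, -1, :].float().cpu()
    tl_last = hooked_model(prompt_ids.to("cuda"), return_type="logits")[:, -1, :].float().cpu()
print("max abs logit diff:", (hf_last - tl_last).abs().max().item())
print("argmax agrees:", bool((hf_last.argmax(-1) == tl_last.argmax(-1)).all()))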


This may be related to #385

bryce13950 commented 1 month ago

Thank you for being so thorough on this. We are in the middle of working out some tools that will make benchmarking models easier, and I think that is going to help us debug this a lot. In the meantime, if anyone sees this and is able to take it on, that would be greatly appreciated. If not, then hopefully those tools will be ready soon, so that we can more easily figure out what the deal is here.

terarachang commented 1 month ago

I observed similar problems with MODEL_PATH = 'meta-llama/Meta-Llama-3-8B' on in-context learning tasks.

jkminder commented 2 weeks ago

I observe the same issue. It might be related to #570. There seems to be quite a large difference in logits between the two models.

from matplotlib import pyplot as plt
import seaborn as sns
import torch
from transformer_lens import HookedTransformer
from transformers import AutoTokenizer, LlamaForCausalLM

MODEL_PATH = 'meta-llama/Meta-Llama-3-8B-Instruct'

# Load the Hugging Face model and tokenizer using LlamaForCausalLM and move it to a single GPU

hf_model = LlamaForCausalLM.from_pretrained(
        MODEL_PATH
).to('cuda:0')
hf_tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load HookedTransformer model 
hooked_model = HookedTransformer.from_pretrained_no_processing(
    MODEL_PATH,
    device="cuda",
    dtype=torch.float32,
    default_padding_side='left',
    n_devices=1,
)

a_large_chunk_of_text = "Generate is not a good way to check a model is running properly. Can you run the following and share the results?"
tokens = hf_tokenizer(a_large_chunk_of_text, return_tensors="pt").input_ids.to('cuda:0')
logits_hf = hf_model(tokens).logits
logits_tl = hooked_model(tokens, return_type="logits")

logits_diff = (logits_hf - logits_tl)
logits_diff_last = logits_diff[:, -1, :]
print("TF Greedy:", logits_hf[:, -1, :].argmax(dim=-1), "Logit:", logits_hf[:, -1, :].max())
print("TL Greedy:", logits_tl[:, -1, :].argmax(dim=-1), "Logit:", logits_tl[:, -1, :].max())
# histogram of the difference between the logits from the hooked model and the huggingface model
sns.histplot(logits_diff_last.flatten().cpu().numpy(), bins=100)
# set title
plt.title("Difference between logits (last) from the hooked model and the huggingface model")

[image: histogram of the difference between the last-position logits of the two models]

jkminder commented 2 weeks ago

I made a few more investigations. It seems the differences stem from both the attention and the MLP outputs (I haven't appended the plot, but the embedding matrix output is equal).


from nnsight import NNsight

hf_model_nnsight = NNsight(hf_model)

with hf_model_nnsight.trace(tokens):
    hf_attn_out = hf_model_nnsight.model.layers[0].self_attn.output.save()
    hf_mlp_out = hf_model_nnsight.model.layers[0].mlp.output.save()
    hf_resid_post = hf_model_nnsight.model.layers[0].output.save()
_, cache = hooked_model.run_with_cache(tokens)

layer0_attn_out_diff = hf_attn_out[0] - cache["blocks.0.hook_attn_out"]
layer0_mlp_out_diff = hf_mlp_out[0] - cache["blocks.0.hook_mlp_out"]
layer0_resid_post_diff = hf_resid_post[0] - cache["blocks.0.hook_resid_post"]

# histogram of the difference between the attention output from the hooked model and the huggingface model
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 5))
sns.histplot(layer0_attn_out_diff.detach().flatten().cpu().numpy(), bins=100, ax=ax1)
ax1.set_title("Difference between attention output")
# histogram of the difference between the mlp output from the hooked model and the huggingface model
sns.histplot(layer0_mlp_out_diff.detach().flatten().cpu().numpy(), bins=100, ax=ax2)
ax2.set_title("Difference between mlp output")
# histogram of the difference between the residual post output from the hooked model and the huggingface model
sns.histplot(layer0_resid_post_diff.detach().flatten().cpu().numpy(), bins=100, ax=ax3)
ax3.set_title("Difference between residual post output")
plt.suptitle("Difference between outputs from the hooked model and the huggingface model in layer 0")

[image: histograms of the layer 0 attention, MLP, and residual-post output differences]
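
For reference, the embedding check mentioned above can be done the same way. A minimal sketch reusing tokens, cache, and hf_model from the snippet above (module and hook names assume the standard Llama layout in transformers and TransformerLens):

with torch.inference_mode():
    hf_embed = hf_model.model.embed_tokens(tokens)
tl_embed = cache["hook_embed"]
# With from_pretrained_no_processing there is no weight folding, so these should match exactly
print("embedding outputs identical:", bool((hf_embed == tl_embed).all()))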

bryce13950 commented 2 weeks ago

@jkminder If you are looking into this, the thing to do is to check the implementation of the model in transformers (https://github.com/huggingface/transformers/tree/main/src/transformers/models) and see how the MLPs work there. I found an issue in Mixtral where the TransformerLens implementation was adding the bias to the result while the transformers implementation was not interacting with the bias at all; there's a good chance something similar is going on here.

I am in the middle of basically completely reworking the MLPs, so you might want to look at the branch mlp-cleanup. You will need to make sure it is using the right MLP though: at the moment I have split the gated MLP into two classes, but I have not yet integrated the new one so that it is used where it needs to be used. The part that is probably causing an issue, which I have not looked at at all yet, is the attention side. The same method of debugging would work there: check what the transformers implementation does and see if you can find where we are going wrong. If you want to start with attention, it may be a better use of your time, since I have the MLPs kind of half taken apart right now.

Also, there is a notebook that you may find helpful: https://github.com/TransformerLensOrg/TransformerLens/blob/dev/debugging/comparing-to-huggingface.ipynb. It looks like you have already basically created your own version of it, but it may be worth a look.
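
As a concrete version of that side-by-side comparison, here is a minimal sketch that recomputes one layer's MLP by hand from each model's weights and compares the results. It assumes Llama's standard SiLU-gated MLP, down_proj(silu(gate_proj(x)) * up_proj(x)) with no biases, the TransformerLens [d_model, d_mlp] parameter layout, and reuses hf_model and hooked_model from the snippets above.

import torch
import torch.nn.functional as F

def hf_mlp_by_hand(layer, x):
    # Hugging Face LlamaMLP: down_proj(silu(gate_proj(x)) * up_proj(x)), no biases
    return layer.mlp.down_proj(F.silu(layer.mlp.gate_proj(x)) * layer.mlp.up_proj(x))

def tl_mlp_by_hand(block, x):
    # The same computation written against the TransformerLens parameter layout
    mlp = block.mlp
    return (F.silu(x @ mlp.W_gate) * (x @ mlp.W_in)) @ mlp.W_out + mlp.b_out

x = torch.randn(1, 4, 4096, device="cuda:0")
with torch.inference_mode():
    diff = (hf_mlp_by_hand(hf_model.model.layers[0], x) - tl_mlp_by_hand(hooked_model.blocks[0], x)).abs()
print("max abs difference:", diff.max().item())

If both sides implement the same formula, any remaining gap comes from the weights or from numerics rather than from the architecture itself.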

jkminder commented 2 weeks ago

@bryce13950 Thanks, I already started digging a bit deeper here. It seems the weights are slightly different, even though an elementwise equality test doesn't show this. It would be great to have a second pair of eyes on this! I am examining the MLPs in the first layer of Llama 3 (following my code above).

Test: are all weights the same? -> Yes

(
    (hooked_model.blocks[0].mlp._parameters["W_gate"] == hf_model.model.layers[0].mlp.gate_proj.weight.T).all(),
    (hooked_model.blocks[0].mlp._parameters["W_in"] == hf_model.model.layers[0].mlp.up_proj.weight.T).all(),
    (hooked_model.blocks[0].mlp._parameters["W_out"] == hf_model.model.layers[0].mlp.down_proj.weight.T).all(),
)

(tensor(True, device='cuda:0'), tensor(True, device='cuda:0'), tensor(True, device='cuda:0'))
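
One thing that might be worth ruling out when elementwise-equal tensors multiply differently is dtype and memory layout: W_gate matches gate_proj.weight.T elementwise, but if it ends up stored as a non-contiguous transposed view, the matmul can take a different kernel path and accumulate rounding differently. A small, purely illustrative check:

w_tl = hooked_model.blocks[0].mlp.W_gate
w_hf = hf_model.model.layers[0].mlp.gate_proj.weight.T
print("dtypes:", w_tl.dtype, w_hf.dtype)
print("contiguous:", w_tl.is_contiguous(), w_hf.is_contiguous())
print("strides:", w_tl.stride(), w_hf.stride())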

Test: Is the output equal? No

test_in = torch.ones(1, 1, 4096).to('cuda:0')
hf_out = hf_model.model.layers[0].mlp(test_in).detach().cpu()
hooked_out = hooked_model.blocks[0].mlp(test_in).detach().cpu()
is_close = torch.isclose(hf_out, hooked_out)
(~is_close).sum(), is_close.all()

(tensor(9), tensor(False))

Test: Zooming in on W_gate -> same issue

from fancy_einsum import einsum
hf_w_gate_out = hf_model.model.layers[0].mlp.gate_proj(test_in).detach().cpu()
# https://github.com/TransformerLensOrg/TransformerLens/blob/318236402ddcc9cabace3f2fca40c71f0c2e9e57/transformer_lens/components/gated_mlp.py#L102C13-L107C18
hooked_w_gate_out = einsum(
                    "batch pos d_model, d_model d_mlp -> batch pos d_mlp",
                    test_in,
                    hooked_model.blocks[0].mlp.W_gate,
                ).detach().cpu()
is_close = torch.isclose(hf_w_gate_out, hooked_w_gate_out)
(~is_close).sum(), is_close.shape

(tensor(9), torch.Size([1, 1, 14336]))

Test: Is this an issue with fancy_einsum? Does opt_einsum work? No, same issue

from opt_einsum import contract
hf_w_gate_out = hf_model.model.layers[0].mlp.gate_proj(test_in).detach().cpu()
# Adapted from https://github.com/TransformerLensOrg/TransformerLens/blob/318236402ddcc9cabace3f2fca40c71f0c2e9e57/transformer_lens/components/gated_mlp.py#L102C13-L107C18
hooked_w_gate_out = contract("bpk, kd -> bpd",
                    test_in,
                    hooked_model.blocks[0].mlp.W_gate,
                ).detach().cpu()
is_close = torch.isclose(hf_w_gate_out, hooked_w_gate_out)
(~is_close).sum(), is_close.shape

(tensor(9), torch.Size([1, 1, 14336]))

Test: How about just normal torch matrix product? Same issue

hooked_w_gate_out = (test_in @ hooked_model.blocks[0].mlp.W_gate).detach().cpu()
is_close = torch.isclose(hf_w_gate_out, hooked_w_gate_out)
(~is_close).sum(), is_close.shape

(tensor(9), torch.Size([1, 1, 14336]))

Test: Convert to a torch.nn.Linear and then compute -> same issue

lin = torch.nn.Linear(4096, 14336, bias=False)
lin.weight = torch.nn.Parameter(hooked_model.blocks[0].mlp.W_gate.T)
hooked_w_gate_out = lin(test_in).detach().cpu()
is_close = torch.isclose(hf_w_gate_out, hooked_w_gate_out)
(~is_close).sum(), is_close.shape

(tensor(9), torch.Size([1, 1, 14336]))

Test: Verify that the implementation of torch.nn.functional.linear is not the problem -> confirmed, the mismatch persists

hooked_w_gate_out = torch.nn.functional.linear(test_in, hooked_model.blocks[0].mlp.W_gate.T).detach().cpu()
hf_w_gate_out = torch.nn.functional.linear(test_in, hf_model.model.layers[0].mlp.gate_proj.weight).detach().cpu()
is_close = torch.isclose(hf_w_gate_out, hooked_w_gate_out)
(~is_close).sum(), (hf_w_gate_out != hooked_w_gate_out).sum()

(tensor(9), tensor(4819))

Test: Replace the hooked weights with the HF weights -> solves it

hooked_model.blocks[0].mlp.W_gate.data = hf_model.model.layers[0].mlp.gate_proj.weight.T
hf_w_gate_out = hf_model.model.layers[0].mlp.gate_proj(test_in).detach().cpu()
# https://github.com/TransformerLensOrg/TransformerLens/blob/318236402ddcc9cabace3f2fca40c71f0c2e9e57/transformer_lens/components/gated_mlp.py#L102C13-L107C18
hooked_w_gate_out = einsum(
                    "batch pos d_model, d_model d_mlp -> batch pos d_mlp",
                    test_in,
                    hooked_model.blocks[0].mlp.W_gate,
                ).detach().cpu()
is_close = torch.isclose(hf_w_gate_out, hooked_w_gate_out)
(~is_close).sum(), is_close.shape

(tensor(0), torch.Size([1, 1, 14336]))

I'm really unsure what to make of this behaviour, but I guess it could explain why we are seeing differences in outputs. I assume there are slight differences in the weights that result in numerical issues; weirdly, when comparing them elementwise they are all equal. Any insights on what could be done to solve this? I guess one could switch to torch.nn.functional calls for the matrix multiplications.
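
For what it's worth, a minimal sketch of what such a swap could look like, illustrative only rather than an actual patch; it uses the TransformerLens [d_model, d_mlp] layout and mirrors the torch.nn.functional.linear test above.

import torch
import torch.nn.functional as F

def gate_pre_act(x: torch.Tensor, W_gate: torch.Tensor) -> torch.Tensor:
    # x: [batch, pos, d_model]; W_gate: [d_model, d_mlp] (TransformerLens layout)
    # F.linear expects weights as [out_features, in_features], hence the transpose;
    # .contiguous() avoids handing the backend a transposed view
    return F.linear(x, W_gate.T.contiguous())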

Also in case you don't wanna copy paste, here is my notebook: https://gist.github.com/jkminder/d05d708f3f93c66037ac7f0c352eefa4

bryce13950 commented 1 week ago

@jkminder Sorry for the delay on getting back to you. we have some pretty major changes coming to existing components in the same vein as what you are bringing up. At the moment, I am completely focused on wrapping up a number of changes for mixtral, which is similar to this, and after that I intend to start playing with this model to see what the status is at that point. Hopefully that will be all done, merged, and released within the next few days. Let's relook this at that point. Some more info on the details of what is being discussed can be found in #645. Regardless, more work is likely to be needed here, but this will benefit from the work being done.