TorchMoE / MoE-Infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.
Apache License 2.0

Output of Mixtral-8x7B is strange #16

Closed JustQJ closed 5 months ago

JustQJ commented 6 months ago

Thanks for your great work. I tried running the Mixtral MoE model, but I got some strange output. When I run the model on the CPU with the following script, the output is normal. Script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # loaded on CPU in float32

text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying"
inputs = tokenizer(text, return_tensors="pt")

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start

print(model.dtype)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Time cost: {cost}s")

output:

torch.float32
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue a career in journalism. I have a love for all things fashion, beauty and lifestyle related and I am hoping to share my thoughts and opinions with you all.

I have always been a huge fan of reading and writing and I am hoping to share my passion with you all. I am hoping to share my thoughts and opinions on all things

58.17350935935974s

But when I use moe-infinity to run the model on the GPU, I get strange output. Script:

import torch
import os
from transformers import AutoTokenizer
import time
from moe_infinity import MoE

model_id = "mistralai/Mixtral-8x7B-v0.1"
config = {
    "offload_path": "baselines/cache",
    "device_memory_ratio": 0.75, # 75% of the device memory is used for caching, change the value according to your device memory size on OOM
}

model = MoE(model_id, config)
input_text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying "
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(input_text, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
print(f"Time cost: {cost}s")

output:

Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying qu‘‘Âub“‘‘� du…­‘‘‘­‘­Â9‘­‘­‘­an’adqu
‘dededok‘‘‘­’’‘ququ‘‘‘‘ok‘‘‘‘’‘’’’‘‘‘’’’’’ak’’‘‘‘‘‘‘’’’’’’’’’’’’ dess‘‘af’’ of ofged dec

Time cost: 216.83905959129333s

I ran the model on an NVIDIA GeForce RTX 4090. Could you give me some advice? Thanks for your help.

drunkcoding commented 6 months ago

For a quick update: I am not able to reproduce the issue, but just to double-check, two models MUST NOT share the same offload_path in the configuration. I suspect the parameters are being shared from another model. If this still persists, please let me know and provide the full printed log.
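A minimal sketch of what separate offload paths could look like, assuming two MoE checkpoints are served from the same machine (the second model_id and both directory names are only illustrative):

from moe_infinity import MoE

# Give every checkpoint its own offload directory so cached expert
# parameters from one model are never picked up by another.
mixtral = MoE(
    "mistralai/Mixtral-8x7B-v0.1",
    {
        "offload_path": "baselines/cache/mixtral-8x7b",  # unique to this model
        "device_memory_ratio": 0.75,
    },
)

other_moe = MoE(
    "some-org/another-moe-checkpoint",  # hypothetical second model
    {
        "offload_path": "baselines/cache/another-moe",  # different directory
        "device_memory_ratio": 0.75,
    },
)

Clearing a previously used offload_path before reusing it for a different checkpoint should likewise avoid picking up stale parameters.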

JustQJ commented 5 months ago

For a quick update: I am not able to reproduce the issue, but just to double-check, two models MUST NOT share the same offload_path in the configuration. I suspect the parameters are being shared from another model. If this still persists, please let me know and provide the full printed log.

Yes, you are right. I did share the same offload_path between two models. Thanks for your help.