Discrepancies Between GPU and Neuron-based Outputs for GPTJ Model on inf2.24xlarge

ho4040 commented 1 year ago

I attempted to use this model through inf2.24xlarge. This model is based on the GPTJ architecture, but when I run this model based on Neuron, the results differ greatly from those on a GPU-based system. Completely meaningless words are outputted. It works fine with GPU.

Below is the compilation code:

from transformers.models.auto import AutoModelForCausalLM
import torch
from transformers_neuronx.module import save_pretrained_split
hf_model = AutoModelForCausalLM.from_pretrained('PygmalionAI/pygmalion-6b', low_cpu_mem_usage=True)
def amp_callback(model, dtype):
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)
amp_callback(hf_model, torch.float16)
save_pretrained_split(hf_model, './pygmalion-6b-split')

Below is the inference code:

import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.gptj.model import GPTJForSampling
neuron_model = GPTJForSampling.from_pretrained('./pygmalion-6b-split', n_positions=1024, batch_size=1, tp_degree=8, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('PygmalionAI/pygmalion-6b')
batch_prompts = [
    "Jihye's Persona: A 22-year-old woman working part-time at a convenience store in Seoul.\n<START>\nYou: ...\nJihye: Welcome, man.\nYou: hello?\nJihye: ",]
input_ids = torch.as_tensor([tokenizer.encode(text) for text in batch_prompts])
with torch.inference_mode():
    # warmup 
    generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

Environment: AMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 VENV: aws_neuron_venv_pytorch

aws-mvaria commented 1 year ago

Thanks @ho4040 , We are taking a look and will get back to you shortly.

sheenobu commented 1 year ago

I was able to get adequate output on inf2.8xlarge:

['Jihye\'s Persona: A 22-year-old woman working part-time at a convenience store in Seoul.
<START>
You:...
Jihye: Welcome, man.
You: hello?
Jihye: You can use the bathroom now. I\'ll be right here, waiting. 
Jihye: Please do yourself a favor and be fast about it. I\'m not here for your business. If I had more of that in my store, I wouldn\'t be running as fast to help as I am now. If all of customers were as well behaved as you, my department would be a lot less of a pain to manage.
Jihye: Let\'s not get into any more of an argument. You seem impatient to get back to your business. I\'ll wait for you again when you\'re finished. Good luck.
Jihye: If you\'re finished, I mean. (I\'ve been waiting a while...)
<STOP>
Jihye: *I sigh.*

Shit... I wonder how bad of a week it would have to be for a customer like him...

*It wasn\'t exactly surprising that customers like this were']

I had to make a few changes to get it running on a smaller machine:

smaller params here:

GPTJForSampling.from_pretrained('./pygmalion-6b-split', n_positions=256, batch_size=1, tp_degree=1, amp='f16')

and

neuron_model.sample(input_ids, sequence_length=256)
start = time.time()
neuron_model.sample(input_ids, sequence_length=256)

then run with FI_EFA_FORK_SAFE=1.

Environment: RockyLinux 9.2, Podman container running python 3.8 and transformers_neuronx-0.5.58

I'm not sure what revision of pygmalian I have, could be an old one. Here is the sha256sum of model-00001:

# sha256sum pytorch_model-00001-of-00002.bin
88ba2b44537f444e3fad92dff6962ac8c0b983427523484f98e7acf2d71fd65e  pytorch_model-00001-of-00002.bin

aws-neuron / transformers-neuronx

Discrepancies Between GPU and Neuron-based Outputs for GPTJ Model on inf2.24xlarge #28