aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Running Llama3 Returns Tensor Allocate Status 2 #891

pedrohernandezgeladocma commented 1 month ago

When running the notebook for inference using Llama3 with the following code:

import time
import torch
from transformers import AutoTokenizer, LlamaForCausalLM, LlamaTokenizer, PreTrainedTokenizerFast
from transformers_neuronx import LlamaForSampling, NeuronConfig, GQA, QuantizationConfig
from transformers_neuronx.config import GenerationConfig

# Set this to the Hugging Face model ID
model_id = "meta-llama/Meta-Llama-3-8B"

neuron_config = NeuronConfig(
                    on_device_embedding=False,
                    attention_layout='BSH',
                    fuse_qkv=True,
                    group_query_attention=GQA.REPLICATED_HEADS,
                    quant=QuantizationConfig(),
                    on_device_generation=GenerationConfig(do_sample=True)
              )

# load meta-llama/Meta-Llama-3-8B onto the NeuronCores with 24-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained(model_id, neuron_config=neuron_config, batch_size=1, tp_degree=24, amp='f16', n_positions=4096)
neuron_model.to_neuron()

the following runtime error is returned:

nrt_tensor_allocate status=2 message="Invalid"

Edit: instance type -> inf2.8xlarge, Ubuntu 22 AMI

There are no dependency issues as far as I can tell, but I cannot trace the error beyond that function call, and there are no references to this error in the Troubleshooting docs.
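As a side note, a minimal sketch for sanity-checking how many NeuronCores are visible before choosing tp_degree (assuming Inferentia2 devices, which expose 2 NeuronCores each as /dev/neuronN):

import glob

# Each Inferentia2 device file (/dev/neuronN) corresponds to 2 NeuronCores, so an
# inf2.8xlarge (1 device) supports tp_degree <= 2, while tp_degree=24 needs 12 devices.
neuron_devices = glob.glob("/dev/neuron[0-9]*")
num_cores = 2 * len(neuron_devices)
print(f"Neuron devices: {len(neuron_devices)}, NeuronCores: {num_cores}")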

aws-taylor commented 1 month ago

Thanks @pedrohernandezgeladocma, we're taking a look.

pedrohernandezgeladocma commented 1 month ago

@aws-taylor I think this issue may be related to -> https://github.com/aws-neuron/aws-neuron-sdk/issues/749

aws-taylor commented 1 month ago

Hello @pedrohernandezgeladocma, we suspect the issue is related to your instance type. Can you try again with a larger instance type? In particular, an inf2.8xlarge exposes only 2 NeuronCores, which is too few for tp_degree=24.
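For reference, a minimal sketch of the same load adjusted to fit an inf2.8xlarge would drop tp_degree to 2 (one Inferentia2 device); keeping tp_degree=24 would need an inf2.48xlarge with 24 NeuronCores. This is a sketch under those assumptions, not a tested configuration:

# Sketch only: tp_degree matched to the 2 NeuronCores on an inf2.8xlarge.
# Accelerator memory on the single Inferentia2 device may be tight for Llama-3-8B
# at fp16, so n_positions or batch_size may also need to be lowered.
neuron_model = LlamaForSampling.from_pretrained(
    model_id,
    neuron_config=neuron_config,
    batch_size=1,
    tp_degree=2,
    amp='f16',
    n_positions=4096,
)
neuron_model.to_neuron()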