aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

RuntimeError: neuronx-cc failed with -9 on OPT 1.3B #706

Closed: lvnair3 closed this issue 1 year ago

lvnair3 commented 1 year ago

Task

OPT 1.3B inference on Wikitext2 using E4M3 on Trainium Trn1

Inference Script

The full script is attached as script.zip. Essentially, it is an adaptation of the run_clm_no_trainer.py script from HuggingFace, available here.

I've adapted the script to perform inference only (no training), and added the following block of code for NeuronX:

import torch_neuronx

# Grab one batch from the evaluation dataloader to serve as the example
# input for tracing.
inputs = next(iter(eval_dataloader))
example = (inputs['input_ids'], inputs['attention_mask'], inputs['labels'])
model.eval()

# Wrap the original forward so it accepts positional tensors and returns a
# plain tuple (return_dict=False), which tracing requires.
orig_func = model.forward
def forward_with_labels(input_ids, attention_mask, labels):
    return orig_func(input_ids, attention_mask, labels=labels, return_dict=False)

model.forward = forward_with_labels
model = torch_neuronx.trace(model, example, compiler_args=['--auto-cast-type', 'fp8_e4m3'])
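
For reference, the object returned by torch_neuronx.trace is a TorchScript module, so it can be persisted with torch.jit.save to avoid recompiling on every run. A minimal sketch, reusing model and example from the block above; the file name is illustrative:

import torch
import torch_neuronx  # must be imported before loading a Neuron-traced model

# Save the compiled model so the expensive neuronx-cc compilation step is
# only paid once; 'opt_neuron.pt' is a placeholder name.
torch.jit.save(model, 'opt_neuron.pt')

# Later, reload and run with the same positional inputs used for tracing.
model = torch.jit.load('opt_neuron.pt')
outputs = model(*example)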

Run command

MODEL_NAME=lnair/opt-1.3b-wikitext2
python -u eval_opt.py \
    --model_name_or_path $MODEL_NAME \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_eval_batch_size 1 \
    --seed 42 \
    --output_dir ./tmp/test-clm

NOTE: The model lnair/opt-1.3b-wikitext2 is a fine-tuned version of facebook/opt-1.3b (no architectural changes here). Nevertheless, it fails on both the lnair/opt-1.3b-wikitext2 and facebook/opt-1.3b checkpoints.

Error

NOTE: The script works for OPT 125M and OPT 350M models without any errors. However, it fails on the OPT 1.3B model with the following error:

 File "eval_opt.py", line 519, in main
    model = torch_neuronx.trace(model, example, compiler_args=['--auto-cast-type', 'fp8_e4m3'])
  File "/home/usern/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py", line 272, in trace
    neff_filename, metaneff, flattener, packer = _trace(
  File "/home/usern/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py", line 340, in _trace
    neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
  File "/home/usern/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py", line 232, in hlo_compile
    raise RuntimeError(f'neuronx-cc failed with {status}')
RuntimeError: neuronx-cc failed with -9

Thanks in advance!

aws-donkrets commented 1 year ago

Hi lvnair3 - typically a "-9" error indicates that the host OS killed the compiler due to an out-of-memory condition. Since your model compiles fine at smaller parameter sizes, I suspect this is what is occurring with the 1.3B model. I suggest trying to compile on an instance with a larger memory configuration.
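
For context on reading the status code (this explanation is mine, not from the Neuron docs): Python's subprocess machinery reports a child killed by signal N as a return code of -N, so -9 means the compiler process received SIGKILL, the signal the Linux OOM killer delivers. A small illustration:

import signal

# A negative subprocess return code is the signal number, negated;
# -9 therefore corresponds to signal 9, i.e. SIGKILL.
status = -9
print(signal.Signals(-status).name)  # prints: SIGKILL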

lvnair3 commented 1 year ago

Thank you! The RuntimeError: neuronx-cc failed with -9 was resolved with the larger memory config, but I'm now getting a different error for the same script: RuntimeError: neuronx-cc failed with 1. I found this issue reported here as well: #690. I've created a new issue for this error here: #708, so please feel free to close this one as resolved.