aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

RuntimeError: neuronx-cc failed with -1 (OPT 1.3B) #708

Closed lvnair3 closed 10 months ago

lvnair3 commented 11 months ago

Task

OPT 1.3B inference on Wikitext2 using E4M3 on Trainium Trn1

Inference Script

The full script is attached as script.zip. It is an adaptation of the run_clm_no_trainer.py script from HuggingFace.

I've adapted the script to only perform inference and no training. I also adapted it to include the following block of code for NeuronX:

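import torch_neuronx

# Take one batch from the evaluation dataloader as the example input for tracing.
inputs = next(iter(eval_dataloader))
example = (inputs['input_ids'], inputs['attention_mask'], inputs['labels'])
model.eval()

# Wrap forward so it accepts positional tensors and returns a tuple rather than a dict.
orig_func = model.forward
def forward_with_labels(input_ids, attention_mask, labels):
    return orig_func(input_ids, attention_mask, labels=labels, return_dict=False)

model.forward = forward_with_labels
# Compile for Neuron, asking the compiler to auto-cast to FP8 (E4M3).
model = torch_neuronx.trace(model, example, compiler_args=['--auto-cast-type', 'fp8_e4m3'])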

Run command

MODEL_NAME=lnair/opt-1.3b-wikitext2
python -u eval_opt.py \
    --model_name_or_path $MODEL_NAME \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_eval_batch_size 1 \
    --seed 42 \
    --output_dir ./tmp/test-clm

NOTE: The model lnair/opt-1.3b-wikitext2 is a fine-tuned version of facebook/opt-1.3b with no architectural changes. Compilation fails with both the lnair/opt-1.3b-wikitext2 and facebook/opt-1.3b checkpoints.

Error

This script for OPT 1.3B fails with the following error:

2023-07-14T18:16:48Z ERROR 26345 [Tensorizer]: Transformation error on operator: _dot.3766
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: ***************************************************************
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:  An Internal Compiler Error has occurred
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: ***************************************************************
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: 
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error message:  type object 'StaticProfiler' has no attribute 'matmul_compute_dtype_fp8e4'
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: 
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error class:    AttributeError
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error location: Unknown
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Command line:   /home/lakshmi/aws_neuron_venv_pytorch/bin/neuronx-cc compile /tmp/tmp_quy15xg/model --framework XLA --target trn1 --output /tmp/tmp_quy15xg/graph.neff --auto-cast-type fp8_e4m3
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: 
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Internal details:
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/CommandDriver.py", line 259, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1089, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1040, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1065, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1069, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 359, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 164, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Penguin.py", line 299, in neuronxcc.starfish.penguin.Penguin.runPenguin
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 150, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 151, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 159, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 212, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaFromFile
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 216, in neuronxcc.starfish.penguin.Compile.compile_module
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 218, in neuronxcc.starfish.penguin.Compile.compile_module
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 261, in neuronxcc.starfish.penguin.Compile.compile_module
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 269, in neuronxcc.starfish.penguin.Compile.genenerate_code_and_metadata_for_module
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 122, in neuronxcc.starfish.penguin.Compile.generate_code_and_metadata
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 143, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runOnFunction
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 196, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 191, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 208, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 210, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 211, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 240, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 242, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 341, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 342, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 333, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runTransforms
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 322, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmts
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 374, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 377, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 364, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 134, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/StaticProfiler.py", line 357, in neuronxcc.starfish.penguin.targets.tonga.passes.StaticProfiler.StaticProfiler.transformMatMulOp
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: 
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Version information:
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   NeuronX Compiler version 2.6.0.19+3d819e565
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   Python version 3.8.10
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   HWM version 2.6.0.0-826e77395
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   NEFF version Dynamic
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   TVM not available
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   NumPy version 1.21.6
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]:   MXNet not available
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: 
2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Artifacts stored in: /home/lakshmi/neuronxcc-7vdnjjpj
Traceback (most recent call last):
  File "eval_opt.py", line 559, in <module>
    main()
  File "eval_opt.py", line 519, in main
    model = torch_neuronx.trace(model, example, compiler_args=['--auto-cast-type', 'fp8_e4m3'])
  File "/home/lakshmi/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py", line 309, in trace
    neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
  File "/home/lakshmi/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py", line 232, in hlo_compile
    raise RuntimeError(f'neuronx-cc failed with {status}')
RuntimeError: neuronx-cc failed with 1
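
Most of the log above is the internal stack trace; the lines that identify the failure are the "Error message", "Error class", and "Command line" fields. As a rough sketch (not part of the original script), a small helper like the one below could pull those fields out of a saved log. The function name `summarize_compiler_log` and the regular expression are illustrative assumptions based only on the line format shown in this report.

```python
import re

# Matches summary lines in the format shown in this report, for example:
# 2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error class:    AttributeError
# The set of field names is an assumption taken from this one log.
FIELD_RE = re.compile(
    r"\[neuronx-cc\]:\s*(Error message|Error class|Command line):\s*(.*)$"
)


def summarize_compiler_log(lines):
    """Return the first value seen for each summary field in the log."""
    summary = {}
    for line in lines:
        match = FIELD_RE.search(line.rstrip())
        if match and match.group(1) not in summary:
            summary[match.group(1)] = match.group(2).strip()
    return summary


# Three lines copied from the log in this report.
sample = [
    "2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error message:  type object 'StaticProfiler' has no attribute 'matmul_compute_dtype_fp8e4'",
    "2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error class:    AttributeError",
    "2023-07-14T18:16:48Z ERROR 26345 [neuronx-cc]: Error location: Unknown",
]

for field, value in summarize_compiler_log(sample).items():
    print(f"{field}: {value}")
```

Here the summary shows an AttributeError raised inside the compiler's StaticProfiler pass while handling the fp8_e4m3 cast type, which is the detail worth quoting when reporting the failure.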

Thanks in advance!

lvnair3 commented 11 months ago

This issue was resolved once I followed all of the setup steps in the Neuron SDK documentation (I had missed a few of them earlier): https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20.html
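
The thread does not say which setup steps were missed, so the following is only a generic sanity check rather than the actual fix. It lists the installed versions of the packages involved in tracing so they can be compared against the setup guide. The distribution names in `PACKAGES` are assumptions about how the packages are registered with pip; adjust them for your environment.

```python
from importlib import metadata

# Distribution names assumed for this sketch; adjust to your environment.
PACKAGES = ("neuronx-cc", "torch-neuronx", "torch-xla", "torch")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the same virtual environment used for the evaluation script (here, aws_neuron_venv_pytorch) shows at a glance whether a package is missing.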

shebbur-aws commented 11 months ago

Thank you, Lakshmi, for reporting this. While we try to reproduce the issue on our end, could you try compiling without the '--auto-cast-type', 'fp8_e4m3' option, which casts the model to fp8_e4m3? By default, we cast to bf16. You could also try '--auto-cast-type', 'fp16' and see if it unblocks you.
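
One way to try these alternatives in a single run is sketched below. It is a hypothetical helper, not part of torch_neuronx: `trace_with_fallback` takes the trace function as a parameter, tries each cast type in order, and relies on the behaviour seen in the traceback above, where torch_neuronx.trace raises RuntimeError when neuronx-cc fails. A cast type of None means no --auto-cast-type flag is passed at all.

```python
def trace_with_fallback(trace_fn, model, example, cast_types):
    """Try each cast type in order and return (cast_type, traced_model).

    trace_fn is expected to behave like torch_neuronx.trace: it accepts an
    optional compiler_args keyword and raises RuntimeError when compilation
    fails. A cast_type of None omits the --auto-cast-type flag entirely.
    """
    errors = {}
    for cast_type in cast_types:
        kwargs = {}
        if cast_type is not None:
            kwargs["compiler_args"] = ["--auto-cast-type", cast_type]
        try:
            return cast_type, trace_fn(model, example, **kwargs)
        except RuntimeError as exc:
            # Remember why this cast type failed and move on to the next one.
            errors[cast_type] = str(exc)
    raise RuntimeError(f"all cast types failed: {errors}")


# On a Neuron instance this might be called as (untested assumption):
#   import torch_neuronx
#   cast, model = trace_with_fallback(
#       torch_neuronx.trace, model, example, ["fp8_e4m3", "fp16", None]
#   )
```

Passing the trace function in, rather than importing torch_neuronx inside the helper, keeps the sketch runnable on machines without the Neuron SDK and makes it easy to exercise with a stub.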

shebbur-aws commented 11 months ago

Glad you were able to resolve your issue

jluntamazon commented 10 months ago

Closing this issue since it appears to be resolved