Closed — lvnair3 closed this issue 1 year ago.
Hi lvnair3 - typically a "-9" error indicates that the host OS killed the compiler due to an out-of-memory condition. Since your model compiles fine at smaller parameter sizes, I suspect this is what is occurring with the 1.3B model. I suggest trying the compilation on an instance with a larger memory configuration.
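For context on why "-9" means an OOM kill: a negative return code from a child process is the convention for death by signal, and signal 9 is SIGKILL, which is what the Linux OOM killer sends. A small stdlib-only sketch (using `sleep` as a stand-in for the compiler process) shows how a SIGKILLed child surfaces as -9:

```python
# A return code of -9 from a subprocess means the process was terminated
# by signal 9 (SIGKILL) -- on Linux, the kernel OOM killer delivers SIGKILL,
# which is why an out-of-memory compile shows up as "neuronx-cc failed with -9".
import signal
import subprocess

proc = subprocess.Popen(["sleep", "60"])   # stand-in for the compiler process
proc.send_signal(signal.SIGKILL)           # simulate the OOM killer
proc.wait()
print(proc.returncode)                     # -9: negative signal number
```

This is why the error disappears on an instance with more memory: the OOM killer is never triggered, so the compiler runs to completion.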
Thank you! The `RuntimeError: neuronx-cc failed with -9` was resolved with the larger memory config, but I'm now getting a different error from the same script: `RuntimeError: neuronx-cc failed with 1`. I found this issue reported in #690 as well, and I've created a new issue for this error in #708. So please feel free to close this one as resolved.
Task
OPT 1.3B inference on Wikitext2 using E4M3 on Trainium (Trn1)
Inference Script
The full script is attached as script.zip. Essentially, it is an adaptation of the run_clm_no_trainer.py script from HuggingFace here. I've adapted the script to perform inference only (no training) and to include the following block of code for NeuronX:
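The attached NeuronX block is not reproduced in this thread. As a rough illustration only (not the author's code), a typical torch_neuronx compilation block for a HuggingFace causal LM looks something like the following; the checkpoint name and sequence length are assumptions, and this requires a Trn1 instance with the Neuron SDK installed:

```python
# Rough illustration only -- NOT the attached script's actual NeuronX block.
# Assumes a Trn1 instance with the AWS Neuron SDK (torch-neuronx) installed.
import torch
import torch_neuronx
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # same base checkpoint the issue reports failing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torchscript=True)
model.eval()

# Neuron compiles for static shapes, so pad inputs to a fixed sequence length.
example = tokenizer("Hello, world", return_tensors="pt",
                    padding="max_length", max_length=128)
example_inputs = (example["input_ids"], example["attention_mask"])

# torch_neuronx.trace invokes neuronx-cc under the hood; a host-OS OOM kill
# during this step is what surfaces as "neuronx-cc failed with -9".
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "opt-1.3b-neuron.pt")
```

The compilation step is where the memory pressure occurs, which is consistent with the 125M and 350M models succeeding while 1.3B fails.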
Run command
NOTE: The model lnair/opt-1.3b-wikitext2 is a fine-tuned version of facebook/opt-1.3b (no architectural changes here). Nevertheless, it fails on both the lnair/opt-1.3b-wikitext2 and facebook/opt-1.3b checkpoints.
Error
NOTE: The script works for the OPT 125M and OPT 350M models without any errors. However, it fails on the OPT 1.3B model with the following error:
Thanks in advance!