aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

AttributeError in neuron_parallel_compile with NoneType compiled_hlo_status During Model Recompilation #842

Closed arashsadrieh closed 3 weeks ago

arashsadrieh commented 4 months ago

Environment:

PyTorch version: 1.13.1+cu117
Training model: LLaMA 70B
Cluster configuration: 32 node cluster on [specify instance types if relevant]

Issue Description:

While training the LLaMA 70B model on a 32 node cluster, I encountered a runtime error during an attempt to adjust the learning rate. I have a model that trains (with unstable loss). I reduced the learning rate from 0.0015 to 0.00015, and after 10 optimisation steps there was an unexpected error in the neuron_parallel_compile process, as follows:

2024-02-29 04:03:01.000535:  318159  INFO ||NEURON_PARALLEL_COMPILE||: sub-process 0 got exception: Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_parallel_compile.py", line 156, in parallel_compile
    for compiled_hlo, (status, retry, compile_time) in compiled_hlo_status.items():
AttributeError: 'NoneType' object has no attribute 'items'

2024-02-29 04:03:01.000535:  318159  INFO ||NEURON_PARALLEL_COMPILE||: sub-process 1 got exception: Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_parallel_compile.py", line 156, in parallel_compile
    for compiled_hlo, (status, retry, compile_time) in compiled_hlo_status.items():
AttributeError: 'NoneType' object has no attribute 'items'

2024-02-29 04:03:01.000535:  318159  INFO ||NEURON_PARALLEL_COMPILE||: sub-process 2 got exception: Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_parallel_compile.py", line 156, in parallel_compile
    for compiled_hlo, (status, retry, compile_time) in compiled_hlo_status.items():
AttributeError: 'NoneType' object has no attribute 'items'

2024-02-29 04:03:01.000535:  318159  INFO ||NEURON_PARALLEL_COMPILE||: sub-process 3 got exception: Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_parallel_compile.py", line 156, in parallel_compile
    for compiled_hlo, (status, retry, compile_time) in compiled_hlo_status.items():
AttributeError: 'NoneType' object has no attribute 'items'

jeffhataws commented 4 months ago

Thanks @arashsadrieh. We have identified a fix for these messages and it will be available in an upcoming release.

Furthermore, these messages affect only the final compilation status reporting, not the actual compilation, so you can ignore them and proceed to the actual training run.
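
For readers unfamiliar with this class of error, the traceback shows the usual pattern of calling `.items()` on a value that turned out to be `None`. The sketch below is illustrative only: `summarize_compile_status` is a hypothetical helper, not part of libneuronxla and not the fix shipped by the Neuron team. The only detail taken from the thread is the `(status, retry, compile_time)` tuple shape seen in the traceback.

```python
from typing import Dict, List, Optional, Tuple

# Shape taken from the traceback: compiled_hlo -> (status, retry, compile_time).
CompileStatus = Dict[str, Tuple[str, int, float]]


def summarize_compile_status(
    compiled_hlo_status: Optional[CompileStatus],
) -> List[str]:
    """Hypothetical helper (assumption, not the libneuronxla API).

    Builds one report line per compiled HLO and tolerates a missing
    status mapping instead of raising AttributeError.
    """
    if compiled_hlo_status is None:
        # Nothing was reported; skip the loop rather than call None.items().
        return ["no compilation status reported"]

    lines = []
    for compiled_hlo, (status, retry, compile_time) in compiled_hlo_status.items():
        lines.append(
            f"{compiled_hlo}: {status} (retries={retry}, {compile_time:.1f}s)"
        )
    return lines


if __name__ == "__main__":
    print(summarize_compile_status(None))
    print(summarize_compile_status({"graph_0.hlo": ("ok", 0, 1.5)}))
```

Because the guard sits only in the reporting step, a missing status mapping changes what gets printed and nothing else, which matches the statement above that the compilation itself is unaffected.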

aws-taylor commented 1 month ago

Hello @arashsadrieh,

We believe this is fixed in our 2.18 release (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#id8).

chafik-c commented 3 weeks ago

Hello @arashsadrieh, did you try the 2.18 release?

aws-donkrets commented 3 weeks ago

Hi arashsadrieh - Please try the 2.18 (or later) release to verify that your issue has been resolved. If you still see a problem or need more support, feel free to open a new ticket.