Closed arashsadrieh closed 3 weeks ago
Thanks @arashsadrieh. We have identified a fix for these messages, and it will be available in an upcoming release.
Furthermore, these messages affect only the final compilation status report, not the compilation itself, so you can ignore them and proceed to the actual training run.
Hello @arashsadrieh,
We believe this is fixed in our 2.18 release (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#id8).
Hello @arashsadrieh, did you try the 2.18 release?
Hi arashsadrieh - Please try the 2.18 (or later) release to verify that your issue has been resolved. If you still see a problem or need more support, feel free to open a new ticket.
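To confirm which versions are actually installed before re-testing, something like the following sketch may help. It only reads pip package metadata. The package names in `PACKAGES` are assumptions about a typical Neuron training environment, and `installed_versions` is a hypothetical helper, not part of any Neuron tool. Note that the SDK release number (2.18) is not itself a package version, so the printed versions would still need to be compared against the release notes linked above.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed package names for a PyTorch Neuron training setup; adjust as needed.
PACKAGES = ["torch", "torch-xla", "torch-neuronx", "neuronx-cc"]


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this on every node of the cluster would also show whether all nodes have the same versions installed.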
Environment:
PyTorch version: 1.13.1+cu117
Training model: LLaMA 70B
Cluster configuration: 32 node cluster on [specify instance types if relevant]
Issue Description:
While training the LLaMA 70B model on a 32 node cluster, I encountered a runtime error during an attempt to adjust the learning rate. I have a model that trains (with unstable loss). I reduced the learning rate from 0.0015 to 0.00015, and after 10 optimisation steps there was an unexpected error in the neuron_parallel_compile process, as follows: