Closed: awaelchli closed this issue 4 months ago.
Thank you for reporting. We think the issue could be caused by a concurrency bug in the NEFF caching mechanism. This bug was fixed in libneuronxla in the 2.12 release, so please upgrade to the latest libneuronxla version (shipped with the 2.14 release) and give it a try.
If the issue still exists, can you set the environment variable NEURON_FRAMEWORK_DEBUG=1? This should dump the HLO and NEFF files in the current directory. If you can share them, we can take a look on our end.
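For example, prefix the training command with the variable (the script name here is just a placeholder):
NEURON_FRAMEWORK_DEBUG=1 python train.py
or export it for the whole shell session before launching:
export NEURON_FRAMEWORK_DEBUG=1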
Hi @aws-rhsoln, thanks for the suggestion! Upgrading to the 2.14 release (neuronx-cc 2.10.0) unfortunately did not help with this error. I attached the files that were dumped to the cwd after running with NEURON_FRAMEWORK_DEBUG=1:
Debug files: node-0-files.zip node-1-files.zip
Cache directory: cache-node0.zip cache-node1.zip
Thanks for the help
Hello @awaelchli,
You're seeing this error because you're attempting to use a multi-node configuration with trn1.2xl instances. This configuration is not supported because it offers less performance than the equivalent network on a single trn1.32xl. Put another way, multi-node training really only makes sense once you've exhausted the capabilities of the largest individual instance type. Nevertheless, I've opened a ticket with the appropriate team to improve the error message here.
Please let us know if you run into issues while testing using trn1.32xl instances, or if you have any other questions.
Regards, Taylor
Thanks for getting back to me @aws-taylor. That's right, I tried on the smaller machines because they were easier to get for debugging purposes. I've now also tested on two trn1n.32xlarge instances, and the original compilation issue disappeared, but compilation now just hangs after printing the first PASS:
...
2023-10-06 15:38:56.000057: 33836 INFO ||NEURON_CACHE||: Compile cache path: /home/zeus/content/neuron_cache2
2023-10-06 15:38:56.000060: 33836 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/0f495309-edbb-4545-b08b-94901587532b/model.MODULE_310851798465585165+2271a024.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/0f495309-edbb-4545-b08b-94901587532b/model.MODULE_310851798465585165+2271a024.neff', '--model-type', 'transformer', '--distribution-strategy=nemo', '--verbose=35']
.
Compiler status PASS
Training: 0it [00:00, ?it/s]Training: 0% 0/301501 [00:00<?, ?it/s]Epoch 0: 0% 0/301501 [00:00<?, ?it/s] 2023-10-06 15:39:03.000831: 37001 INFO ||NEURON_CACHE||: Compile cache path: /home/zeus/content/neuron_cache2
2023-10-06 15:39:03.000832: 37001 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/68376a4e-6975-4369-9a1c-a29a1a57553f/model.MODULE_7728468862527416781+2271a024.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/68376a4e-6975-4369-9a1c-a29a1a57553f/model.MODULE_7728468862527416781+2271a024.neff', '--model-type', 'transformer', '--distribution-strategy=nemo', '--verbose=35']
.
I'm not sure what the next step is here. I couldn't find any debugging flags beyond the ones I've already tried to get more information.
Are there any updates? I am having the same issue with a simple DDP workload on two trn1.32xlarge instances.
My environment setup:
aws-neuronx-runtime-discovery==2.9
libneuronxla==0.5.476
neuronx-cc==2.10.0.35+3817a0c8c
neuronx-hwm==2.10.0.5+7b1976adf
torch-neuronx==1.13.1.1.11.0
torch-xla==1.13.1+torchneuronb
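For reference, the launch on each of the two nodes looks roughly like the following (a sketch only; the script name, master address, and port are placeholders rather than my actual workload):
NODE_RANK=0   # 0 on the first node, 1 on the second
torchrun --nnodes=2 --nproc_per_node=32 \
  --node_rank=$NODE_RANK \
  --master_addr=10.0.0.1 --master_port=29500 \
  train_ddp.py
where --nproc_per_node=32 matches the 32 NeuronCores on a trn1.32xlarge.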
Hi @awaelchli @woshiyyya,
Apologies for the long delay in response. It is possible that your cache was corrupted by a shared-filesystem file-locking issue that was fixed in a recent release.
I followed the neuron megatron example as you did (this assumes you have at least an 8-node cluster):
sbatch --nodes 4 compile.slurm ./gpt_23b.sh
sbatch --nodes 8 compile.slurm ./gpt_46b.sh
sbatch --nodes 8 compile.slurm ./gpt_175b.sh
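(For context: the compile.slurm job in that example pre-populates the compilation cache before the real run. Simplifying away the slurm/srun plumbing, it is roughly equivalent to wrapping the training script with the neuron_parallel_compile tool from torch-neuronx, e.g.
neuron_parallel_compile ./gpt_23b.sh
which extracts the training graphs and compiles them in parallel into the cache.)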
I observed that the compilation jobs finish successfully, with output that looks like:
2024-02-28 04:40:53.000647: 3633580 INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 41
2024-02-28 04:40:53.000648: 3633580 INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 41
2024-02-28 04:40:53.000648: 3633580 INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 0
Then I launched the actual training runs (again, assuming at least an 8-node cluster):
sbatch --nodes 4 run.slurm ./gpt_23b.sh
sbatch --nodes 8 run.slurm ./gpt_46b.sh
sbatch --nodes 8 run.slurm ./gpt_175b.sh
I observe that each of these runs properly, with output that looks like:
18/301501 [07:55<2210:43:46, 26.40s/it, loss=13.4, v_num=1232, reduced_train_loss=11.60, global_step=16.00, consumed_samples=2048.0, throughput=13.30, throughput_peak=13.30]
I am using the following packages (release 2.17):
aws-neuronx-runtime-discovery 2.9
libneuronxla 0.5.809
neuronx-cc 2.12.68.0+4480452af
neuronx-hwm 2.12.0.0+422c9037c
torch-neuronx 1.13.1.1.13.1
torch-xla 1.13.1+torchneurond
Let me know if you still have problems.
Thank you for addressing this and getting back to me, @jeffhataws.
I am running into a compilation error that I don't understand.
Full stack trace: logs.txt
This is a multi-node experiment with the neuron megatron example on trn1.2xlarge instances. I looked in the troubleshooting guide, and the only thing close to this issue I found was this: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html#compilation-er[…]-mounted-drive The cache folder is not shared between machines, so I ruled that out as the cause. What could the possible causes be here? Suggestions on how to continue debugging this would be highly appreciated, thanks!
Versions:
Instance type: trn1.2xlarge
OS: Ubuntu 20.04.6