aws / aws-k8s-tester

AWS Kubernetes tester, kubetest2 deployer implementation
Apache License 2.0

Bump Neuron SDK components versions #485

Closed by nkvetsinski 1 month ago

nkvetsinski commented 1 month ago

Issue #, if available:

Description of changes:

Noticed that the Neuron tests were failing with the following output:

torch.distributed.run: [WARNING] *****************************************
torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
torch.distributed.run: [WARNING] *****************************************
F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: --xla_gpu_simplify_all_fp_conversions=false --xla_gpu_force_compilation_parallelism=8
F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: --xla_gpu_simplify_all_fp_conversions=false --xla_gpu_force_compilation_parallelism=8
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 12) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
tests/testNeuronSingleAllReduce.py FAILED
Failures:
[1]:
time : 2024-09-18_02:41:46
host : neuronx-single-node-8q4fw
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 13)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 13
---------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-18_02:41:46
host : neuronx-single-node-8q4fw
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 12)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 12
===================================================

I inspected a core dump from one of the runs, which pointed toward updating the Neuron SDK. The tests pass with the versions in this PR.
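
For anyone triaging a similar failure, here is a minimal sketch of how the two key signals in the log above could be extracted: the flags that XLA rejected, and the signal behind the negative exit code. The helper names `unknown_xla_flags` and `signal_name` are hypothetical and not part of this repository or this PR, and the regular expression is an assumption fitted only to the abort line quoted above. It assumes a POSIX system, where signal 6 is SIGABRT.

```python
import re
import signal
from typing import List, Optional

# Assumption: the abort line always has the shape seen in the log above.
UNKNOWN_FLAGS_RE = re.compile(r"Unknown flags in XLA_FLAGS:\s*(.+)$")


def unknown_xla_flags(log_line: str) -> List[str]:
    """Return the flags XLA rejected, or an empty list if the line does not match."""
    match = UNKNOWN_FLAGS_RE.search(log_line)
    return match.group(1).split() if match else []


def signal_name(exitcode: int) -> Optional[str]:
    """Map a negative exit code (process killed by signal N) to the signal's name."""
    if exitcode >= 0:
        return None  # the process exited normally, not via a signal
    return signal.Signals(-exitcode).name


line = (
    "F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: "
    "--xla_gpu_simplify_all_fp_conversions=false "
    "--xla_gpu_force_compilation_parallelism=8"
)
print(unknown_xla_flags(line))
print(signal_name(-6))
```

This only helps read the log; it does not change the fix, which is the version bump in this PR.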

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.