aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

neff errors during multi-node DDP training #779

Closed: woshiyyya closed this issue 9 months ago

woshiyyya commented 11 months ago

I tried to run a toy example with the DDP strategy on 2 trn1.32xlarge instances. To simplify the workload, I launched only 1 worker per instance (2 in total), but still got the following neff error:

mismatching number of cc ops, expected 6 received 3. Check if ranks 1 and 0 are using the same neff.

2023-Oct-27 14:18:42.0222 35560:35755 ERROR   ENC:validate_neff_cc_ops                    [nec_dev 0] mismatching number of cc ops, expected 6 received 3. Check if ranks 1 and 0 are using the same neff
2023-Oct-27 14:18:42.0222 35560:35755 ERROR   ENC:enc_load_operations                     [nec_dev 0] failed to validate neff across the replica group
2023-Oct-27 14:18:42.0222 35560:35755 ERROR  TDRV:kbl_model_add                           create_engine_refill_rings_v1() error
2023-Oct-27 14:18:42.0222 35560:35755 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2023-Oct-27 14:18:42.0222 35560:35755 ERROR  NMGR:stage_kelf_models                       Failed to stage graph: kelf-0.json to NeuronCore
2023-Oct-27 14:18:42.0222 35560:35755 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/neuroncc_compile_workdir/facc2429-b312-4584-a684-727c8ebae09e/model.MODULE_17490684959333184739+d41d8cd9.neff, err: 1
2023-Oct-27 14:18:42.223058: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".

Training script: https://gist.github.com/woshiyyya/620a8127d4787f8ae9e9dd45fccc4b67.

To repro, you can launch multiple workers with the launcher you are familiar with (mpirun/torchrun/...); each worker calls the train_func. I was using the environment variable settings defined in this PR, more specifically in the _set_xla_env_vars function [link].

Update 11/06/2023:

You can also launch this training script with Ray by installing this pip wheel.
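For context, the Ray launch path is roughly the following. This is a minimal sketch, not the actual test script: the `TorchTrainer` / `trainer.fit()` usage matches the stacktrace below, but `ScalingConfig`, the worker count, and the `neuron_cores` resource key are assumptions about how the run was configured.

```python
# Illustrative sketch only: run train_func on 2 workers (one per trn1 node)
# with Ray Train. Assumes a Ray cluster spanning both instances and that Ray
# exposes NeuronCores via the "neuron_cores" custom resource.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Body of the training loop from the gist linked above.
    ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(
        num_workers=2,                             # one worker per trn1.32xlarge node
        resources_per_worker={"neuron_cores": 1},  # assumed resource key
    ),
)
result = trainer.fit()
```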

Full stacktrace:

```
Training started without custom configuration.
(TorchTrainer pid=12624, ip=10.1.161.26) Started distributed worker processes:
(TorchTrainer pid=12624, ip=10.1.161.26) - (ip=10.1.161.26, pid=12769) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=12624, ip=10.1.161.26) - (ip=10.1.177.148, pid=35385) world_rank=1, local_rank=0, node_rank=1
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:34.000099: 14574 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:34.000100: 14574 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/8a2330bc-c932-4460-8a2a-5032eaa12e97/model.MODULE_14128816785490974364+d41d8cd9.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/8a2330bc-c932-4460-8a2a-5032eaa12e97/model.MODULE_14128816785490974364+d41d8cd9.neff', '--verbose=35']
(RayTrainWorker pid=12769, ip=10.1.161.26) .
(RayTrainWorker pid=12769, ip=10.1.161.26)
(RayTrainWorker pid=12769, ip=10.1.161.26)
(RayTrainWorker pid=12769, ip=10.1.161.26) Compiler status PASS
(RayTrainWorker pid=35385) 2023-10-27 14:18:37.000931: 37444 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=35385) 2023-10-27 14:18:37.000932: 37444 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/7cbae8f8-1ea5-41a0-8be9-79f0e9cd1dd6/model.MODULE_9947433353067385712+d41d8cd9.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/7cbae8f8-1ea5-41a0-8be9-79f0e9cd1dd6/model.MODULE_9947433353067385712+d41d8cd9.neff', '--verbose=35'] [repeated 5x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) . [repeated 5x across cluster]
(RayTrainWorker pid=35385) [W logger.cpp:322] Warning: Time stats are currently only collected for CPU and CUDA devices. Please refer to CpuTimer or CudaTimer for how to register timer for other device type. (function operator())
(RayTrainWorker pid=12769, ip=10.1.161.26) [repeated 11x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) Compiler status PASS [repeated 5x across cluster]
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR ENC:validate_neff_cc_ops [nec_dev 0] mismatching number of cc ops, expected 6 received 3. Check if ranks 1 and 0 are using the same neff
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR ENC:enc_load_operations [nec_dev 0] failed to validate neff across the replica group
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR TDRV:kbl_model_add create_engine_refill_rings_v1() error
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/neuroncc_compile_workdir/facc2429-b312-4584-a684-727c8ebae09e/model.MODULE_17490684959333184739+d41d8cd9.neff, err: 1
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.223058: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233774: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233791: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233794: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233797: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span, absl::lts_20220623::Span)
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233799: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233802: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233804: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function const&)
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233806: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233809: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233811: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233813: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233815: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233817: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233820: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:c_localservice/replica:0/task:1:
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233822: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233824: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233827: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233829: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G12]]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233831: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233833: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233836: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233844: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233846: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233849: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
2023-10-27 14:18:42,506 ERROR tune_controller.py:1383 -- Trial task failed for trial TorchTrainer_574f4_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 2564, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=12624, ip=10.1.161.26, actor_id=412959b59ebec3f5572484aa06000000, repr=TorchTrainer)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/tune/trainable/trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/train/_internal/utils.py", line 43, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=12769, ip=10.1.161.26, actor_id=5712cc6b027323366a2b484b06000000, repr=)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/train/_internal/utils.py", line 118, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "release/train_tests/trainium/test_trainium.py", line 52, in train_func
    print(f"Loss after step {step}: {loss.cpu()}")
RuntimeError: INTERNAL: From /job:c_localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
    [[{{node XRTExecute}}]]
    [[XRTExecute_G12]]
  (1) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
    [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
Training errored after 0 iterations at 2023-10-27 14:18:42. Total running time: 27s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-27_14-18-11/TorchTrainer_574f4_00000_0_2023-10-27_14-18-15/error.txt
2023-10-27 14:18:42,512 ERROR tune.py:1043 -- Trials did not complete: [TorchTrainer_574f4_00000]
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=12624, ip=10.1.161.26, actor_id=412959b59ebec3f5572484aa06000000, repr=TorchTrainer)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/tune/trainable/trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/train/_internal/utils.py", line 43, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=12769, ip=10.1.161.26, actor_id=5712cc6b027323366a2b484b06000000, repr=)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/tmp/ray/session_2023-10-27_11-59-10_730715_155/runtime_resources/py_modules_files/_ray_pkg_af0222669bc6e932/ray/train/_internal/utils.py", line 118, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "release/train_tests/trainium/test_trainium.py", line 52, in train_func
    print(f"Loss after step {step}: {loss.cpu()}")
RuntimeError: INTERNAL: From /job:c_localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
    [[{{node XRTExecute}}]]
    [[XRTExecute_G12]]
  (1) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
    [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "release/train_tests/trainium/test_trainium.py", line 63, in
    result = trainer.fit()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/train/base_trainer.py", line 618, in fit
    raise TrainingFailedError(
ray.train.base_trainer.TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run. To continue this run, you can use: `trainer = TorchTrainer.restore("/home/ray/ray_results/TorchTrainer_2023-10-27_14-18-11")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures = -1` for unlimited retries.
(RayTrainWorker pid=35385) 2023-10-27 14:18:39.000751: 37552 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache [repeated 2x across cluster]
(RayTrainWorker pid=35385) 2023-10-27 14:18:39.000752: 37552 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/facc2429-b312-4584-a684-727c8ebae09e/model.MODULE_17490684959333184739+d41d8cd9.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/facc2429-b312-4584-a684-727c8ebae09e/model.MODULE_17490684959333184739+d41d8cd9.neff', '--verbose=35'] [repeated 2x across cluster]
(RayTrainWorker pid=35385) . [repeated 2x across cluster]
(RayTrainWorker pid=35385) [repeated 3x across cluster]
(RayTrainWorker pid=35385) Compiler status PASS [repeated 2x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-Oct-27 14:18:42.0222 12923:13118 ERROR ENC:validate_neff_cc_ops [nec_dev 0] mismatching number of cc ops, expected 3 received 6. Check if ranks 0 and 1 are using the same neff
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-Oct-27 14:18:42.0222 12923:13118 ERROR ENC:enc_load_operations [nec_dev 0] failed to validate neff across the replica group
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-Oct-27 14:18:42.0222 12923:13118 ERROR TDRV:kbl_model_add create_engine_refill_rings_v1() error
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-Oct-27 14:18:42.0223 12923:13118 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-Oct-27 14:18:42.0223 12923:13118 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-Oct-27 14:18:42.0223 12923:13118 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/neuroncc_compile_workdir/9d67258c-a071-4fa1-9058-20242865e8ac/model.MODULE_10619365482826896718+d41d8cd9.neff, err: 1
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.223725: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240624: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240646: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240648: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240651: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span, absl::lts_20220623::Span)
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240654: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240684: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G12]] [repeated 6x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240659: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function const&)
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240667: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240670: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240674: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:c_localservice/replica:0/task:0:
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240677: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240686: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure". [repeated 2x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240688: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]] [repeated 2x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240690: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240693: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240695: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:42.240705: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
```
jluntamazon commented 11 months ago

Hello,

We started looking at this issue and have reproduced a similar error, but have not yet determined the root cause. We will update here when we have a known resolution/workaround.

woshiyyya commented 11 months ago

Hey @jluntamazon , thanks for investigating this issue! Do you have any findings so far?

aws-rhsoln commented 11 months ago

We were able to replicate the issue. It occurs because the two workers are running different graphs: the script you shared prints a tensor on rank 0 but not on rank 1. Printing a tensor cuts the graph and executes it, so rank 0 executes an extra graph for the tensor print while rank 1 does no such work. This causes a mismatch between the graphs on each rank, and hence the errors you see. The easiest way to avoid the issue is to add a mark_step after the optimizer.step; this executes the step's graph on every rank, so both ranks run the same graph.
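For illustration, a minimal sketch of where the mark_step goes (this is not the attached script: the model, data, and loss below are placeholders, and the DDP wrapping and distributed init from the original script are omitted):

```python
# Sketch: an XLA training step on Trainium where xm.mark_step() after
# optimizer.step() keeps the compiled graph identical on every rank.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(16, 1).to(device)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(10):
    inputs = torch.randn(8, 16).to(device)                 # placeholder batch
    labels = torch.randn(8, 1).to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # cut and execute the step graph on every rank

    # Rank-dependent work (printing) now reads an already-materialized tensor,
    # so it no longer introduces an extra graph on rank 0 only.
    if xm.get_ordinal() == 0:
        print(f"Loss after step {step}: {loss.cpu()}")
```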

We added the mark_step to the script and it now works fine. I have attached the updated script: script.txt

aws-donkrets commented 9 months ago

Hi woshiyyya - hope aws-rhsoln's updated script solved your issue. If you encounter another issue, please open a new ticket.