Closed woshiyyya closed 9 months ago
Hello,
We started looking at this issue and have reproduced a similar error, but have not yet determined the root cause. We will update here once we have a known resolution or workaround.
Hey @jluntamazon, thanks for investigating this issue! Do you have any findings so far?
We were able to replicate the issue. It occurs because the two workers are running different graphs: the script you shared prints a tensor on rank 0, while rank 1 prints nothing. Printing a tensor cuts a graph and executes it, so rank 0 executes an extra graph for the tensor print that rank 1 never runs. The graphs on each rank therefore no longer match, which causes the errors you see. The easiest way to avoid this is to add a `mark_step` after `optimizer.step()`; that cuts the graph at the same point in the loop on every rank, so both ranks execute the same graph.
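For reference, a minimal sketch of where the `mark_step` goes, assuming a plain torch_xla/DDP training loop; `model`, `optimizer`, and `train_loader` are placeholders, not the attached script:

```python
# Minimal sketch of the fix; model/optimizer/train_loader are placeholders.
import torch_xla.core.xla_model as xm

device = xm.xla_device()

for batch in train_loader:
    optimizer.zero_grad()
    loss = model(batch.to(device))
    loss.backward()
    optimizer.step()

    # Cut and execute the lazily-traced graph here on *every* rank, so all
    # ranks run the same graph for the training step.
    xm.mark_step()

    # A rank-0-only debug print no longer introduces a graph that rank 1
    # never runs, because the step graph was already cut above.
    if xm.get_ordinal() == 0:
        print(loss)
```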
We added the `mark_step` to the script and it now works fine. I have attached the updated script: script.txt
Hi woshiyyya - hope aws-rhsoln's updated script solved your issue. If you encounter another issue, please open a new ticket.
I tried to run a toy example with the DDP strategy on 2 trn1.32xlarge instances. To simplify the workload, I launched only 1 worker per instance (2 in total), but still got the following neff error: mismatching number of cc ops, expected 6 received 3. Check if ranks 1 and 0 are using the same neff.

Training script: https://gist.github.com/woshiyyya/620a8127d4787f8ae9e9dd45fccc4b67
To repro, launch multiple workers with a launcher you are familiar with (mpirun/torchrun/...); each worker calls the `train_func`. A hypothetical torchrun launch is sketched below. I was using the environment variable settings defined in this PR, more specifically in its `_set_xla_env_vars` function [link].

Update 11/06/2023:
You can also launch this training script with Ray by installing this pip wheel.
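As an example (a hypothetical launch, not the exact commands I used; the rendezvous endpoint, node ranks, port, and the `train.py` script name are placeholders), one worker per node with torchrun would look roughly like this:

```bash
# Run once on each of the 2 trn1.32xlarge nodes, with --node_rank=0 and 1 respectively.
# $HEAD_NODE_IP and train.py (which just calls train_func) are placeholders.
torchrun \
  --nnodes=2 \
  --nproc_per_node=1 \
  --node_rank=$NODE_RANK \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$HEAD_NODE_IP:29400 \
  train.py
```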
Full stacktrace:
```
Training started without custom configuration.
(TorchTrainer pid=12624, ip=10.1.161.26) Started distributed worker processes:
(TorchTrainer pid=12624, ip=10.1.161.26) - (ip=10.1.161.26, pid=12769) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=12624, ip=10.1.161.26) - (ip=10.1.177.148, pid=35385) world_rank=1, local_rank=0, node_rank=1
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:34.000099: 14574 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(RayTrainWorker pid=12769, ip=10.1.161.26) 2023-10-27 14:18:34.000100: 14574 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/8a2330bc-c932-4460-8a2a-5032eaa12e97/model.MODULE_14128816785490974364+d41d8cd9.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/8a2330bc-c932-4460-8a2a-5032eaa12e97/model.MODULE_14128816785490974364+d41d8cd9.neff', '--verbose=35']
(RayTrainWorker pid=12769, ip=10.1.161.26) .
(RayTrainWorker pid=12769, ip=10.1.161.26)
(RayTrainWorker pid=12769, ip=10.1.161.26)
(RayTrainWorker pid=12769, ip=10.1.161.26) Compiler status PASS
(RayTrainWorker pid=35385) 2023-10-27 14:18:37.000931: 37444 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=35385) 2023-10-27 14:18:37.000932: 37444 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/7cbae8f8-1ea5-41a0-8be9-79f0e9cd1dd6/model.MODULE_9947433353067385712+d41d8cd9.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/7cbae8f8-1ea5-41a0-8be9-79f0e9cd1dd6/model.MODULE_9947433353067385712+d41d8cd9.neff', '--verbose=35'] [repeated 5x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) . [repeated 5x across cluster]
(RayTrainWorker pid=35385) [W logger.cpp:322] Warning: Time stats are currently only collected for CPU and CUDA devices. Please refer to CpuTimer or CudaTimer for how to register timer for other device type. (function operator())
(RayTrainWorker pid=12769, ip=10.1.161.26) [repeated 11x across cluster]
(RayTrainWorker pid=12769, ip=10.1.161.26) Compiler status PASS [repeated 5x across cluster]
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR ENC:validate_neff_cc_ops [nec_dev 0] mismatching number of cc ops, expected 6 received 3. Check if ranks 1 and 0 are using the same neff
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR ENC:enc_load_operations [nec_dev 0] failed to validate neff across the replica group
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR TDRV:kbl_model_add create_engine_refill_rings_v1() error
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
(RayTrainWorker pid=35385) 2023-Oct-27 14:18:42.0222 35560:35755 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/neuroncc_compile_workdir/facc2429-b312-4584-a684-727c8ebae09e/model.MODULE_17490684959333184739+d41d8cd9.neff, err: 1
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.223058: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: Nrt::Load failed on NeuronCores 0-0(2): nrt_load status=1, error message="Non specific failure".
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233774: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233791: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233794: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
(RayTrainWorker pid=35385) 2023-10-27 14:18:42.233797: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span
```