aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

INVALID_ARGUMENT - NEFF is invalid; it could be broken or hacked #742

Closed by awaelchli 4 months ago

awaelchli commented 10 months ago

I am running into a compilation error that I don't understand:

  (0) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".

Full stack trace:

2023-09-14 11:56:37.000888: INFO ||NCC_WRAPPER||: Compile cache path: /home/zeus/content/neuron_cache2
.
Compiler status PASS
Training: 0it [00:00, ?it/s]Training:   0% 0/301501 [00:00<?, ?it/s]Epoch 0:   0% 0/301501 [00:00<?, ?it/s] 2023-09-14 11:56:49.000511: INFO ||NCC_WRAPPER||: Compile cache path: /home/zeus/content/neuron_cache2
.
Compiler status PASS
2023-09-14 11:57:01.000084: INFO ||NCC_WRAPPER||: Compile cache path: /home/zeus/content/neuron_cache2
..
Compiler status PASS
2023-09-14 11:57:30.395392: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.396510: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".

2023-09-14 11:57:30.442118: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-09-14 11:57:30.442169: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-09-14 11:57:30.442180: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    tsl::CurrentStackTrace()
2023-09-14 11:57:30.442187: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-09-14 11:57:30.442192: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-09-14 11:57:30.442200: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.442208: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-09-14 11:57:30.442217: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.442226: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.442234: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.442243: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    clone
2023-09-14 11:57:30.442255: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-09-14 11:57:30.442264: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-09-14 11:57:30.442274: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INVALID_ARGUMENT: From /job:c_localservice/replica:0/task:0:
2023-09-14 11:57:30.442283: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-09-14 11:57:30.442289: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.442295: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-09-14 11:57:30.442307: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[XRTExecute_G7]]
2023-09-14 11:57:30.442316: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.442322: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-09-14 11:57:30.442326: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-09-14 11:57:30.442333: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-09-14 11:57:30.442341: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-09-14 11:57:30.442348: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.442356: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-09-14 11:57:30.442406: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-09-14 11:57:30.442418: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    tsl::CurrentStackTrace()
2023-09-14 11:57:30.442429: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-09-14 11:57:30.442594: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-09-14 11:57:30.442893: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.443045: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-09-14 11:57:30.443059: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.443082: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.443088: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-09-14 11:57:30.443101: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    clone
2023-09-14 11:57:30.443605: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-09-14 11:57:30.443647: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-09-14 11:57:30.443734: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INVALID_ARGUMENT: From /job:c_localservice/replica:0/task:0:
2023-09-14 11:57:30.443744: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-09-14 11:57:30.443750: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.443756: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-09-14 11:57:30.443772: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[XRTExecute_G7]]
2023-09-14 11:57:30.443862: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.443937: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-09-14 11:57:30.443947: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-09-14 11:57:30.443965: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-09-14 11:57:30.444032: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-09-14 11:57:30.444043: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
2023-09-14 11:57:30.444077: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
Error executing job with overrides: ['trainer.devices=2', 'trainer.num_nodes=2', 'trainer.max_epochs=null', 'trainer.max_steps=300000', 'trainer.val_check_interval=300000', 'trainer.log_every_n_steps=1', 'trainer.limit_val_batches=1', 'trainer.limit_test_batches=1', 'trainer.accumulate_grad_batches=1', 'trainer.precision=32', 'model.micro_batch_size=1', 'model.global_batch_size=64', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.max_position_embeddings=128', 'model.encoder_seq_length=128', 'model.hidden_size=1024', 'model.ffn_hidden_size=4096', 'model.num_layers=8', 'model.num_attention_heads=8', 'model.init_method_std=0.021', 'model.hidden_dropout=0.1', 'model.layernorm_epsilon=1e-5', 'model.tokenizer.vocab_file=/home/zeus/content/examples_datasets/gpt2/gpt2-vocab.json', 'model.tokenizer.merge_file=/home/zeus/content/examples_datasets/gpt2/gpt2-merges.txt', 'model.data.data_prefix=[1.0,/home/zeus/content/examples_datasets/gpt2/my-gpt2_text_document]', 'model.data.num_workers=1', 'model.data.seq_length=128', 'model.optim.name=adamw', 'model.optim.capturable=True', 'model.optim.lr=0.00015', 'model.optim.betas=[0.9,0.95]', 'model.optim.weight_decay=0.01', 'model.optim.sched.name=CosineAnnealing', 'model.optim.sched.warmup_steps=750', 'model.optim.sched.constant_steps=80000', 'model.optim.sched.min_lr=1.0e-5', 'model.sequence_parallel=True', 'model.activations_checkpoint_granularity=full', 'model.activations_checkpoint_method=uniform', 'model.activations_checkpoint_num_layers=1', '+model.save_xser=False', 'exp_manager.create_tensorboard_logger=True', 'exp_manager.resume_if_exists=False', 'exp_manager.resume_ignore_no_checkpoint=False', 'exp_manager.create_checkpoint_callback=False', 'exp_manager.explicit_log_dir=null', '+exp_manager.checkpoint_callback_params.train_time_interval=3600', 'model.use_cpu_initialization=True']

Traceback (most recent call last):
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 129, in <module>
    main()
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/nemo/core/config/hydra_runner.py", line 105, in wrapper
    _run_hydra(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 126, in main
    trainer.fit(model)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/nemo/collections/nlp/parts/nlp_overrides.py", line 120, in launch
    results = function(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/nemo/collections/nlp/parts/nlp_overrides.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 121, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/optim/adamw.py", line 120, in step
    loss = closure()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 107, in _wrap_closure
    closure_result = closure()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 280, in training_step
    return self.model(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/nemo/utils/model_utils.py", line 364, in wrap_training_step
    output_dict = wrapped(*args, **kwargs)
  File "/teamspace/studios/this_studio/neuronx-nemo-megatron/nemo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 356, in training_step
    losses_per_micro_batch = fwd_bwd_function(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/apex/transformer/pipeline_parallel/schedules/fwd_bwd_no_pipelining.py", line 80, in forward_backward_no_pipelining
    cur_micro_batch = next(data_iterator)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch_xla/distributed/parallel_loader.py", line 36, in __next__
    return self.next()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch_xla/distributed/parallel_loader.py", line 48, in next
    xm.mark_step()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 988, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: INVALID_ARGUMENT: From /job:c_localservice/replica:0/task:0:
2 root error(s) found.
  (0) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
     [[{{node XRTExecute}}]]
     [[XRTExecute_G12]]
  (1) INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INVALID_ARGUMENT: tensorflow/libtpu/neuron/nrt_adaptor.cc:311:NEFF is invalid; it could be broken or hacked: nrt_load: status=2, error message="Invalid argument".
./test_lightning_new.sh: line 160:  1047 Aborted                 (core dumped) $MAYBE_COMPILE python megatron_gpt_pretraining.py $SCRIPT_ARGS
Epoch 0:   0%|          | 0/301501 [00:44<?, ?it/s]

logs.txt

This is a multi-node experiment with the neuron megatron example on trn1.2xlarge instances. I looked through the troubleshooting guide. The only thing close to this issue I found was this: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html#compilation-er[…]-mounted-drive The cache folder is not shared between machines, so I ruled this out as the cause.

What could the possible causes be here? Suggestions on how to continue debugging this would be highly appreciated, thanks!

Versions:

pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla                  0.5.440
nemo-toolkit                  1.14.0              /teamspace/studios/this_studio/neuronx-nemo-megatron/nemo
neuronx-cc                    2.9.0.40+07376825f
neuronx-hwm                   2.9.0.2+f79d59e7b
torch-neuronx                 1.13.1.1.10.1
torch-xla                     1.13.1+torchneurona

Instance type: trn1.2xlarge OS: Ubuntu 20.04.6

aws-rhsoln commented 10 months ago

Thank you for reporting. We think the issue could be due to a concurrency bug in the NEFF caching mechanism. This bug was fixed in libneuronxla in the 2.12 release, so please upgrade to the latest libneuronxla version (which came out with the 2.14 release) and give it a try. If the issue still exists, can you set the environment variable NEURON_FRAMEWORK_DEBUG=1? This should dump the HLO and NEFF in the current directory. If you can share the NEFF and HLO, we can take a look on our end.
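For example, something along these lines (the pip index URL and the launch command are assumptions here; adjust them to your setup):

# upgrade the Neuron Python packages; index URL assumed to be the public Neuron pip repository
pip install --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com libneuronxla torch-neuronx neuronx-cc

# re-run with framework debug output enabled so the HLO and NEFF artifacts land in the current working directory
export NEURON_FRAMEWORK_DEBUG=1
./test_lightning_new.sh   # or however you normally launch megatron_gpt_pretraining.py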

awaelchli commented 10 months ago

Hi @aws-rhsoln, thanks for the suggestion! Unfortunately, upgrading to the 2.14 release (neuronx-cc 2.10.0) did not help with this error. I attached the files that were dumped to the cwd after running with NEURON_FRAMEWORK_DEBUG=1:

Debug files: node-0-files.zip node-1-files.zip

Cache directory: cache-node0.zip cache-node1.zip

Thanks for the help

aws-taylor commented 9 months ago

Hello @awaelchli,

You're seeing this error because you're attempting to use a multi-node configuration with trn1.2xl instances. This configuration is not supported because a network of trn1.2xl instances offers less performance than a single trn1.32xl. Put another way, multi-node training really only makes sense once you've exhausted the capabilities of the largest individual instance type. Nevertheless, I've opened a ticket with the appropriate team to improve the error message here.

Please let us know if you run into issues while testing using trn1.32xl instances, or if you have any other questions.

Regards, Taylor

awaelchli commented 9 months ago

Thanks for getting back to me, @aws-taylor. That's right, I tried on the smaller machines because they were easier to get for debugging purposes. I've now also tested on two trn1n.32xlarge instances; the original error disappeared, but compilation now just hangs after printing the first PASS:

...

2023-10-06 15:38:56.000057:  33836  INFO ||NEURON_CACHE||: Compile cache path: /home/zeus/content/neuron_cache2
2023-10-06 15:38:56.000060:  33836  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/0f495309-edbb-4545-b08b-94901587532b/model.MODULE_310851798465585165+2271a024.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/0f495309-edbb-4545-b08b-94901587532b/model.MODULE_310851798465585165+2271a024.neff', '--model-type', 'transformer', '--distribution-strategy=nemo', '--verbose=35']
.
Compiler status PASS
Training: 0it [00:00, ?it/s]Training:   0% 0/301501 [00:00<?, ?it/s]Epoch 0:   0% 0/301501 [00:00<?, ?it/s] 2023-10-06 15:39:03.000831:  37001  INFO ||NEURON_CACHE||: Compile cache path: /home/zeus/content/neuron_cache2
2023-10-06 15:39:03.000832:  37001  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/68376a4e-6975-4369-9a1c-a29a1a57553f/model.MODULE_7728468862527416781+2271a024.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/68376a4e-6975-4369-9a1c-a29a1a57553f/model.MODULE_7728468862527416781+2271a024.neff', '--model-type', 'transformer', '--distribution-strategy=nemo', '--verbose=35']
.

I'm not sure what the next step is here. I couldn't find any debugging flags I haven't already tried to get more information.

woshiyyya commented 9 months ago

Are there any updates? I am having the same issue with a simple DDP workload on 2 trn1.32xlarge instances.

My environment setup:

aws-neuronx-runtime-discovery==2.9
libneuronxla==0.5.476
neuronx-cc==2.10.0.35+3817a0c8c
neuronx-hwm==2.10.0.5+7b1976adf
torch-neuronx==1.13.1.1.11.0
torch-xla==1.13.1+torchneuronb

jeffhataws commented 4 months ago

Hi @awaelchli @woshiyyya ,

Apologies for the long delay in responding. It is possible that your cache was corrupted due to a shared-file-system file-locking issue that was fixed in a recent release.
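If the cache did get corrupted, one recovery step (an assumption on my part rather than something the fix requires) is to clear the compile cache after upgrading so everything recompiles from scratch:

# cache path taken from the logs earlier in this thread; adjust it to your configured compile cache location
rm -rf /home/zeus/content/neuron_cache2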

I followed the neuron megatron example as you did (assuming you have at least an 8-node cluster):

sbatch --nodes 4 compile.slurm ./gpt_23b.sh
sbatch --nodes 8 compile.slurm ./gpt_46b.sh
sbatch --nodes 8 compile.slurm ./gpt_175b.sh

After observing that the compilation jobs finished successfully with output that looks like:

2024-02-28 04:40:53.000647:  3633580  INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 41
2024-02-28 04:40:53.000648:  3633580  INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 41
2024-02-28 04:40:53.000648:  3633580  INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 0

Then I ran the actual training runs (again assuming at least an 8-node cluster):

sbatch --nodes 4 run.slurm ./gpt_23b.sh
sbatch --nodes 8 run.slurm ./gpt_46b.sh
sbatch --nodes 8 run.slurm ./gpt_175b.sh

I observed that each of these runs properly, with output that looks like:

18/301501 [07:55<2210:43:46, 26.40s/it, loss=13.4, v_num=1232, reduced_train_loss=11.60, global_step=16.00, consumed_samples=2048.0, throughput=13.30, throughput_peak=13.30]
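For context, the compile.slurm step above is the ahead-of-time compilation pass. Assuming it wraps the training script with neuron_parallel_compile (which is what emits the NEURON_PARALLEL_COMPILE summary shown earlier), the pattern is roughly:

# trace and compile all graphs into the shared cache; no real training happens in this pass (a sketch, not the exact script contents)
neuron_parallel_compile ./gpt_23b.sh
# run.slurm then launches the same script without the wrapper, so execution loads the cached NEFFs instead of recompiling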

I am using the following packages (release 2.17):

aws-neuronx-runtime-discovery 2.9
libneuronxla                  0.5.809
neuronx-cc                    2.12.68.0+4480452af
neuronx-hwm                   2.12.0.0+422c9037c
torch-neuronx                 1.13.1.1.13.1
torch-xla                     1.13.1+torchneurond

Let me know if you still have problems.

awaelchli commented 4 months ago

Thank you for addressing this and getting back to me, @jeffhataws.