Status: Closed. greenguy33 closed this issue 3 weeks ago.
Hi greenguy33 - I haven't looked specifically into your model, but I noticed two things in your report: the NEFF Not Found message is accurate and is most likely a consequence of the NEFF not getting generated, and the reason it was not generated is the Floating point exception (core dumped) error message. That exception may be coming from something in your environment that the SDK depends on.

@aws-donkrets thanks! I can definitely try an upgrade, as I suspect this is some version-incompatibility problem. Could you help me understand which number from what I have posted above indicates the SDK version?
greenguy33 - The SDK is made up of various components. There is no single component number that is the SDK version, but you can use the component versions to map back to an SDK release. In this case, I used your neuronx-cc=2.7.0.40+f7c6cf2a3 reference to match it to SDK 2.11, released in June (see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/content.html#neuron-2-11-0-06-14-2023). You can find the contents of previous releases here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/content.html#pre-release-content
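For reference, a quick way to list the locally installed Neuron component versions (so they can be matched against the release-notes pages above) is to query pip; the compiler also prints its own version, assuming it exposes the usual --version flag, which is an assumption here rather than something shown in this thread:

pip list | grep -i neuron
neuronx-cc --version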
@aws-donkrets I have upgraded to the latest versions of these libraries using the resources at the link you provided. I am now getting a different error during training:
BIRCodegen does not support broadcast patterns, but found one in {0,+,0}[45]
Any idea what could be going on?
Full stack trace below.
[INFO|trainers.py:378] 2023-11-09 21:35:33,643 >> Saving model checkpoint to test_clm
[INFO|configuration_utils.py:460] 2023-11-09 21:35:33,645 >> Configuration saved in test_clm/config.json
[INFO|configuration_utils.py:544] 2023-11-09 21:35:33,645 >> Configuration saved in test_clm/generation_config.json
2023-11-09 21:35:34.000903: 15709 INFO ||NEURON_CACHE||: Compile cache path: /tmp/tmp_b9balli/neuron-compile-cache
2023-11-09 21:35:34.000905: 15709 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.neff', '--verbose=35']
.............
2023-11-09 21:39:52.000474: 15709 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb after 0 retries.
Traceback (most recent call last):
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/bin/neuron_cc_wrapper", line 8, in <module>
sys.exit(main())
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 298, in main
neuron_xla_compile(args.input_file, " ".join(unparsed_args), args.output, cache_key=args.cache_key,
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 268, in neuron_xla_compile
ret = compile_with_cache(output, compile_cache, cache_key, execution_mode,
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 201, in compile_with_cache
raise(e)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 181, in compile_with_cache
ret = call_neuron_compiler(
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 129, in call_neuron_compiler
raise RuntimeError(f"Failed compilation with {cmd}: {res.stderr.decode()}")
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.neff', '--verbose=35']:
2023-11-09T21:39:52Z BIRCodegen does not support broadcast patterns, but found one in {0,+,0}[45]
2023-11-09 21:39:52.779806: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.781583: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.908196: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-09 21:39:52.908242: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-09 21:39:52.908251: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-11-09 21:39:52.908255: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-09 21:39:52.908259: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-09 21:39:52.908264: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.908270: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-09 21:39:52.908276: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.908280: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.908284: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.908287: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-11-09 21:39:52.908290: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-09 21:39:52.908293: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.908297: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-11-09 21:39:52.908300: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-09 21:39:52.908306: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.908309: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-11-09 21:39:52.908313: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G15]]
2023-11-09 21:39:52.908319: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.908327: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-11-09 21:39:52.908331: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-09 21:39:52.908336: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-09 21:39:52.908341: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-09 21:39:52.908346: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
File "run_clm.py", line 616, in <module>
2023-11-09 21:39:52.909526: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-09 21:39:52.909556: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-09 21:39:52.909564: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-11-09 21:39:52.909570: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-09 21:39:52.909575: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-09 21:39:52.909580: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.909583: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-09 21:39:52.909587: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.909590: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.909594: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.909597: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-11-09 21:39:52.909603: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-09 21:39:52.909606: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-11-09 21:39:52.909609: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-11-09 21:39:52.909612: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-09 21:39:52.909616: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.909619: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-11-09 21:39:52.909623: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G15]]
2023-11-09 21:39:52.909627: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.909633: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-11-09 21:39:52.909636: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-09 21:39:52.909639: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-09 21:39:52.909642: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-09 21:39:52.909654: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.909659: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
File "run_clm.py", line 616, in <module>
main()
File "run_clm.py", line 565, in main
trainer.save_model() # Saves the tokenizer too for easy upload
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 419, in save_model
self._save_xla(output_dir)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 410, in _save_xla
self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/modeling_utils.py", line 1993, in save_pretrained
save_function(shard, os.path.join(save_directory, shard_file))
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1216, in save
cpu_data = _maybe_convert_to_cpu(data, convert=should_write_data)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1234, in _maybe_convert_to_cpu
return ToXlaTensorArena(convert_fn, select_fn).transform(data)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 429, in transform
self._convert()
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 401, in _convert
self._converted_tensors = self._convert_fn(self._tensors)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1225, in convert_fn
torch_xla._XLAC._xla_sync_multi(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
[[XRTExecute_G15]]
(1) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
main()
File "run_clm.py", line 565, in main
trainer.save_model() # Saves the tokenizer too for easy upload
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 419, in save_model
self._save_xla(output_dir)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 410, in _save_xla
self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/modeling_utils.py", line 1993, in save_pretrained
save_function(shard, os.path.join(save_directory, shard_file))
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1216, in save
cpu_data = _maybe_convert_to_cpu(data, convert=should_write_data)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1234, in _maybe_convert_to_cpu
return ToXlaTensorArena(convert_fn, select_fn).transform(data)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 429, in transform
self._convert()
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 401, in _convert
self._converted_tensors = self._convert_fn(self._tensors)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1225, in convert_fn
torch_xla._XLAC._xla_sync_multi(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
[[XRTExecute_G15]]
(1) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/__init__.py", line 133, in _prepare_to_exit
_XLAC._prepare_to_exit()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
[[XRTExecute_G15]]
(1) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/__init__.py", line 133, in _prepare_to_exit
_XLAC._prepare_to_exit()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
(0) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
[[XRTExecute_G15]]
(1) INTERNAL: neuronx-cc compilation failed.
[[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9559) of binary: /home/ec2-user/finetune/aws_neuron_venv_pytorch/bin/python
Traceback (most recent call last):
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-11-09_21:39:57
host : ip-10-0-255-91.ec2.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 9560)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-09_21:39:57
host : ip-10-0-255-91.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 9559)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
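(Side note: the compiler failure can usually be isolated from the training run by re-issuing the compiler command printed in the log against the dumped HLO file. The paths below are the ones from this particular run; they live under a temporary work directory and may already have been cleaned up, so this is only a sketch of the approach:)

neuronx-cc --target=trn1 compile --framework XLA /tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb --output /tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.neff --verbose=35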
Hi greenguy33 - this looks similar to another issue we are debugging; we are working on a potential fix.
@greenguy33 is this still an issue? Looks like there hasn't been any activity here for nearly a year.
@0x6b64 sorry, I'm not using this library at all anymore. I'd recommend contacting the developers if you're running into this.
Closing this issue since there has been little activity in nearly a year.
Here's a strange error I keep getting that I haven't seen anyone else post yet. I'm running the causal language modeling fine-tuning script from the optimum-neuron examples page on a trn1.2xlarge instance. I haven't made any changes to the script, but I am using my own custom dataset. The training quickly errors out with the following stack trace:
Here is the command I use to call the script:
torchrun run_clm.py --model_name_or_path roberta-base --train_file training_data/training_data_toy.csv --validation_file training_data/eval_data_toy.csv --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --output_dir test_clm --block_size 10
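(For reference, a debug run can be made by prefixing the environment variable to the same command; the env-var prefix is one standard shell way to set it per invocation, and the flags simply mirror the command above:)

NEURON_FRAMEWORK_DEBUG=1 torchrun run_clm.py --model_name_or_path roberta-base --train_file training_data/training_data_toy.csv --validation_file training_data/eval_data_toy.csv --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --output_dir test_clm --block_size 10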
After setting NEURON_FRAMEWORK_DEBUG=1, I notice that the .neff file mentioned in the stack trace is indeed missing. These are the files output by the debug run; you can see that there is no MODULE_9_SyncTensorsGraph .neff file present.

I don't think there should be anything unusual about my environment, as I am using the pre-packaged Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04). However, I have posted the versions of the relevant libraries below, in case something is amiss: