aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

Error Finetuning Huggingface model - NEFF Not Found #781

Closed: greenguy33 closed this issue 3 weeks ago

greenguy33 commented 1 year ago

Here's a strange error I keep getting that I haven't seen anyone post about yet. I'm running the causal language model finetuning script from the optimum-neuron examples page on a trn1.2xlarge instance. I haven't made any changes to the script, but I am using my own custom dataset. Training quickly errors out with the following stack trace:

 17%|██████████████████████                                                                                                              | 1/6 [01:04<05:21, 64.35s/it]2023-11-02 20:19:29.000859: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpgp_fxpfv/neuron-compile-cache
..........Floating point exception (core dumped)
2023-11-02 20:22:52.000307: INFO ||NCC_WRAPPER||: Compilation failed for /tmp/neuroncc_compile_workdir/93d1a5eb-bf3c-4fb5-97a8-acb8c2043b65/model.MODULE_4109439366224317951+d41d8cd9.hlo.pb after 0 retries.
2023-11-02 20:22:52.000307: INFO ||NCC_WRAPPER||: Compilation failed after reaching max retries.
2023-11-02 20:22:52.333417: E tensorflow/libtpu/neuron/neuron_compiler.cc:216] NEURONPOC : Unable to delete temp file /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff
2023-11-02 20:22:52.333460: E tensorflow/libtpu/neuron/neuron_compiler.cc:371] NEURONPOC: Could not read NEFF from MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff - Status : NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
2023-11-02 20:22:52.443697: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
2023-11-02 20:22:52.537789: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-02 20:22:52.537820: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-02 20:22:52.537824: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    tsl::CurrentStackTrace()
2023-11-02 20:22:52.537827: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-02 20:22:52.537837: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-02 20:22:52.537841: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-02 20:22:52.537846: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-02 20:22:52.537852: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-02 20:22:52.537855: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-02 20:22:52.537860: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-02 20:22:52.537863: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-02 20:22:52.537869: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-02 20:22:52.537871: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-11-02 20:22:52.537877: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: NOT_FOUND: From /job:localservice/replica:0/task:0:
2023-11-02 20:22:52.537880: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-02 20:22:52.537886: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
2023-11-02 20:22:52.537894: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-11-02 20:22:52.537897: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[XRTExecute_G15]]
2023-11-02 20:22:52.537903: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
2023-11-02 20:22:52.537906: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-11-02 20:22:52.537909: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-02 20:22:52.537912: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-02 20:22:52.537917: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-02 20:22:52.537920: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
Traceback (most recent call last):
  File "/home/ubuntu/subclass_task/reinforcement_learning/run_clm.py", line 644, in <module>
    main()
  File "/home/ubuntu/subclass_task/reinforcement_learning/run_clm.py", line 591, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/patching.py", line 180, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 439, in _inner_training_loop
    return super()._inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1838, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2693, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 287, in compute_loss
    self.trigger_on_step_middle_for_neuron_cache_callback(model)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 252, in trigger_on_step_middle_for_neuron_cache_callback
    callback.on_step_middle(self.args, self.state, self.control, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainer_callback.py", line 302, in on_step_middle
    self.neuron_hash_for_model(args, model, state.last_inputs, try_to_fetch_cached_model=True)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainer_callback.py", line 229, in neuron_hash_for_model
    neuron_hash = NeuronHash(*key_args, **key_kwargs)
  File "<string>", line 13, in __init__
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/cache_utils.py", line 599, in __post_init__
    self.compute_hash(model)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/cache_utils.py", line 634, in compute_hash
    model_hash = self.compute_sha512_hash(self.state_dict_to_bytes(model.state_dict()))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/cache_utils.py", line 619, in state_dict_to_bytes
    np.save(memfile, tensor.to(cast_to_mapping.get(tensor.dtype, tensor.dtype)).cpu().numpy())
RuntimeError: NOT_FOUND: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
     [[{{node XRTExecute}}]]
     [[XRTExecute_G15]]
  (1) NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
Exception ignored in atexit callback: <function _prepare_to_exit at 0x7f5bb83656c0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py", line 133, in _prepare_to_exit
    _XLAC._prepare_to_exit()
RuntimeError: NOT_FOUND: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
     [[{{node XRTExecute}}]]
     [[XRTExecute_G15]]
  (1) NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_9_SyncTensorsGraph.13400_4109439366224317951_ip-10-0-0-135-d91cfe7b-5791-609311ae7f58d.neff; No such file or directory
 17%|█████████████████████▊                                                                                                             | 1/6 [04:27<22:19, 267.86s/it]
.ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5743) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-02_20:22:57
  host      : ip-10-0-0-135.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5743)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Here is the command I use to call the script:

torchrun run_clm.py --model_name_or_path roberta-base --train_file training_data/training_data_toy.csv --validation_file training_data/eval_data_toy.csv --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --output_dir test_clm --block_size 10

After setting NEURON_FRAMEWORK_DEBUG=1, I can confirm that the .neff file mentioned in the stack trace is indeed missing. These are the files output by the debug run; you can see that there is no .neff file for MODULE_9_SyncTensorsGraph (a short check for missing NEFFs is sketched after the list).

MODULE_0_SyncTensorsGraph.209_7180866674807936198.hlo.pb
MODULE_0_SyncTensorsGraph.209_7180866674807936198.neff
MODULE_0_SyncTensorsGraph.209_7180866674807936198.pbtxt
MODULE_0_SyncTensorsGraph.209_7180866674807936198_compile_flags.txt
MODULE_0_SyncTensorsGraph.3_18108312931891040135.hlo.pb
MODULE_0_SyncTensorsGraph.3_18108312931891040135.neff
MODULE_0_SyncTensorsGraph.3_18108312931891040135.pbtxt
MODULE_0_SyncTensorsGraph.3_18108312931891040135_compile_flags.txt
MODULE_1_SyncTensorsGraph.25340_15578449491978418824.hlo.pb
MODULE_1_SyncTensorsGraph.25340_15578449491978418824.neff
MODULE_1_SyncTensorsGraph.25340_15578449491978418824.pbtxt
MODULE_1_SyncTensorsGraph.25340_15578449491978418824_compile_flags.txt
MODULE_1_SyncTensorsGraph.3_15990296087481957400.hlo.pb
MODULE_1_SyncTensorsGraph.3_15990296087481957400.neff
MODULE_1_SyncTensorsGraph.3_15990296087481957400.pbtxt
MODULE_1_SyncTensorsGraph.3_15990296087481957400_compile_flags.txt
MODULE_2_SyncTensorsGraph.25340_18241118073218349609.hlo.pb
MODULE_2_SyncTensorsGraph.25340_18241118073218349609.neff
MODULE_2_SyncTensorsGraph.25340_18241118073218349609.pbtxt
MODULE_2_SyncTensorsGraph.25340_18241118073218349609_compile_flags.txt
MODULE_2_SyncTensorsGraph.3_526876084215800545.hlo.pb
MODULE_2_SyncTensorsGraph.3_526876084215800545.neff
MODULE_2_SyncTensorsGraph.3_526876084215800545.pbtxt
MODULE_2_SyncTensorsGraph.3_526876084215800545_compile_flags.txt
MODULE_3_SyncTensorsGraph.25340_6223395303583567286.hlo.pb
MODULE_3_SyncTensorsGraph.25340_6223395303583567286.neff
MODULE_3_SyncTensorsGraph.25340_6223395303583567286.pbtxt
MODULE_3_SyncTensorsGraph.25340_6223395303583567286_compile_flags.txt
MODULE_3_SyncTensorsGraph.3_11398946548965431583.hlo.pb
MODULE_3_SyncTensorsGraph.3_11398946548965431583.neff
MODULE_3_SyncTensorsGraph.3_11398946548965431583.pbtxt
MODULE_3_SyncTensorsGraph.3_11398946548965431583_compile_flags.txt
MODULE_4_SyncTensorsGraph.3_9290870109582453889.hlo.pb
MODULE_4_SyncTensorsGraph.3_9290870109582453889.neff
MODULE_4_SyncTensorsGraph.3_9290870109582453889.pbtxt
MODULE_4_SyncTensorsGraph.3_9290870109582453889_compile_flags.txt
MODULE_4_SyncTensorsGraph.5933_7142096159463396829.hlo.pb
MODULE_4_SyncTensorsGraph.5933_7142096159463396829.neff
MODULE_4_SyncTensorsGraph.5933_7142096159463396829.pbtxt
MODULE_4_SyncTensorsGraph.5933_7142096159463396829_compile_flags.txt
MODULE_5_SyncTensorsGraph.16485_2622872833963921831.hlo.pb
MODULE_5_SyncTensorsGraph.16485_2622872833963921831.pbtxt
MODULE_5_SyncTensorsGraph.16485_2622872833963921831_compile_flags.txt
MODULE_5_SyncTensorsGraph.3_13661006737930459022.hlo.pb
MODULE_5_SyncTensorsGraph.3_13661006737930459022.neff
MODULE_5_SyncTensorsGraph.3_13661006737930459022.pbtxt
MODULE_5_SyncTensorsGraph.3_13661006737930459022_compile_flags.txt
MODULE_6_SyncTensorsGraph.3_6534628959653098333.hlo.pb
MODULE_6_SyncTensorsGraph.3_6534628959653098333.neff
MODULE_6_SyncTensorsGraph.3_6534628959653098333.pbtxt
MODULE_6_SyncTensorsGraph.3_6534628959653098333_compile_flags.txt
MODULE_7_SyncTensorsGraph.3_4172853236024427751.hlo.pb
MODULE_7_SyncTensorsGraph.3_4172853236024427751.neff
MODULE_7_SyncTensorsGraph.3_4172853236024427751.pbtxt
MODULE_7_SyncTensorsGraph.3_4172853236024427751_compile_flags.txt
MODULE_8_SyncTensorsGraph.5_2715247842289578702.hlo.pb
MODULE_8_SyncTensorsGraph.5_2715247842289578702.neff
MODULE_8_SyncTensorsGraph.5_2715247842289578702.pbtxt
MODULE_8_SyncTensorsGraph.5_2715247842289578702_compile_flags.txt
MODULE_9_SyncTensorsGraph.13400_4109439366224317951.hlo.pb
MODULE_9_SyncTensorsGraph.13400_4109439366224317951.pbtxt
MODULE_9_SyncTensorsGraph.13400_4109439366224317951_compile_flags.txt
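
A minimal sketch for spotting this automatically: it scans a NEURON_FRAMEWORK_DEBUG output directory and reports modules whose .hlo.pb has no matching .neff. The directory path is an assumption and defaults to the current directory.

# Minimal sketch: report compiled modules that have an .hlo.pb but no matching
# .neff in the NEURON_FRAMEWORK_DEBUG output directory (path is an assumption).
import glob
import os

def find_missing_neffs(debug_dir="."):
    missing = []
    for hlo in glob.glob(os.path.join(debug_dir, "MODULE_*.hlo.pb")):
        stem = hlo[:-len(".hlo.pb")]  # strip the .hlo.pb suffix
        if not os.path.exists(stem + ".neff"):
            missing.append(os.path.basename(stem))
    return missing

if __name__ == "__main__":
    for name in find_missing_neffs():
        print(f"missing NEFF for: {name}")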

I don't think there should be anything unusual about my environment, as I am using the pre-packaged Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04). However, I have posted versions of relevant libraries below, in case something is amiss:

aws-neuronx-runtime-discovery=2.9
libneuronxla=0.5.326
neuronx-cc=2.7.0.40+f7c6cf2a3
neuronx-hwm=2.7.0.3+0092b9d34
optimum=1.13.2
optimum-neuron=0.0.11
torch=1.13.1
torch-neuronx=1.13.1.1.8.0
torch-xla=1.13.1+torchneuron7
torchvision=0.14.1
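
A minimal sketch for regenerating this list from Python, assuming the names above are the pip-installed distributions (anything installed outside pip, such as the Neuron driver, won't show up):

# Minimal sketch: print installed versions of the pip-installed Neuron-related
# packages listed above; importlib.metadata ships with Python 3.8+.
from importlib import metadata

for pkg in ["neuronx-cc", "libneuronxla", "torch-neuronx", "torch-xla",
            "torch", "torchvision", "optimum", "optimum-neuron"]:
    try:
        print(f"{pkg}={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}=<not installed>")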
aws-donkrets commented 12 months ago

Hi greenguy33 - I haven't looked specifically into your model but I noticed two things in your report:

  1. Your SDK version is 3-4 months old, so I would recommend updating to the latest release and trying your model again.
  2. While keeping up to date with the latest SDK is good practice, the NEFF Not Found message is accurate: the NEFF was likely never generated because of the earlier "Floating point exception (core dumped)" error, which may be coming from something in your environment that the SDK depends on.
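
To isolate the compiler failure from the training loop, one option is to re-run neuronx-cc by hand on the HLO that failed, mirroring the invocation the NCC wrapper logs. A minimal sketch, where the HLO filename is a placeholder taken from the debug directory listing above:

# Minimal sketch: invoke neuronx-cc directly on a saved HLO protobuf, using the
# same argument layout the NCC wrapper logs. The HLO path is a placeholder;
# point it at the module that failed in your own debug output.
import subprocess

hlo = "MODULE_9_SyncTensorsGraph.13400_4109439366224317951.hlo.pb"
cmd = [
    "neuronx-cc", "--target=trn1", "compile",
    "--framework", "XLA",
    hlo,
    "--output", hlo.replace(".hlo.pb", ".neff"),
    "--verbose=35",
]
result = subprocess.run(cmd, capture_output=True, text=True)
# A compiler crash (e.g. the floating point exception) shows up as a non-zero
# or negative return code; diagnostics are printed to stderr.
print("return code:", result.returncode)
print(result.stderr)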
greenguy33 commented 12 months ago

@aws-donkrets thanks! I can definitely try an upgrade, as I suspect this is a version incompatibility problem. Could you help me understand which of the numbers I posted above indicates the SDK version?

aws-donkrets commented 12 months ago

greenguy33 - The SDK is made up of various components. There is no single component number that directly gives the SDK version, but you can use the component versions to map back to an SDK release. In this case, I used your neuronx-cc=2.7.0.40+f7c6cf2a3 reference to match it to SDK 2.11, released in June 2023 (see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/content.html#neuron-2-11-0-06-14-2023). You can find the contents of previous releases here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/content.html#pre-release-content

greenguy33 commented 11 months ago

@aws-donkrets I have upgraded to the latest versions of these libraries using the resources at the link you provided. I am now getting a different error during training: "BIRCodegen does not support broadcast patterns, but found one in {0,+,0}[45]". Any idea what could be going on? Full stack trace below.

[INFO|trainers.py:378] 2023-11-09 21:35:33,643 >> Saving model checkpoint to test_clm
[INFO|configuration_utils.py:460] 2023-11-09 21:35:33,645 >> Configuration saved in test_clm/config.json
[INFO|configuration_utils.py:544] 2023-11-09 21:35:33,645 >> Configuration saved in test_clm/generation_config.json
2023-11-09 21:35:34.000903:  15709  INFO ||NEURON_CACHE||: Compile cache path: /tmp/tmp_b9balli/neuron-compile-cache
2023-11-09 21:35:34.000905:  15709  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.neff', '--verbose=35']
.............
2023-11-09 21:39:52.000474:  15709  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb after 0 retries.
Traceback (most recent call last):
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/bin/neuron_cc_wrapper", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 298, in main
    neuron_xla_compile(args.input_file, " ".join(unparsed_args), args.output, cache_key=args.cache_key,
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 268, in neuron_xla_compile
    ret = compile_with_cache(output, compile_cache, cache_key, execution_mode,
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 201, in compile_with_cache
    raise(e)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 181, in compile_with_cache
    ret = call_neuron_compiler(
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py", line 129, in call_neuron_compiler
    raise RuntimeError(f"Failed compilation with {cmd}: {res.stderr.decode()}")
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/11be0679-9e31-4756-b46e-82df545abfba/model.MODULE_5782661062090079200+d41d8cd9.neff', '--verbose=35']: 
2023-11-09T21:39:52Z BIRCodegen does not support broadcast patterns, but found one in {0,+,0}[45]

2023-11-09 21:39:52.779806: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.781583: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.908196: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-09 21:39:52.908242: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-09 21:39:52.908251: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    tsl::CurrentStackTrace()
2023-11-09 21:39:52.908255: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-09 21:39:52.908259: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-09 21:39:52.908264: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.908270: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-09 21:39:52.908276: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.908280: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.908284: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.908287: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    clone
2023-11-09 21:39:52.908290: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-09 21:39:52.908293: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-11-09 21:39:52.908297: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-11-09 21:39:52.908300: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-09 21:39:52.908306: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.908309: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-11-09 21:39:52.908313: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[XRTExecute_G15]]
2023-11-09 21:39:52.908319: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.908327: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-11-09 21:39:52.908331: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-09 21:39:52.908336: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-09 21:39:52.908341: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-09 21:39:52.908346: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
  File "run_clm.py", line 616, in <module>
2023-11-09 21:39:52.909526: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-09 21:39:52.909556: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-09 21:39:52.909564: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    tsl::CurrentStackTrace()
2023-11-09 21:39:52.909570: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-09 21:39:52.909575: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-09 21:39:52.909580: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.909583: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-09 21:39:52.909587: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.909590: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.909594: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    
2023-11-09 21:39:52.909597: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]    clone
2023-11-09 21:39:52.909603: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-09 21:39:52.909606: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-11-09 21:39:52.909609: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-11-09 21:39:52.909612: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-09 21:39:52.909616: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.909619: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-11-09 21:39:52.909623: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[XRTExecute_G15]]
2023-11-09 21:39:52.909627: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.909633: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]     [[{{node XRTExecute}}]]
2023-11-09 21:39:52.909636: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-09 21:39:52.909639: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-09 21:39:52.909642: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-09 21:39:52.909654: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-11-09 21:39:52.909659: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
  File "run_clm.py", line 616, in <module>
    main()
  File "run_clm.py", line 565, in main
    trainer.save_model()  # Saves the tokenizer too for easy upload
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 419, in save_model
    self._save_xla(output_dir)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 410, in _save_xla
    self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/modeling_utils.py", line 1993, in save_pretrained
    save_function(shard, os.path.join(save_directory, shard_file))
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1216, in save
    cpu_data = _maybe_convert_to_cpu(data, convert=should_write_data)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1234, in _maybe_convert_to_cpu
    return ToXlaTensorArena(convert_fn, select_fn).transform(data)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 429, in transform
    self._convert()
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 401, in _convert
    self._converted_tensors = self._convert_fn(self._tensors)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1225, in convert_fn
    torch_xla._XLAC._xla_sync_multi(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
     [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
    main()
  File "run_clm.py", line 565, in main
    trainer.save_model()  # Saves the tokenizer too for easy upload
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 419, in save_model
    self._save_xla(output_dir)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/optimum/neuron/trainers.py", line 410, in _save_xla
    self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/modeling_utils.py", line 1993, in save_pretrained
    save_function(shard, os.path.join(save_directory, shard_file))
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1216, in save
    cpu_data = _maybe_convert_to_cpu(data, convert=should_write_data)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1234, in _maybe_convert_to_cpu
    return ToXlaTensorArena(convert_fn, select_fn).transform(data)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 429, in transform
    self._convert()
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 401, in _convert
    self._converted_tensors = self._convert_fn(self._tensors)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 1225, in convert_fn
    torch_xla._XLAC._xla_sync_multi(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
     [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/__init__.py", line 133, in _prepare_to_exit
    _XLAC._prepare_to_exit()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
     [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/__init__.py", line 133, in _prepare_to_exit
    _XLAC._prepare_to_exit()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
     [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9559) of binary: /home/ec2-user/finetune/aws_neuron_venv_pytorch/bin/python
Traceback (most recent call last):
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/finetune/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-09_21:39:57
  host      : ip-10-0-255-91.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 9560)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-09_21:39:57
  host      : ip-10-0-255-91.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9559)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
aws-donkrets commented 10 months ago

Hi greenguy33 - this looks similar to another issue we are debugging; we are working on a potential fix.

0x6b64 commented 1 month ago

@greenguy33 is this still an issue? Looks like there hasn't been any activity here for nearly a year.

greenguy33 commented 1 month ago

@0x6b64 sorry, I'm not using this library at all anymore. I'd recommend contacting the developers if you're running into this.

aws-donkrets commented 3 weeks ago

Closing this issue since there has been little activity in nearly a year.