aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Support for Neuron with PyTorch Lightning #586

Closed. satyajitghana closed this issue 1 year ago.

satyajitghana commented 2 years ago

Hi,

I am trying to add support for Trn1 training to PyTorch Lightning. In theory this should work mostly out of the box, since PL already supports TPU training through XLA, although a few changes are needed. But I am now hitting the error below.

(aws_neuron_venv_pytorch) ubuntu@ip-172-31-76-169:~$ python train.py 
GPU available: False, used: False
TPU available: True, using: 1 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 55.1 K
-------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:  34%|████████████████████████                                               | 20/59 [00:01<00:03, 10.26it/s]2022-10-31 03:47:11.125990: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:11.192906: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] StackTrace:
2022-10-31 03:47:11.192939: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** Begin stack trace ***
2022-10-31 03:47:11.192942: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    tensorflow::CurrentStackTrace()
2022-10-31 03:47:11.192950: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_20211102::Span<xla::XlaComputation const* const>, absl::lts_20211102::Span<xla::Shape const* const>)
2022-10-31 03:47:11.192957: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::XrtComputationClient::CheckCompileStatus(tensorflow::Status const&, std::vector<xla::ComputationClient::CompileInstance, std::allocator<xla::ComputationClient::CompileInstance> > const&, xla::XrtComputationClient::SessionWork const&)
2022-10-31 03:47:11.192967: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:11.192973: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2022-10-31 03:47:11.192979: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:11.192982: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:11.192988: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:11.192994: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    clone
2022-10-31 03:47:11.193001: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** End stack trace ***
2022-10-31 03:47:11.193007: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-10-31 03:47:11.193013: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2022-10-31 03:47:11.193019: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 2 root error(s) found.
2022-10-31 03:47:11.193023: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (0) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:11.193029: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    [[{{node XRTCompile_32}}]]
2022-10-31 03:47:11.193033: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    [[XRTCompile_32_G3]]
2022-10-31 03:47:11.193039: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (1) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:11.193042: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    [[{{node XRTCompile_32}}]]
2022-10-31 03:47:11.193045: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 successful operations.
2022-10-31 03:47:11.193047: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 derived errors ignored.
2022-10-31 03:47:11.193054: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Recent warning and error logs:
2022-10-31 03:47:11.193064: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:12.896731: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:12.958159: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] StackTrace:
2022-10-31 03:47:12.958199: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** Begin stack trace ***
2022-10-31 03:47:12.958213: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    tensorflow::CurrentStackTrace()
2022-10-31 03:47:12.958222: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_20211102::Span<xla::XlaComputation const* const>, absl::lts_20211102::Span<xla::Shape const* const>)
2022-10-31 03:47:12.958233: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::XrtComputationClient::CheckCompileStatus(tensorflow::Status const&, std::vector<xla::ComputationClient::CompileInstance, std::allocator<xla::ComputationClient::CompileInstance> > const&, xla::XrtComputationClient::SessionWork const&)
2022-10-31 03:47:12.958240: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:12.958246: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2022-10-31 03:47:12.958252: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:12.958257: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:12.958261: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-10-31 03:47:12.958269: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    clone
2022-10-31 03:47:12.958273: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** End stack trace ***
2022-10-31 03:47:12.958280: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-10-31 03:47:12.958285: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2022-10-31 03:47:12.958314: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 2 root error(s) found.
2022-10-31 03:47:12.958318: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (0) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:12.958326: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    [[{{node XRTCompile_32}}]]
2022-10-31 03:47:12.958333: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    [[XRTCompile_32_G3]]
2022-10-31 03:47:12.958339: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (1) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:12.958345: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    [[{{node XRTCompile_32}}]]
2022-10-31 03:47:12.958367: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 successful operations.
2022-10-31 03:47:12.958380: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 derived errors ignored.
2022-10-31 03:47:12.958394: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Recent warning and error logs:
2022-10-31 03:47:12.958405: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   0 successful operations.
2022-10-31 03:47:12.958423: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   0 derived errors ignored.
2022-10-31 03:47:12.958429: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   Recent warning and error logs:
2022-10-31 03:47:12.958435: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]     OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
2022-10-31 03:47:12.958442: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 219, in advance
    self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/tqdm_progress.py", line 264, in on_train_batch_end
    self.main_progress_bar.set_postfix(self.get_metrics(trainer, pl_module))
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/base.py", line 243, in get_metrics
    standard_metrics = get_standard_metrics(trainer, pl_module)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/base.py", line 272, in get_standard_metrics
    avg_training_loss = running_train_loss.cpu().item()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
    [[{{node XRTCompile_32}}]]
    [[XRTCompile_32_G3]]
  (1) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
    [[{{node XRTCompile_32}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 103, in <module>
    trainer.fit(model, dm)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 664, in _call_and_handle_interrupt
    self._teardown()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1229, in _teardown
    self.strategy.teardown()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/strategies/single_tpu.py", line 86, in teardown
    super().teardown()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 472, in teardown
    optimizers_to_device(self.optimizers, torch.device("cpu"))
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 27, in optimizers_to_device
    optimizer_to_device(opt, device)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 33, in optimizer_to_device
    optimizer.state[p] = apply_to_collection(v, Tensor, move_data_to_device, device)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 107, in apply_to_collection
    v = apply_to_collection(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 358, in move_data_to_device
    return apply_to_collection(batch, dtype=dtype, function=batch_to)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 351, in batch_to
    data_output = data.to(device, **kwargs)
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
    [[{{node XRTCompile_32}}]]
    [[XRTCompile_32_G3]]
  (1) INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
    [[{{node XRTCompile_32}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  0 successful operations.
  0 derived errors ignored.
  Recent warning and error logs:
    OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)
  OP_REQUIRES failed at xrt_compile_ops.cc:216 : INTERNAL: Unknown custom-call API version enum value: 0 (API_VERSION_UNSPECIFIED)

The code is from https://pytorch-lightning.readthedocs.io/en/stable/notebooks/lightning_examples/mnist-tpu-training.html and I'm trying to use a single Trn1 core.
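For reference, the relevant part of my setup is essentially the tutorial's Trainer with the device count dropped to one. A minimal sketch (it assumes the tutorial's LitModel and MNISTDataModule, and that devices=1 is how a single NeuronCore gets selected):

# Sketch of the setup that produced the error above, adapted from the linked
# tutorial; LitModel and MNISTDataModule are the tutorial's own classes.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import TQDMProgressBar

dm = MNISTDataModule()
model = LitModel(*dm.dims, dm.num_classes)

trainer = Trainer(
    max_epochs=3,
    callbacks=[TQDMProgressBar(refresh_rate=20)],
    accelerator="tpu",  # Lightning's XLA path, which the Neuron SDK also uses
    devices=1,          # request a single core
)
trainer.fit(model, dm)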

jluntamazon commented 2 years ago

Hello @satyajitghana, could you confirm which versions of the software you have installed by posting the output of pip list?
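If it's more convenient, a small Python snippet can pull just the versions that matter here (the package names below are only a guess at the relevant ones):

# Hypothetical helper: print the versions of the packages most likely relevant
# to this issue. Which packages to check is an assumption, not an exact list.
import importlib.metadata as metadata

for pkg in ("torch", "torch-xla", "torch-neuronx", "neuronx-cc", "pytorch-lightning"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")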

satyajitghana commented 1 year ago

Sorry, I didn't try again after that. I think I had also tried the 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training-neuron:1.11.0-neuron-py38-sdk2.4.0-ubuntu20.04 image, and it gave the same error.

mrnikwaws commented 1 year ago

Hi @satyajitghana,

I was able to reproduce your error. You can try with this modification (devices=[1]):

# Init DataModule
dm = MNISTDataModule()
# Init model from datamodule's attributes
model = LitModel(*dm.dims, dm.num_classes)
# Init trainer
trainer = Trainer(
    max_epochs=3,
    callbacks=[TQDMProgressBar(refresh_rate=20)],
    accelerator="tpu",
    devices=[1],
)

which may get you a little further. Ideally we'll also diagnose why this change is needed. I eventually hit a compilation error which we will need to investigate separately; however, your real use case may not encounter this problem if you are working with a different model.
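If it helps while you are debugging, the device indices Lightning sees can be inspected directly through torch-xla. This is only a sketch and assumes the Neuron torch-xla build exposes the usual xla_model helpers:

# Sketch: list the XLA devices the runtime exposes, to see which index the
# Trainer's devices=[...] argument should point at.
import torch_xla.core.xla_model as xm

print(xm.get_xla_supported_devices())  # e.g. ['xla:0', 'xla:1', ...]
print(xm.xla_device())                 # the default device chosen when devices=1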

Since you seem to have set this aside for now, I'll plan to close this issue unless you are still actively pursuing Lightning support.

awsrjh commented 1 year ago

Closing inactive ticket. Please reopen if still needed.