aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Failure on neuron-cc compilation when a nn model is moved to Neuron device #872

Open wfckl789 opened 2 months ago

wfckl789 commented 2 months ago

Hi, I found that the Neuron compiler crashes when compiling my customized model on a Neuron device. Could you help check this issue? Thanks so much!

Below are my sample model and training process:

import torch
import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm


class SimpleLinearModel(nn.Module):
    def __init__(self, args):
        super(SimpleLinearModel, self).__init__()
        self.encoder = nn.Embedding(args.vocab_size, args.embedding_dim)
        self.linear_0 = nn.Linear(args.embedding_dim, args.vocab_size)
        self.linear_1 = nn.Linear(args.vocab_size, args.vocab_size)
        self.linear_2 = nn.Linear(args.vocab_size, args.vocab_size)
        self.linear_3 = nn.Linear(args.vocab_size, 1)

    def forward(self, text):
        embedded = self.encoder(text)
        predict_1 = self.linear_0(embedded)
        predict_2 = self.linear_1(predict_1)
        predict_3 = self.linear_2(predict_2)
        model_out = self.linear_3(predict_3)
        model_out = model_out.mean()
        return model_out

# Toy input: five rows of four vocabulary indices each
matrix_2d = torch.tensor([
    [0, 0, 0, 0],
    [16, 16, 16, 16],
    [32, 32, 32, 32],
    [64, 64, 64, 64],
    [127, 127, 127, 127]
])

# Train model on xla device
def train_model():
    device = xm.xla_device()
    model = SimpleLinearModel(args)
    model = model.to(device)
    optimizer = optim.AdamW(model.parameters())

    for epoch in range(0, 2):
        for idx, text in enumerate(matrix_2d):
            text = text.to(device)
            optimizer.zero_grad()  # clear accumulated gradients before each step
            output = model(text)
            loss = output * 2
            loss.backward()
            print(f"Epoch {epoch}, Data index: {idx}, Loss: {loss}")
            optimizer.step()
            xm.mark_step()
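
For completeness, the driver is roughly the sketch below (the actual run_simple_model.py parses command-line arguments, so details may differ); the config values match the "model config" line printed in the crash log:

# Hypothetical driver sketch; the real run_simple_model.py may differ
import argparse

if __name__ == "__main__":
    # Values taken from the "model config" line in the log below
    args = argparse.Namespace(embedding_dim=256, max_steps=2, seed=1234, vocab_size=128)
    torch.manual_seed(args.seed)
    print("model config", args)
    train_model()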

Crash log:

model config Namespace(embedding_dim=256, max_steps=2, seed=1234, vocab_size=128)
2024-04-12 07:13:06.000296:  271334  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-04-12 07:13:06.000297:  271334  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/50f29840-cfac-49a2-8d40-8b0369b8618f/model.MODULE_16518787498567201059+d41d8cd9.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/50f29840-cfac-49a2-8d40-8b0369b8618f/model.MODULE_16518787498567201059+d41d8cd9.neff --verbose=35
.
Compiler status PASS
Epoch 0, Data index: 0, Loss: 0.1943359375
2024-04-12 07:13:08.000245:  271334  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-04-12 07:13:08.000246:  271334  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.neff --verbose=35
.root = /usr/lib/python3.8/multiprocessing/process.py
root = /usr/lib/python3.8/multiprocessing
root = /usr/lib/python3.8
root = /usr/lib
root = /usr

2024-04-12 07:13:10.000175:  271334  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.neff', '--verbose=35']: 2024-04-12T07:13:10Z [TEN404] Internal tensorizer error - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new

2024-04-12 07:13:10.000175:  271334  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.hlo_module.pb after 0 retries.
2024-04-12 07:13:10.203311: F ./torch_xla/csrc/runtime/debug_macros.h:20] Non-OK-status: status.status() status: INTERNAL: RunNeuronCCImpl: error condition error != 0: <class 'subprocess.CalledProcessError'>: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/889051d2-62ba-41fa-887f-e9651545b6ee/model.MODULE_12582736310589149268+d41d8cd9.neff', '--verbose=35']' returned non-zero exit status 70.
*** Begin stack trace ***
    tsl::CurrentStackTrace()
    std::unique_ptr<xla::PjRtLoadedExecutable, std::default_delete<xla::PjRtLoadedExecutable> > ConsumeValue<std::unique_ptr<xla::PjRtLoadedExecutable, std::default_delete<xla::PjRtLoadedExecutable> > >(absl::lts_20230125::StatusOr<std::unique_ptr<xla::PjRtLoadedExecutable, std::default_delete<xla::PjRtLoadedExecutable> > >&&)
    torch_xla::runtime::PjRtComputationClient::Compile(std::vector<torch_xla::runtime::ComputationClient::CompileInstance, std::allocator<torch_xla::runtime::ComputationClient::CompileInstance> >)
    torch_xla::XLAGraphExecutor::Compile(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > > const&, absl::lts_20230125::Span<std::string const>, torch::lazy::LazyGraphExecutor::SyncTensorCollection const&, torch::lazy::LazyGraphExecutor::PostOrderData*, std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&)
    torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230125::Span<std::string const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
    torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230125::Span<std::string const>, bool, bool, bool)
    torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::string>, bool)

    PyCFunction_Call
    _PyObject_MakeTpCall
    _PyEval_EvalFrameDefault

    _PyEval_EvalFrameDefault
    _PyFunction_Vectorcall
    _PyEval_EvalFrameDefault
    _PyEval_EvalCodeWithName
    PyEval_EvalCode

    PyRun_SimpleFileExFlags
    Py_RunMain
    Py_BytesMain
    __libc_start_main
    _start
*** End stack trace ***

[2024-04-12 07:13:15,727] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 271334) of binary: /home/ubuntu/qwb_venv_pytorch/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/qwb_venv_pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=========================================================
run_simple_model.py FAILED

Environment information:

EC2 Instance: trn1.32xlarge

OS: Ubuntu 20.04

Neuron PyTorch: latest (2.18)

wfckl789 commented 2 months ago

I guess this is because neuron-cc doesn't support some tensor operations from nn.Embedding()? I ran some comparative experiments and found that replacing all layers with nn.Linear() works fine, but keeping the nn.Embedding() layer still triggers the crash.
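
Roughly, the Linear-only variant in my comparison looks like the sketch below (assuming one-hot float inputs stand in for the integer token indices that nn.Embedding consumes); keeping everything else identical, this version compiles and trains, while the nn.Embedding version crashes as above:

# Sketch of the Linear-only variant (one-hot inputs replace the embedding lookup)
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearOnlyModel(nn.Module):
    def __init__(self, args):
        super().__init__()
        # A linear projection over one-hot vectors takes the place of nn.Embedding
        self.encoder = nn.Linear(args.vocab_size, args.embedding_dim)
        self.linear_0 = nn.Linear(args.embedding_dim, args.vocab_size)
        self.linear_1 = nn.Linear(args.vocab_size, args.vocab_size)
        self.linear_2 = nn.Linear(args.vocab_size, args.vocab_size)
        self.linear_3 = nn.Linear(args.vocab_size, 1)

    def forward(self, text):
        # Convert integer token ids to one-hot floats so nn.Linear can consume them
        one_hot = F.one_hot(text, num_classes=self.encoder.in_features).float()
        embedded = self.encoder(one_hot)
        predict_1 = self.linear_0(embedded)
        predict_2 = self.linear_1(predict_1)
        predict_3 = self.linear_2(predict_2)
        return self.linear_3(predict_3).mean()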

jluntamazon commented 2 months ago

Hi @wfckl789, thank you for the example!

We were able to reproduce the problem and are looking into making a fix.

This is an internal compiler error that appears to occur with this specific sequence of instructions. We'll update here when a fix is available.