aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Problems compiling existing PyTorch model #86

Closed duckontheweb closed 4 years ago

duckontheweb commented 4 years ago

I have been attempting to compile an existing, pre-trained PyTorch model with neuron-cc on a c5n.4xlarge instance. I'm loading the model from a checkpoint and then compiling it in Python 3.6 with torch.neuron.trace, following the docs here.
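
For reference, the call I'm making looks roughly like the sketch below (the model class and checkpoint path are simplified placeholders; the input shape matches the io-config in the error log):

import torch
import torch_neuron  # registers torch.neuron

# Placeholder: load the pre-trained model from its checkpoint
model = MyModel()
model.load_state_dict(torch.load('checkpoint.pth', map_location='cpu'))
model.eval()

# Dummy input matching the shape in the io-config ([1, 3, 256, 256], float32)
dummy_image = torch.rand(1, 3, 256, 256)

# Compile for Inferentia
model_neuron = torch.neuron.trace(model, example_inputs=[dummy_image])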

The compilation is failing with the error log below. Any suggestions on how to troubleshoot this?

Thanks in advance!

Error Output

...
ERROR (Spiller): Same live vertices kept previously have not been freed for long!
list_sch: /opt/amazon/neuroncc/starfish/fast_sch/mem_alloc/spill/MemorySpillDuringSchedule.cpp:352: void MemorySpillDuringSchedule::CheckSpillerWork(std::set<long unsigned int>&, MemoryBase&): Assertion `0' failed.
Aborted (core dumped)
02/27/2020 10:47:58 PM ERROR [neuron-cc]: ***************************************************************
02/27/2020 10:47:58 PM ERROR [neuron-cc]:  An Internal Compiler Error has occurred
02/27/2020 10:47:58 PM ERROR [neuron-cc]: ***************************************************************
02/27/2020 10:47:58 PM ERROR [neuron-cc]:
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Please contact Customer Support and provide the following details.
02/27/2020 10:47:58 PM ERROR [neuron-cc]:
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Error message:  Non-zero exit status (134) for command: /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
02/27/2020 10:47:58 PM ERROR [neuron-cc]:
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Error class:    CompilerInternalError
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Error location: job.Scheduler.4
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Command line:   /home/ubuntu/test_venv/bin/neuron-cc compile /tmp/tmp62j2gg8w/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp62j2gg8w/graph_def.neff --io-config '{"inputs": {"input.1:0": [[1, 3, 256, 256], "float32"]}, "outputs": ["BiasAdd:0"]}'
02/27/2020 10:47:58 PM ERROR [neuron-cc]:
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Internal details:
02/27/2020 10:47:58 PM ERROR [neuron-cc]:   File "neuroncc/driver/Job.py", line 207, in neuroncc.driver.Job.runSingleInputFn
02/27/2020 10:47:58 PM ERROR [neuron-cc]:   File "neuroncc/driver/jobs/Scheduler.py", line 59, in neuroncc.driver.jobs.Scheduler.Scheduler.runSingleInput
02/27/2020 10:47:58 PM ERROR [neuron-cc]:   File "neuroncc/driver/Job.py", line 145, in neuroncc.driver.Job.Job.shellCommand
02/27/2020 10:47:58 PM ERROR [neuron-cc]:
02/27/2020 10:47:58 PM ERROR [neuron-cc]: Version information:
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   Neuron Compiler version 1.0.6801.0+6001944336
02/27/2020 10:47:59 PM ERROR [neuron-cc]:
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   HWM version 1.0.839.0-6001300654
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   NEFF version 0.6
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   TVM version 1.0.1619.0+6001909371
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   NumPy version 1.17.2
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   MXNet not available
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   TF version 1.15.0
02/27/2020 10:47:59 PM ERROR [neuron-cc]:   ONNX not available
02/27/2020 10:47:59 PM ERROR [neuron-cc]:
02/27/2020 10:47:59 PM ERROR [neuron-cc]: Artifacts stored in: /home/ubuntu
duckontheweb commented 4 years ago

I don't know if it's helpful, but if I run the command from the "Command line:" line of the error log directly, I get a segmentation fault:

$ /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
/home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
INFO:Finished Reading HHIR Json file...
INFO:Started Construction of IRS Graph...
INFO:Finished Construction of IRS Graph...
INFO:Reading Tensor Map Json file...
INFO:Finished Reading Tensor Map Json file...
INFO:Started Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Finished Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Started Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO (IRSInterface::Total Ops) :0
INFO (IRSInterface::CountDanglingNodes) :
    Number of dangling nodes = 0
    Memory Usage of dangling nodes = 0 Bytes
INFO (IRSInterface::CountElemwiseOp) :
    Number of elemwise ops = 0
INFO (IRSInterface::Average Fanouts of ElemwiseOp) :-nan
INFO (IRSInterface::Average Number of ElemwiseOp Consumer) :-nan
INFO (IRSInterface::CountSingleConsumerElemwiseOp) :
    Number of single consumer elemwise ops = 0
INFO (IRSInterface::Num TTs with TT Srcs): 0
INFO (IRSInterface::Num TTs with MM Srcs): 0
INFO (IRSInterface::Num TTs with TT AND MM Srcs): 0
INFO (IRSInterface::Num TTs with MM AND MM Srcs): 0
INFO (IRSInterface::Average Partition Usage of MM) : -nan
INFO:Finished Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Started ComputeProximity...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Finished ComputeProximity...Thu Feb 27 23:14:55 2020
Initializing RT...
INFO (Tensor Init): Adding Data Dependency from the init lists of DNs
INFO:Starting Scheduling...Thu Feb 27 23:14:55 2020
Segmentation fault (core dumped)
aws-taylor commented 4 years ago

Hello duckontheweb,

I'm sorry for the inconvenience; we've opened a ticket internally to track this issue. Before we can do much, we'll need more information. If you're able to share your model, that's the fastest way for us to reproduce the issue. If you have sensitive IP, consider opening an AWS support ticket and sharing it there.

In the meantime, I'd suggest you configure your system to dump core files and look for hints in the resulting stack trace.

ulimit -c unlimited       # Turn on core files
# Run the program; a 'core.xxxx' file should be produced.
file core.xxxx            # Check which command created the core file
gdb <command> core.xxxx   # Fire up GDB to open the core file
bt                        # (inside gdb) Look at the stack trace for hints
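
Since you can reproduce the crash by invoking list_sch directly, another option is to run it under gdb and skip the core file; roughly (with the long argument list from the error log abbreviated):

gdb --args /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json ...
run   # (inside gdb) reproduce the segfault
bt    # (inside gdb) stack trace at the crash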

I've found that ABI incompatibilities can often result in segmentation faults like this, especially for binaries embedded in Python wheels. Re-installing Python dependencies from source can help in these cases.

pip install --force-reinstall --no-binary :all: <dependency>

Hopefully this helps a little.

Regards, Taylor

aws-taylor commented 4 years ago

Hello again duckontheweb,

I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.

Regards, Taylor

duckontheweb commented 4 years ago

@aws-taylor Thanks for the quick reply! I'll follow your suggestions for inspecting a core dump and upgrading to the latest version.

I'm also going to try re-training our model using the PyTorch version that comes with torch_neuron and compiling all in the same script to see if that helps. I'll let you know what I find.

The model is sensitive IP, as you mentioned, so if those steps don't work I'll open a support ticket and pursue it there. Should I reference this issue in any way in the support ticket?

aws-taylor commented 4 years ago

Hello duckontheweb,

> Should I reference this issue in any way in the support ticket?

Yes, please reference this issue in any support ticket to ensure it is routed correctly.

-Taylor

duckontheweb commented 4 years ago

Thanks.

So, for my latest attempt, I re-trained the model in an environment with all of the Neuron libraries installed. The training went fine, but I get this error when running torch.neuron.trace:

Traceback (most recent call last):
  File "model/train_model.py", line 360, in <module>
    example_inputs=[dummy_image]
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 150, in trace
    transform_torch_graph_to_tensorflow( func, example_inputs, args, kwargs )
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 288, in transform_torch_graph_to_tensorflow
    input_calls_map = get_input_calls_map(jit_trace.graph, example_inputs)
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 780, in get_input_calls_map
    func = _resolve_func(node)
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 957, in _resolve_func
    assert hasattr(module, func_name), "Neuron compile failed.  Operator {}::{} is not supported".format(mod_name,func_name)
AssertionError: Neuron compile failed.  Operator prim::PythonOp is not supported

Is that an indication that we're just using an unsupported model architecture?
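
For what it's worth, from what I can tell prim::PythonOp nodes show up when the traced graph passes through a Python-level op such as a custom torch.autograd.Function. A minimal (made-up) example that produces one:

import torch

# Hypothetical custom autograd Function; anything like this in the model
# appears as a prim::PythonOp node in the traced graph.
class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad):
        return grad * 2

class Net(torch.nn.Module):
    def forward(self, x):
        return MyOp.apply(x)

traced = torch.jit.trace(Net(), torch.rand(1, 3))
print(traced.graph)  # should contain a prim::PythonOp node for MyOp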

duckontheweb commented 4 years ago

> I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.

I've been installing neuron-cc using:

$ pip install -U pip
$ pip install neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com

Is there a specific newer version that I should try to get for Python 3.6, or is it better to install it via apt?

aws-taylor commented 4 years ago

You may need to pass the --upgrade flag and possibly the --force-reinstall flag since you already have the software installed. Pip should be sufficient.
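
For example, something like this (neuron-cc shown; the same should apply to the torch-neuron package if you have it installed):

$ pip install --upgrade --force-reinstall neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com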

-Taylor

duckontheweb commented 4 years ago

Thanks. I've been installing this on a clean EC2 instance each time, so the flags didn't have any effect.

I was able to make some progress. I realized that I had forgotten the --no-deps option when installing torchvision the last time around. Re-installing everything led to a successful compilation of one of the models! For some reason it is not recognizing the CUDA version, though:

import torch
import torch_neuron

torch.__version__
# '1.3.0.1.0.90.0'

torch.version.cuda
# None 

torch.cuda.device_count()
# 0

The default CUDA version for this AMI is 10.0; I'll play around with it and see whether I have a version mismatch or something.

duckontheweb commented 4 years ago

So, it looks like I'm able to compile the model as long as I train it using the environment that I've set up for Neuron. This should help us get a little farther along. Thanks for the help!

awsrjh commented 4 years ago

Great, thanks for the info! I'll close this; feel free to reopen if you have any more issues.