I don't know if it's helpful, but if I try to run the command from the 'Command line' line of the error logs, I get a segmentation fault:
$ /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
/home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
INFO:Finished Reading HHIR Json file...
INFO:Started Construction of IRS Graph...
INFO:Finished Construction of IRS Graph...
INFO:Reading Tensor Map Json file...
INFO:Finished Reading Tensor Map Json file...
INFO:Started Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Finished Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Started Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO (IRSInterface::Total Ops) :0
INFO (IRSInterface::CountDanglingNodes) :
Number of dangling nodes = 0
Memory Usage of dangling nodes = 0 Bytes
INFO (IRSInterface::CountElemwiseOp) :
Number of elemwise ops = 0
INFO (IRSInterface::Average Fanouts of ElemwiseOp) :-nan
INFO (IRSInterface::Average Number of ElemwiseOp Consumer) :-nan
INFO (IRSInterface::CountSingleConsumerElemwiseOp) :
Number of single consumer elemwise ops = 0
INFO (IRSInterface::Num TTs with TT Srcs): 0
INFO (IRSInterface::Num TTs with MM Srcs): 0
INFO (IRSInterface::Num TTs with TT AND MM Srcs): 0
INFO (IRSInterface::Num TTs with MM AND MM Srcs): 0
INFO (IRSInterface::Average Partition Usage of MM) : -nan
INFO:Finished Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Started ComputeProximity...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Finished ComputeProximity...Thu Feb 27 23:14:55 2020
Initializing RT...
INFO (Tensor Init): Adding Data Dependency from the init lists of DNs
INFO:Starting Scheduling...Thu Feb 27 23:14:55 2020
Segmentation fault (core dumped)
Hello duckontheweb,
I'm sorry for the inconvenience; we've opened a ticket internally to track this issue. Before we can do much, we'll need more information. If you're able to share your model, that's the fastest way for us to reproduce the issue. If the model is sensitive IP, consider opening an AWS support ticket and sharing it there.
In the meantime, I'd suggest configuring your system to dump core files and looking for hints in the resulting stack trace.
ulimit -c unlimited        # Turn on core files
# Run the failing program again. A 'core.xxxx' file should be produced.
file core.xxxx             # Check which command produced the core file.
gdb <command> core.xxxx    # Fire up GDB and open the core file.
bt                         # Look at the stack trace for hints.
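If it's easier to drive this from Python, the same idea (raise the core file limit, then re-run the failing command and check how it exited) looks roughly like the sketch below; the flag list is trimmed for brevity, so substitute the full command from your log:
import resource
import subprocess

# Equivalent of 'ulimit -c unlimited': raise the soft core-file limit up to
# the current hard limit for this process and its children.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

# Re-run the failing scheduler command (flags trimmed; use the full set from the log).
cmd = [
    "/home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch",
    "--hhir", "hh-tr-external-move.json",
    "--verbose", "0",
]
result = subprocess.run(cmd)
# A negative return code means the child was killed by a signal; -11 is SIGSEGV.
print("return code:", result.returncode)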
I've found that oftentimes ABI incompatibilities can result in segmentation faults like this, especially for binaries embedded in Python wheels. Reinstalling Python dependencies from source can help in these cases.
pip install --force-reinstall --no-binary :all: <dependency>
Hopefully this helps a little.
Regards, Taylor
Hello again duckontheweb,
I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.
Regards, Taylor
@aws-taylor Thanks for the quick reply! I'll follow your suggestions for neuron-cc.
I'm also going to try re-training our model using the PyTorch version that comes with torch_neuron and compiling everything in the same script to see if that helps. I'll let you know what I find.
The model is IP, as you mentioned, so if those steps don't work I'll open up a support ticket and pursue it there. Should I reference this issue in any way in the support ticket?
Hello duckontheweb,
>> Should I reference this issue in any way in the support ticket?
Yes, please reference this issue in any support ticket to ensure it is routed correctly.
-Taylor
Thanks.
So, for my latest attempt, I tried re-training the model in an environment with all of the Neuron libraries installed. The training went fine, but then I get this error when trying to run torch.neuron.trace:
Traceback (most recent call last):
File "model/train_model.py", line 360, in <module>
example_inputs=[dummy_image]
File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 150, in trace
transform_torch_graph_to_tensorflow( func, example_inputs, args, kwargs )
File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 288, in transform_torch_graph_to_tensorflow
input_calls_map = get_input_calls_map(jit_trace.graph, example_inputs)
File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 780, in get_input_calls_map
func = _resolve_func(node)
File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 957, in _resolve_func
assert hasattr(module, func_name), "Neuron compile failed. Operator {}::{} is not supported".format(mod_name,func_name)
AssertionError: Neuron compile failed. Operator prim::PythonOp is not supported
Is that an indication that we're just using an unsupported model architecture?
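For reference, my (unverified) understanding is that prim::PythonOp usually comes from a custom torch.autograd.Function somewhere in the forward pass, which the tracer records as an opaque Python call rather than regular ATen ops. A minimal illustration with hypothetical names, not our actual model:
import torch

# A custom autograd Function; calls to MyFunc.apply are recorded as
# PythonOp nodes in a traced graph because the tracer cannot see inside
# the Python-level forward/backward.
class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class Net(torch.nn.Module):
    def forward(self, x):
        return MyFunc.apply(x)

traced = torch.jit.trace(Net(), torch.randn(1, 3))
print(traced.graph)  # the MyFunc call shows up as a PythonOp node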
>> I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.
I've been installing neuron-cc using:
$ pip install -U pip
$ pip install neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com
Is there a specific newer version that I should try to get for Python 3.6, or is it better to install it via apt?
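For what it's worth, this is how I've been double-checking which version actually ends up installed (assuming the pip package metadata matches what the compiler reports):
import pkg_resources

# Version of the neuron-cc wheel that pip installed into this environment.
print(pkg_resources.get_distribution("neuron-cc").version)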
You may need to pass the --upgrade flag and possibly the --force-reinstall flag since you already have the software installed. Pip should be sufficient.
-Taylor
Thanks. I've been installing this on a clean EC2 instance each time, so the flags didn't seem to have any effect.
I was able to make some progress. I realized that I had forgotten the --no-deps option on torchvision the last time around. Re-installing all of that led to a successful compilation of one of the models! For some reason it is not recognizing the CUDA version, though:
import torch
import torch_neuron
torch.__version__
# '1.3.0.1.0.90.0'
torch.version.cuda
# None
torch.cuda.device_count()
# 0
The default CUDA version for this AMI is 10.0, I'll play around with it and see if I have a mismatch in versions or something.
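My working assumption (not verified) is that the torch build that ships alongside torch_neuron may simply be compiled without CUDA support, in which case the CUDA toolkit on the AMI wouldn't matter. A quick check:
import torch
import torch_neuron  # noqa: F401

# For a CPU-only torch build, torch.version.cuda is None and CUDA is
# reported as unavailable regardless of what toolkit is on the machine.
print(torch.version.cuda)
print(torch.cuda.is_available())
print(torch.cuda.device_count())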
So, it looks like I'm able to compile the model as long as I train it using the environment that I've set up for Neuron. This should help us get a little farther along. Thanks for the help!
Great - thanks for the info! Will close this. Feel free to reopen if you have any more issues.
I have been attempting to compile an existing, pre-trained PyTorch model using neuron-cc on a c5n.4xlarge instance. I'm loading the model from an existing checkpoint and then attempting to compile it in Python 3.6 using torch.neuron.trace according to the docs here. The compilation is failing with the error log below. Any suggestions on how to troubleshoot this?
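For reference, the compile step looks roughly like the sketch below; the tiny stand-in model, checkpoint path, and input shape are placeholders, not our actual code:
import torch
import torch_neuron  # noqa: F401 -- makes torch.neuron available

# Stand-in for the real model; in the actual script the architecture is
# constructed and the pre-trained checkpoint is loaded, e.g.:
#   model.load_state_dict(torch.load("checkpoint.pth", map_location="cpu"))
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
)
model.eval()

# Dummy input matching the expected image shape (shape is a placeholder).
dummy_image = torch.zeros(1, 3, 224, 224)

# Compile for Inferentia and save the compiled artifact.
neuron_model = torch.neuron.trace(model, example_inputs=[dummy_image])
neuron_model.save("model_neuron.pt")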
Thanks in advance!
Error Output