ezyang / pytorch-unattached

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

JIT rebase #223

Closed ezyang closed 7 years ago

ezyang commented 7 years ago

This rebases JIT onto master; most notably, it rebases onto the Variable/ATen changes from @colesbury.


Below is obsolete.

Unfortunately, the rebase segfaults.

gdb --args python test/test_jit.py TestJit.test_backward_closure
Thread 3 "python" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1967234]
torch::autograd::EvalOutput::EvalOutput (this=0x7fffb4003478, next_edge=...)
    at /data/users/ezyang/pytorch/torch/csrc/autograd/functions/special.h:23
23          input_sizes.emplace_back(next_edge.first->input_sizes.at(next_edge.second));
#0  torch::autograd::EvalOutput::EvalOutput (this=0x7fffb4003478, next_edge=...)
    at /data/users/ezyang/pytorch/torch/csrc/autograd/functions/special.h:23
#1  0x00007fffe6cc6430 in std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, 2>::_Sp_counted_ptr_inplace<std::pair<std::shared_ptr<torch::autograd::Function>, int> const&> (this=<optimized out>,
    __a=..., __args=...)
    at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/shared_ptr_base.h:396
#2  __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2> >::construct<std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2>, std::allocator<torch::autograd::EvalOutput> const, std::pair<std::shared_ptr<torch::autograd::Function>, int> const&>(std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2>*, std::allocator<torch::autograd::EvalOutput> const&&, std::pair<std::shared_ptr<torch::autograd::Function>, int> const&) (this=<optimized out>, __p=<optimized out>, __args=..., __args=...)
    at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/ext/new_allocator.h:120
#3  std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2> > >::_S_construct<std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2>, std::allocator<torch::autograd::EvalOutput> const, std::pair<std::shared_ptr<torch::autograd::Function>, int> const&> (__a=..., __p=<optimized out>, __args=..., __args=...)
    at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/alloc_traits.h:254
#4  std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2> > >::construct<std::_Sp_counted_ptr_inplace<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, (__gnu_cxx::_Lock_policy)2>, std::allocator<torch::autograd::EvalOutput> const, std::pair<std::shared_ptr<torch::autograd::Function>, int> const&> (__a=..., __p=<optimized out>, __args=..., __args=...)
    at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/alloc_traits.h:393
#5  std::__shared_count<2>::__shared_count<torch::autograd::EvalOutput, std::allocator<torch::autograd::EvalOutput>, std::pair<std::shared_ptr<torch::autograd::Function>, int> const&> (this=0x7fffbbb356f8, __a=..., __args=...)
    at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/shared_ptr_base.h:502
#6  0x00007fffe6cc632d in std::__shared_ptr<torch::autograd::EvalOutput, 2>::__shared_ptr<std::allocator<torch::autograd::EvalOutput>, std::pair<std::shared_ptr<torch::autograd::Function>, int> const&> (this=0x7fffbbb356f0, __a=..., __args=...,
    __tag=...) at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/shared_ptr_base.h:958
#7  0x00007fffe6cbffa8 in std::unordered_set<std::pair<std::shared_ptr<torch::autograd::Function>, int>, torch::autograd::edge_hasher, std::equal_to<std::pair<std::shared_ptr<torch::autograd::Function>, int> >, std::allocator<std::pair<std::shared_ptr<torch::autograd::Function>, int> > >::begin (this=<optimized out>)
    at /bin/../lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/hashtable_policy.h:258
#8  torch::autograd::Eval::replaceSubgraph (this=0x7fffb4003368, inputs=..., _outputs=..., inherited_placeholders=...)
    at torch/csrc/autograd/functions/special.cpp:152
#9  0x00007fffe6cc119c in torch::autograd::Eval::apply (this=0x1292958, inputs=...)
    at torch/csrc/autograd/functions/special.cpp:199
#10 0x00007fffe6c3ce2a in torch::jit::tracer::(anonymous namespace)::TraceEval::apply (this=0x1292958, inputs=...)
    at torch/csrc/jit/tracer.cpp:39
#11 0x00007fffe6c5e4cc in torch::autograd::Function::operator() (this=0x1292958, inputs=...)
    at /data/users/ezyang/pytorch/torch/csrc/autograd/function.h:111
#12 0x00007fffe6c57c8d in torch::autograd::call_function (task=..., task=...) at torch/csrc/autograd/engine.cpp:208
#13 torch::autograd::Engine::evaluate_function (this=0x7fffe7f41b38 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:220
#14 0x00007fffe6c5725c in torch::autograd::Engine::thread_main (this=0x7fffe7f41b38 <engine>, graph_task=0x0)
    at torch/csrc/autograd/engine.cpp:144
#15 0x00007fffe6c5714c in torch::autograd::Engine::thread_init (this=0x7fffe7f41b38 <engine>, device=-1)
    at torch/csrc/autograd/engine.cpp:121
#16 0x00007fffe6c71256 in torch::autograd::python::PythonEngine::thread_init (this=0x7fffe7f41b38 <engine>, device=-1)
    at torch/csrc/autograd/python_engine.cpp:28
#17 0x00007fffc70ba920 in ?? () from /home/ezyang/local/pytorch/env/lib/libstdc++.so.6
#18 0x00007ffff76bfdc5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007ffff6add76d in clone () from /lib64/libc.so.6

I double-checked the diff and it looked fine: https://phabricator.intern.facebook.com/P58196566 (I created this diff by comparing this PR against a merge of ezyang/jit (b4ad6de43c030f3bbbb85e21f52efe7c268de974) and origin/master (1290e586fbc3d6266423f3417723d6620267054b), in which we left the conflicts unresolved and just kept the conflict markers). So some debugging seems to be in order.

The other tests that segfault are:

apaszke commented 7 years ago

The segfault you linked here was fixed in the SimpleEval PR: https://github.com/ezyang/pytorch/pull/215/files#diff-075c575cc319a2db2a532dcaeb93c403R23

ezyang commented 7 years ago

Aaaand that's why I post things up like this :) Any idea why this isn't failing on ezyang/jit though?

ezyang commented 7 years ago

It doesn't segfault anymore.

apaszke commented 7 years ago

Yeah. Sam reverted a few of my changes, which made it more likely for certain graph edges to be NULL.
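
For illustration only, here is a minimal sketch of the failure mode in the backtrace: frame #0 dereferences `next_edge.first` without checking it, so a NULL edge crashes. The `Function` struct below is a simplified stand-in and `sizes_for_edge` is a hypothetical helper; neither is the actual PyTorch code, and this is not necessarily how the SimpleEval PR fixed it.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for torch::autograd::Function; only the field that
// the crashing line reads is modelled here.
struct Function {
  std::vector<std::vector<long>> input_sizes;
};

// Same shape as the edge type in the backtrace: (function, input index).
using edge_type = std::pair<std::shared_ptr<Function>, int>;

// Hypothetical helper: returns the sizes recorded for the edge's input, or an
// empty vector when the edge has no function attached (a NULL edge).
std::vector<long> sizes_for_edge(const edge_type& edge) {
  if (!edge.first) {
    return {};  // NULL edge: nothing to dereference
  }
  return edge.first->input_sizes.at(static_cast<std::size_t>(edge.second));
}
```

Whether an empty result is the right fallback for a NULL edge depends on the caller; the point is only that the pointer has to be checked before `input_sizes` is read.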

apaszke commented 7 years ago

Can you just squash the "Bug fix" commit into "Add simple mode to Eval"? I kept it as a separate commit just for the code review, but wanted to squash it when merging.

ezyang commented 7 years ago

Bug fix squashed, and everything merged.