ezyang / pytorch-unattached

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Reproduction for TensorMetas problem #232

Closed: ezyang closed this issue 7 years ago

ezyang commented 7 years ago

I took Zach's PR at https://github.com/ezyang/pytorch/pull/224 and reproduced his problem on word_language_model. While debugging, I noticed that TensorMeta was not initialized correctly and fixed it ("TensorMeta uninitialized member bugfix"), though this did not fix word_language_model. I then added two minimized test cases from WLM that induce the TensorMeta failure. The root problem: backwards returns None as the grad_input for an input that was a leaf variable. When null is returned we attempt to reconstitute a TensorMeta, but because the input is a leaf variable, input_sizes was never set on its next_function.
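To make the failure mode above concrete, here is a schematic, self-contained sketch (this is not PyTorch's actual C++ implementation; the names `input_sizes`, `next_function`, and the helper functions are stand-ins following the issue's terminology):

```python
# Schematic sketch of the reported failure: backward returns None for a
# leaf-variable input, and reconstituting a TensorMeta from its edge
# fails because input_sizes was never recorded for a leaf.

class NextFunction:
    """Stand-in for the edge to the function that produced an input."""
    def __init__(self, input_sizes=None):
        # For a leaf variable, no function ever ran "forward" through
        # this edge, so input_sizes is never set.
        self.input_sizes = input_sizes

def reconstitute_tensor_meta(next_fn):
    # TensorMeta reconstruction needs the recorded sizes; for a leaf
    # variable they were never set, which is where the repro fails.
    if next_fn.input_sizes is None:
        raise RuntimeError("TensorMeta: input_sizes never set (leaf variable)")
    return {"sizes": next_fn.input_sizes}

def accumulate_grad(grad, next_fn):
    # backward returned None (null) for this input: the engine tries to
    # turn it into a real tensor via TensorMeta, which blows up on leaves.
    if grad is None:
        return reconstitute_tensor_meta(next_fn)
    return grad

interior_edge = NextFunction((3, 4))  # interior variable: sizes recorded
leaf_edge = NextFunction()            # leaf variable: sizes never recorded

print(accumulate_grad(None, interior_edge))  # fine: sizes are known
try:
    accumulate_grad(None, leaf_edge)         # reproduces the failure
except RuntimeError as e:
    print(e)
```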

I experimented with a fix that deletes TensorMetas altogether ("DO NOT MERGE. Proof of concept alternative to TensorMetas."), but this still fails on WLM. The minimal repro is in "Minimized, failing test case".

Finally, "Take out GIL before printing Python scalars." and "Dump the forward only trace once we have it." are two minor fixes that make debugging easier: the first solves an observed segfault when dumping graphs from C++ code, and the second addresses the fact that a failure in backwards meant that setting PYTORCH_JIT_DUMP=1 did not dump anything.

ezyang commented 7 years ago

My general commentary: TensorMetas seem to be the wrong way to solve the null-variable problem. If a function returns null variables for some of its outputs, there is no way for such a variable to continue participating in traces (it's None, after all!), so there is no point in tying ourselves in knots to turn it into a real variable. However, for some reason my simple patch doesn't work.
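The alternative being argued for here can be sketched as follows (an assumed, heavily simplified illustration, not the actual patch: scalars stand in for gradient tensors, and `accumulate` is a hypothetical helper):

```python
# Sketch of the alternative: let a null gradient stay None and skip it
# during accumulation, rather than reconstituting a concrete value from
# stored TensorMeta metadata.

def accumulate(existing, incoming):
    # A None grad_input means "no gradient flows here"; leave it as
    # None instead of manufacturing a zero value from metadata.
    if incoming is None:
        return existing
    if existing is None:
        return incoming
    return existing + incoming

grads = None
for g in [None, 2.0, None, 3.0]:
    grads = accumulate(grads, g)
print(grads)  # 5.0
```

With this policy a leaf variable that only ever receives None simply ends up with no gradient, and nothing ever needs input_sizes.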

apaszke commented 7 years ago

Fix is in #233.