Open · jacobhinkle opened this issue 1 year ago
Thanks for reporting. Was the repro working before?
No, I haven't gotten it working on any commit. I think the code in question was introduced in October: https://github.com/csarofeen/pytorch/pull/2072/files#diff-147f701ca808989bf9ad700751c261c21fc30cea6c0f434001cdc4c2ce0be042R767
Actually, that wasn't the original commit. The same code was already there. Looks like I added that: https://github.com/csarofeen/pytorch/commit/6d14059cd44247de6af8705e8ba843b65fe638e6
Will look into it.
Lol, also ran into this with cross_entropy, where we have an indices target together with a weight tensor and mean reduction.
This pattern happens when we take_along_axis into weight to compute the divisor for the mean.
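For context, here is a minimal sketch of that pattern using the nvFuser C++ API; the tensor names and shapes are my own illustration, not the actual cross_entropy lowering:

auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

// target: [N] integer class indices; weight: [C] per-class weights
auto tv_target = makeSymbolicTensor(1, DataType::Int);
auto tv_weight = makeSymbolicTensor(1);
fusion->addInput(tv_target);
fusion->addInput(tv_weight);

// Broadcast so weight can be indexed along its class dimension
auto tv_idx = broadcast(tv_target, {false, true});   // [N, 1]
auto tv_w = broadcast(tv_weight, {true, false});     // [1, C]
// Gather weight[target[i]] for every sample
auto tv_gathered = take_along_axis(tv_w, tv_idx, 1); // [N, 1]
auto tv_per_sample = squeeze(tv_gathered, std::vector<bool>{false, true}); // [N]
// The divisor for the mean reduction is the sum of the gathered weights
auto tv_divisor = sum(tv_per_sample, {0});
fusion->addOutput(tv_divisor);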
@jacobhinkle Is this fixed now that #192 is merged?
Getting the same segfault. If I guard the nullptr deref at root_domain_map.cpp:753, then that TORCH_INTERNAL_ASSERT still fails.
(Note to self) #206 has a disabled test. Make sure to check that test as well.
The following C++ version uses take_along_axis to compute the original example, and it works properly following the merge of #240:
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

int64_t vocab_size = 300;
int64_t embedding_dim = 96;
int64_t sentence_len = 256;
int64_t batch_size = 20;

// tv0: [batch, sentence] integer token indices; tv1: [vocab, embedding] table
auto tv0 = makeSymbolicTensor(2, DataType::Int);
auto tv1 = makeSymbolicTensor(2);
fusion->addInput(tv0);
fusion->addInput(tv1);

// Broadcast both inputs to 4D so the table can be gathered along dim 2
auto tv2 = broadcast(tv0, {false, false, true, true});
auto tv3 = broadcast(tv1, {true, true, false, false});
// auto tv4 = torch_gather(tv3, 1, tv2);
auto tv4 = take_along_axis(tv3, tv2, 2);
// Drop the size-1 dimension left over from the gather
auto tv5 = squeeze(tv4, std::vector<bool>{false, false, true, false});
fusion->addOutput(tv5);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::manual_seed(0);
auto terms = at::randint(0, vocab_size, {batch_size, sentence_len}, options.dtype(at::kLong));
auto embedding_table = at::randn({vocab_size, embedding_dim}, options);
std::vector<c10::IValue> aten_inputs({terms, embedding_table});

FusionExecutorCache fec(std::move(fusion));
auto cg_outputs = fec.runFusionWithInputs(aten_inputs);

// Reference: the fusion as a whole is just an embedding lookup
auto ref = at::embedding(embedding_table, terms);
TORCH_CHECK(ref.equal(cg_outputs[0]));
Note that in the above code, swapping in torch_gather instead of take_along_axis results in the following error:
C++ exception with description "root_ind != nullptr INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/index_compute.cpp":1983, please report a bug to PyTorch. Couldn't find root mapping for T3_g[ bS8{1}, bS9{1}, iS80{T3.size[2]}, iS81{T3.size[3]} ] dim: 2 id: iS80{T3.size[2]} Exception raised from getProducerRootIndices at /opt/pytorch/nvfuser/csrc/index_compute.cpp:1983
I have left this open for now since the segfault is replaced by that failed assertion. However, we could also close it, as the workaround is to just use take_along_axis unless we know we might need to shrink some of the non-index axes.
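For reference, the swap in question amounts to the change sketched below; note that torch_gather takes the dimension as its second argument while take_along_axis takes it last, and the dimension value here is my assumption based on the dim reported in the error:

// auto tv4 = take_along_axis(tv3, tv2, 2);
auto tv4 = torch_gather(tv3, 2, tv2);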
The following code results in a segfault as of yesterday (e.g., commit 1a5db862df21e5dabaeb0f3648a012ea60cee8c3).
A partial backtrace is here:
The problem comes from dereferencing tv->definition() without checking for nullptr at https://github.com/NVIDIA/Fuser/blob/main/csrc/root_domain_map.cpp#L794. Changing this line to check for a null definition first raises an informative (but uncaught) exception, and the Python script exits with a RuntimeError.
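For illustration, the kind of guard described above might look roughly like this; it is a sketch of the idea, not the actual change, and assumes the surrounding code can simply assert before the dereference:

// Hypothetical guard before dereferencing tv->definition()
Expr* def = tv->definition();
TORCH_CHECK(
    def != nullptr,
    "Missing definition for ",
    tv->toString(),
    " while computing the root domain map");
// ... use def where tv->definition() was dereferenced before ...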