CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Pointer-generator crashes when source and target are disjoint #156

Open Othergreengrasses opened 4 months ago

Othergreengrasses commented 4 months ago

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/models/base.py", line 286, in validation_step
    greedy_predictions = self(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/models/pointer_generator.py", line 412, in forward
    predictions = self.decode(
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/models/pointer_generator.py", line 345, in decode
    output, decoder_hiddens = self.decode_step(
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/models/pointer_generator.py", line 286, in decode_step
    gen_probs = self.generation_probability(context, hidden, embedded)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/models/pointer_generator.py", line 72, in forward
    p_gen += self.W_emb(target_embeddings) + self.bias.expand(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/yoyodyne-train", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/train.py", line 338, in main
    best_checkpoint = train(trainer, model, datamodule, args.train_from)
  File "/usr/local/lib/python3.10/dist-packages/yoyodyne/train.py", line 232, in train
    trainer.fit(model, datamodule, ckpt_path=train_from)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1175, in _teardown
    self.strategy.teardown()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 499, in teardown
    self.accelerator.teardown()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/accelerators/cuda.py", line 75, in teardown
    _clear_cuda_memory()
  File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/accelerators/cuda.py", line 366, in _clear_cuda_memory
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 125, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

kylebgorman commented 4 months ago

Could you provide us the data and code necessary to reproduce this? Perhaps as a Gist...

Othergreengrasses commented 4 months ago

https://gist.github.com/Othergreengrasses/d00c4a0e29e4331f21ab907469b19cea

Please note that the .tsv files are sample files.

kylebgorman commented 4 months ago

This is probably not germane to the issue---which I'm trying to reproduce now---but you pass the gradient_clip_val argument twice. You also forgot to tell the system that you have a space between each character: see here. You want --source_sep ' ' --target_sep ' '.

Othergreengrasses commented 4 months ago

Well, I got the same error after implementing your suggestions.

kylebgorman commented 4 months ago

I suspect I will as well; will try when I get home.

Adamits commented 4 months ago

I tried to reproduce this on CPU and got a different error:

  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/pointer_generator.py", line 284, in decode_step
    ptr_dist.scatter_add_(
RuntimeError: index 31 is out of bounds for dimension 2 with size 30

This is familiar---I believe this is essentially trying to assign probability mass to a target index copied from a source index (which is what the 'pointer' module does), and that index is out of vocabulary (OOV) for the target. I recall this issue coming up after the big refactor to modularize encoders and decoders, and that we solved it by just merging the source and target vocabularies for pointer-generators (is that fix gone now?).
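
For illustration, here is a minimal standalone sketch (not yoyodyne code; all the names and shapes are made up) of the same failure mode: scatter_add_ raises exactly this error whenever an index value exceeds the size of the dimension being scattered into.

    import torch

    # Hypothetical sizes: a target distribution over 30 symbols.
    target_vocab_size = 30
    ptr_dist = torch.zeros(1, 1, target_vocab_size)
    # A source symbol indexed at 31 has no slot in the target distribution.
    source_indices = torch.tensor([[[31]]])
    attention_weights = torch.ones(1, 1, 1)
    # Raises: RuntimeError: index 31 is out of bounds for dimension 2 with size 30
    ptr_dist.scatter_add_(2, source_indices, attention_weights)

On GPU the same out-of-bounds index surfaces as a device-side assert instead, which is presumably why the CUDA stack trace above is so much less informative.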

However, this raises the question: why would you use a pointer-generator when the source and target vocabularies are disjoint? Unless you are adding some special mapping from source to target items?

I am also not sure why the error on GPU throws at the next line, unless you have made some changes.

kylebgorman commented 4 months ago

Thanks for looking into this, @Adamits. (Sorry @Othergreengrasses, I wasn't in front of a computer with a GPU, so I was just looking at the CLI arguments.) Thanks for finding this bug, @Othergreengrasses; it's a serious one we should address.

So Adam makes a good point: this is not a good case for a pointer-generator since the vocabularies are ("completely") disjoint. (And this is maybe my fault, because I think I suggested a pointer-generator for the Bengali data, and now I see it doesn't make sense, at least for this... this is the Google data, right?) That said, we should produce a sensible error instead of a messy one; it was not at all obvious to me what the error was from that stack trace, nor would I expect it to be to anyone else.

Maybe we should detect the vocabulary disjointness and error out with a readable message. What do you think? Implementation-wise, there's just the question of which bit of the trainer is responsible for throwing this error---it has to be something that can see the architecture type but also the vocabularies/index. One possibility is to add some kind of simple "trait" system to the architectures so that we can do something like:

    index = ...
    model_class = ...
    if model_class.requires_overlapping_vocabulary and not index.has_overlapping_vocabulary:
        raise ModelDataIncompatibilityError(
            f"{model_class} requires an overlapping vocabulary"
        )

There are a few other misconfigurations of a similar ilk we might want to detect: e.g., if you request a feature-invariant transformer but don't have features... (I don't know what happens in that case. Maybe it works and just trains an ordinary transformer, or maybe it crashes with an uninformative error. But it's probably better to crash with an informative one.)
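
To sketch that a bit more concretely---and to be clear, every name below (requires_overlapping_vocabulary, has_overlapping_vocabulary, ModelDataIncompatibilityError, check_compatibility) is hypothetical and does not exist in yoyodyne today---the trait could just be a class attribute plus one check:

    class ModelDataIncompatibilityError(Exception):
        """Raised when the chosen architecture cannot be used with the data."""


    class BaseModel:
        # Default trait: most architectures don't care about vocabulary overlap.
        requires_overlapping_vocabulary = False


    class PointerGeneratorModel(BaseModel):
        # Pointer-generators copy source symbols into the target distribution,
        # so the two vocabularies must share symbols.
        requires_overlapping_vocabulary = True


    def check_compatibility(model_class, index) -> None:
        # `index.has_overlapping_vocabulary` is assumed; the real index class
        # would need to expose whether source and target share any symbols.
        if (
            model_class.requires_overlapping_vocabulary
            and not index.has_overlapping_vocabulary
        ):
            raise ModelDataIncompatibilityError(
                f"{model_class.__name__} requires overlapping source/target vocabularies"
            )

The check could then run once in train.py, after the index and model class are resolved but before trainer.fit() is called, so the misconfiguration fails fast with a readable message.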

bonham79 commented 4 months ago

Hey y'all, I asked for this error to be put up. I think the main issue is this line:

p_gen += self.W_emb(target_embeddings) + self.bias.expand(...

PyTorch gets angry about this in-place operation.

kylebgorman commented 4 months ago

Hey y'all, I asked for this error to be put up. I think the main issue is this line:

p_gen += self.W_emb(target_embeddings) + self.bias.expand(...

PyTorch gets angry about this in-place operation.

I don't think it should work anyways; it should be an error.

Adamits commented 4 months ago

p_gen += self.W_emb(target_embeddings) + self.bias.expand(...

Can you reproduce this with other data? I thought this issue turns up during backprop. As far as I can tell, this is happening during validation in the initial sanity check.

EDIT: though what you are saying does make sense, given that it complains about _clear_cuda_memory. I just don't think we have seen this before? Could it be an implementation of tensor summation particular to your kernel? But even if you update this to, e.g., p_gen_out = p_gen + ..., I would expect it to error for other reasons, I think?

bonham79 commented 4 months ago

Yeah, you're right: the initial issue with the disjoint vocabulary would be the problem in validation. I only noticed the in-place operation issue. Two problems in one?

kylebgorman commented 4 months ago

My $.02 is that when they work, in-place operations are honkin' great, but sometimes they have a weird effect on the computation graph.
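
For anyone following along, here is a minimal standalone example (deliberately unrelated to the yoyodyne internals) of how an in-place update can invalidate the computation graph:

    import torch

    w = torch.randn(3, requires_grad=True)
    p_gen = torch.sigmoid(w)  # autograd saves sigmoid's output for the backward pass
    p_gen += 1.0              # in-place update overwrites that saved tensor
    loss = p_gen.sum()
    # Raises at backward time:
    # RuntimeError: one of the variables needed for gradient computation has been
    # modified by an inplace operation
    loss.backward()

Rewriting the update out-of-place (e.g. p_gen = p_gen + ...) avoids this, which is presumably what the p_gen_out suggestion above is getting at.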