Thanks for pointing out that you couldn't find anything in the release notes - I just updated them. The problem with master is that you need to remove the sorting_keys
parameter from your config file.
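For illustration, a minimal sketch of the relevant iterator block; the batch size and the old sorting_keys value are assumptions based on the CopyNet tutorial setup, not taken from your config:

```jsonnet
// Minimal sketch of a bucket-iterator block; values are illustrative.
{
  "iterator": {
    "type": "bucket",
    "batch_size": 32
    // On 0.9.0 this block would also contain something like
    //   "sorting_keys": [["source_tokens", "num_tokens"]],
    // which has to be dropped on master so the iterator can guess a key itself.
  }
}
```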
Hi, thanks for the quick response, but just removing sorting_keys leads to this (at least with driver 430.50, CUDA 10.1, and PyTorch 1.4.0):
2020-02-18 16:27:18,268 - INFO - allennlp.data.iterators.bucket_iterator - No sorting keys given; trying to guess a good one
2020-02-18 16:27:18,671 - INFO - allennlp.data.iterators.bucket_iterator - Using [('source_tokens', 'token_characters___num_token_characters')] as the sorting keys
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [28,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [28,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [10,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [10,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [10,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [10,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [10,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [10,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/cehmann/repos/allennlp/env/bin/allennlp", line 11, in <module>
load_entry_point('allennlp', 'console_scripts', 'allennlp')()
File "/home/cehmann/repos/allennlp/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/cehmann/repos/allennlp/allennlp/commands/__init__.py", line 93, in main
args.func(args)
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 144, in train_model_from_args
dry_run=args.dry_run,
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 203, in train_model_from_file
dry_run=dry_run,
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 266, in train_model
dry_run=dry_run,
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 460, in _train_worker
metrics = train_loop.run()
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 521, in run
return self.trainer.train()
File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 612, in train
train_metrics = self._train_epoch(epoch)
File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 391, in _train_epoch
loss = self.batch_loss(batch, for_training=True)
File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 302, in batch_loss
output_dict = self._pytorch_model(**batch)
File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 199, in forward
state = self._encode(source_tokens)
File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 317, in _encode
embedded_input = self._source_embedder(source_tokens)
File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 83, in forward
token_vectors = embedder(list(tensors.values())[0], **forward_params_values)
File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/token_embedders/token_characters_encoder.py", line 35, in forward
return self._dropout(self._encoder(self._embedding(token_characters), mask))
File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/time_distributed.py", line 52, in forward
reshaped_outputs = self._module(*reshaped_inputs, **reshaped_kwargs)
File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/seq2vec_encoders/pytorch_seq2vec_wrapper.py", line 76, in forward
self._module, inputs, mask, hidden_state
File "/home/cehmann/repos/allennlp/allennlp/modules/encoder_base.py", line 97, in sort_and_run_forward
num_valid = torch.sum(mask[:, 0]).int().item()
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
0%| | 0/157 [00:00<?, ?it/s]
Hmm, that one I don't know an answer to without a lot more digging. I don't know why that line that's just summing the mask would trigger that CUDA error, unless either (1) the mask is empty, or (2) the CUDA error is actually from a different line.
Probably related to this: https://github.com/pytorch/pytorch/issues/1204
Update: I continued working with the v0.9.0 + hotfix branch, but on one test example I ran into the following issue (after disabling CUDA):
Traceback (most recent call last):
File "/home/cehmann/allennlp/env/bin/allennlp", line 11, in <module>
load_entry_point('allennlp', 'console_scripts', 'allennlp')()
File "/home/cehmann/allennlp/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/cehmann/allennlp/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/home/cehmann/allennlp/allennlp/commands/predict.py", line 227, in _predict
manager.run()
File "/home/cehmann/allennlp/allennlp/commands/predict.py", line 206, in run
for model_input_json, result in zip(batch_json, self._predict_json(batch_json)):
File "/home/cehmann/allennlp/allennlp/commands/predict.py", line 151, in _predict_json
results = [self._predictor.predict_json(batch_data[0])]
File "/home/cehmann/allennlp/allennlp/predictors/predictor.py", line 65, in predict_json
return self.predict_instance(instance)
File "/home/cehmann/allennlp/allennlp/predictors/predictor.py", line 181, in predict_instance
outputs = self._model.forward_on_instance(instance)
File "/home/cehmann/allennlp/allennlp/models/model.py", line 124, in forward_on_instance
return self.forward_on_instances([instance])[0]
File "/home/cehmann/allennlp/allennlp/models/model.py", line 153, in forward_on_instances
outputs = self.decode(self(**model_input))
File "/home/cehmann/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 204, in forward
predictions = self._forward_beam_search(state)
File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 519, in _forward_beam_search
start_predictions, state, self.take_search_step)
File "/home/cehmann/allennlp/allennlp/nn/beam_search.py", line 111, in search
start_class_log_probabilities, state = step(start_predictions, start_state)
File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 745, in take_search_step
input_choices, selective_weights = self._get_input_and_selective_weights(last_predictions, state)
File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 590, in _get_input_and_selective_weights
adjusted_prediction_ids = source_token_ids.gather(-1, adjusted_predictions.unsqueeze(-1))
RuntimeError: Invalid index in gather at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657
After this setback, I tried to do some more digging into why master was not working. After disabling CUDA there, I got:
2020-02-20 17:30:55,599 - INFO - allennlp.data.iterators.bucket_iterator - No sorting keys given; trying to guess a good one
0%| | 0/1250 [00:00<?, ?it/s]2020-02-20 17:30:56,662 - INFO - allennlp.data.iterators.bucket_iterator - Using [('source_tokens', 'token_characters___num_token_characters')] as the sorting keys
Traceback (most recent call last):
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 460, in _train_worker
metrics = train_loop.run()
File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 521, in run
return self.trainer.train()
File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 612, in train
train_metrics = self._train_epoch(epoch)
File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 391, in _train_epoch
loss = self.batch_loss(batch, for_training=True)
File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 302, in batch_loss
output_dict = self._pytorch_model(**batch)
File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 199, in forward
state = self._encode(source_tokens)
File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 317, in _encode
embedded_input = self._source_embedder(source_tokens)
File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 83, in forward
token_vectors = embedder(list(tensors.values())[0], **forward_params_values)
File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/token_embedders/token_characters_encoder.py", line 36, in forward
embedded = self._embedding(token_characters)
File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/time_distributed.py", line 52, in forward
reshaped_outputs = self._module(*reshaped_inputs, **reshaped_kwargs)
File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/cehmann/repos/allennlp/allennlp/modules/token_embedders/embedding.py", line 186, in forward
sparse=self.sparse,
File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 2 out of table with 1 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
0%| | 0/1250 [00:04<?, ?it/s]
Apparently there is some wrong index in the embedding, so I disabled character embeddings and this made the code run. So I'm wondering:
What changed from 0.9.0 -> master regarding the character embeddings? And is the problem with the test example from 0.9.0 related to the problem in master? If so, why would the training run in 0.9.0 and not fail immediately like in master?
On the master question, yes, something changed. See our pre-release notes: https://github.com/allenai/allennlp/releases/tag/v1.0-prerelease, currently the third bullet point under "config file changes".
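Concretely, the change means the character embedding needs an explicit vocab_namespace. A hypothetical source_embedder block showing where it goes; everything other than "vocab_namespace": "token_characters" is illustrative, not taken from the original config:

```jsonnet
// Hypothetical source_embedder block; only vocab_namespace is the essential part.
{
  "source_embedder": {
    "token_embedders": {
      "tokens": {
        "type": "embedding",
        "embedding_dim": 100
      },
      "token_characters": {
        "type": "character_encoding",
        "embedding": {
          "embedding_dim": 25,
          // On master the namespace must be given explicitly here (see the third
          // bullet under "config file changes" in the pre-release notes):
          "vocab_namespace": "token_characters"
        },
        "encoder": {
          "type": "cnn",
          "embedding_dim": 25,
          "num_filters": 50,
          "ngram_filter_sizes": [3]
        }
      }
    }
  }
}
```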
Thank you very much, adding the token_character vocab_namespace fixed my issues. Regarding the problem with 0.9.0, this could very well be a problem on my side, as I'm trying to predict a JSON string, and even though I escaped the characters, wrapping a JSON string inside the JSON input used for prediction might lead to some issues. I assume this can be closed.
Glad you got it working!
System
Hi, I was trying to run the copynet tutorial as mentioned here: https://medium.com/@epwalsh10/incorporating-a-copy-mechanism-into-sequence-to-sequence-models-40917280b89d
As the config in the linked repo was not up to date, I cobbled together:
Trying to run it with 0.9.0, I encountered the problem mentioned here: https://github.com/allenai/allennlp/issues/3455. So I cloned master and tried running that version, but kept getting this error message:
I had a look at the pre-release log, and as a newbie to allennlp I did not see any breaking changes there that would affect my config.
Following this, I checked out v0.9.0 and implemented the aforementioned argmax CUDA fix. This trained for 1 epoch, but when the validation metrics were computed I got the following:
The same happens if you choose BLEU as the metric.
EDIT: After replacing the metric with "token_sequence_accuracy" as in the original tutorial and adding
The code (v0.9.0 + CUDA fix) now runs, so I assume there is a mismatch between whatever is generated by the dataset reader and what the implemented metrics expect. Alternatively, removing this field also works, and looking at the output, BLEU and sequence accuracy are still computed.
I'm not sure what the problem is with the master branch. Is it my config? I could not find a current CopyNet config in the training_configs folder. Or is there something wrong with the way the CopyNet dataset reader works?
Regards