allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

CopyNet config issues #3803

Closed Sileadim closed 4 years ago

Sileadim commented 4 years ago


Hi, I was trying to run the CopyNet tutorial described here: https://medium.com/@epwalsh10/incorporating-a-copy-mechanism-into-sequence-to-sequence-models-40917280b89d

Since the config in the linked repo was not up to date, I cobbled together the following:

{
  "dataset_reader": {
    "target_namespace": "target_tokens",
    "type": "copynet_seq2seq",
    "source_token_indexers": {
      "tokens": {
        "type": "single_id",
        "namespace": "source_tokens"
      },
      "token_characters": {
        "type": "characters"
      }
    }
  },
  "vocabulary": {
    "min_count": {
      "source_tokens": 4,
      "target_tokens": 4
    }
  },
  "train_data_path": "data/greetings/train.tsv",
  "validation_data_path": "data/greetings/validation.tsv",
  "model": {
    "type": "copynet_seq2seq",
      "source_embedder": {
      "token_embedders": {
        "tokens": {
          "type": "embedding",
          "vocab_namespace": "source_tokens",
          "embedding_dim": 25,
          "trainable": true
        },
        "token_characters": {
          "type": "character_encoding",
          "embedding": {
            "embedding_dim": 10
          },
          "encoder": {
            "type": "lstm",
            "input_size": 10,
            "hidden_size": 10,
            "num_layers": 2,
            "dropout": 0,
            "bidirectional": true
          }
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 45,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0,
      "bidirectional": true
    },
    "attention": {
      "type": "bilinear",
      "vector_dim": 200,
      "matrix_dim": 200
    },
    "target_embedding_dim": 10,
    "beam_size": 3,
    "max_decoding_steps": 20,
    "token_based_metric": {
      "type": "sequence_accuracy"
    }
  },
  "iterator": {
    "type": "bucket",
    "padding_noise": 0.0,
    "batch_size" : 32,
    "sorting_keys": [ ["source_tokens", "num_tokens"]]
  },
  "trainer": {
    "optimizer": {
      "type": "sgd",
      "lr": 0.1
    },
    "learning_rate_scheduler": {
      "type": "cosine",
      "t_initial": 5,
      "eta_mul": 0.9
    },
    "num_epochs": 10,
    "cuda_device": 0,
    "should_log_learning_rate": true,
    "should_log_parameter_statistics": false
  }
}
Issue 1:

Trying to run it with 0.9.0, I encountered the problem mentioned here: https://github.com/allenai/allennlp/issues/3455 . So I cloned master and tried running that version, but kept getting this error message:

  File "/home/cehmann/repos/allennlp/env/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/home/cehmann/repos/allennlp/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/cehmann/repos/allennlp/allennlp/commands/__init__.py", line 94, in main
    args.func(args)
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 134, in train_model_from_args
    include_package=args.include_package,
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 183, in train_model_from_file
    include_package=include_package,
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 241, in train_model
    batch_weight_key=batch_weight_key,
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 420, in _train_worker
    metrics = train_loop.run()
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 481, in run
    return self.trainer.train()
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 557, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 363, in _train_epoch
    for batch_group in batch_group_generator_tqdm:
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/home/cehmann/repos/allennlp/allennlp/common/util.py", line 134, in lazy_groups_of
    s = list(islice(iterator, group_size))
  File "/home/cehmann/repos/allennlp/allennlp/data/iterators/data_iterator.py", line 145, in __call__
    for batch in batches:
  File "/home/cehmann/repos/allennlp/allennlp/data/iterators/bucket_iterator.py", line 96, in _create_batches
    instance_list = self._sort_by_padding(instance_list)
  File "/home/cehmann/repos/allennlp/allennlp/data/iterators/bucket_iterator.py", line 156, in _sort_by_padding
    for (field_name, padding_key) in self._sorting_keys
  File "/home/cehmann/repos/allennlp/allennlp/data/iterators/bucket_iterator.py", line 156, in <listcomp>
    for (field_name, padding_key) in self._sorting_keys
KeyError: 'num_tokens'
  0%|         

I had a look at the pre-release notes, but as a newbie to AllenNLP I did not see any breaking changes there that would affect my config.

Issue 2:

Following this, I checked out v0.9.0 and applied the aforementioned CUDA argmax fix. This got through one epoch of training, but when the validation metrics were computed I got the following:

loss: 10.3304 ||: 100%|##########| 157/157 [00:06<00:00, 25.44it/s]
2020-02-18 15:35:33,868 - INFO - allennlp.training.trainer - Validating
  0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/cehmann/repos/allennlp/env/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/home/cehmann/repos/allennlp/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/cehmann/repos/allennlp/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 491, in train
    val_loss, num_batches = self._validation_loss()
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 428, in _validation_loss
    loss = self.batch_loss(batch_group, for_training=False)
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 222, in forward
    [x["target_tokens"] for x in metadata])
  File "/home/cehmann/repos/allennlp/allennlp/training/metrics/sequence_accuracy.py", line 37, in __call__
    if gold_labels.dim() != predictions.dim() - 1:
AttributeError: 'list' object has no attribute 'dim'
  0%|          | 0/16 [00:00<?, ?it/s]

The same happens if you choose BLEU as the metric.

EDIT: After replacing the metric with "token_sequence_accuracy" as in the original tutorial and adding

--include-package nlpete.training.metrics

the v0.9.0 code + CUDA fix now runs. So I assume there is a mismatch between what the dataset reader generates and what the implemented metrics expect. Alternatively, removing this field entirely also works, and looking at the output, BLEU and sequence_accuracy are still computed.
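
For reference, the model block in my config now ends with this (the "token_sequence_accuracy" metric is the one registered by the nlpete package):

    "token_based_metric": {
      "type": "token_sequence_accuracy"
    }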

I'm not sure what the problem with the master branch is. My config? I could not find a current CopyNet config in the training_configs folder. Or is there something wrong with the way the CopyNet dataset reader works?

Regards

matt-gardner commented 4 years ago

Thanks for pointing out that you couldn't find anything in the release notes - I just updated them. The problem with master is that you need to remove the sorting_keys parameter from your config file.
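
With that removed, the iterator section of your config would just be the following (on master the bucket iterator will then guess a sorting key on its own):

  "iterator": {
    "type": "bucket",
    "padding_noise": 0.0,
    "batch_size": 32
  }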

Sileadim commented 4 years ago

Hi, thanks for the quick response, but just removing sorting_keys leads to this (at least with driver 430.50, CUDA 10.1 and PyTorch 1.4.0):

2020-02-18 16:27:18,268 - INFO - allennlp.data.iterators.bucket_iterator - No sorting keys given; trying to guess a good one
2020-02-18 16:27:18,671 - INFO - allennlp.data.iterators.bucket_iterator - Using [('source_tokens', 'token_characters___num_token_characters')] as the sorting keys
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [28,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion repeated for many more block/thread indices]
Traceback (most recent call last):
  File "/home/cehmann/repos/allennlp/env/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/home/cehmann/repos/allennlp/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/cehmann/repos/allennlp/allennlp/commands/__init__.py", line 93, in main
    args.func(args)
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 144, in train_model_from_args
    dry_run=args.dry_run,
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 203, in train_model_from_file
    dry_run=dry_run,
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 266, in train_model
    dry_run=dry_run,
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 460, in _train_worker
    metrics = train_loop.run()
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 521, in run
    return self.trainer.train()
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 612, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 391, in _train_epoch
    loss = self.batch_loss(batch, for_training=True)
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 302, in batch_loss
    output_dict = self._pytorch_model(**batch)
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 199, in forward
    state = self._encode(source_tokens)
  File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 317, in _encode
    embedded_input = self._source_embedder(source_tokens)
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 83, in forward
    token_vectors = embedder(list(tensors.values())[0], **forward_params_values)
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/token_embedders/token_characters_encoder.py", line 35, in forward
    return self._dropout(self._encoder(self._embedding(token_characters), mask))
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/time_distributed.py", line 52, in forward
    reshaped_outputs = self._module(*reshaped_inputs, **reshaped_kwargs)
  File "/home/cehmann/repos/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/seq2vec_encoders/pytorch_seq2vec_wrapper.py", line 76, in forward
    self._module, inputs, mask, hidden_state
  File "/home/cehmann/repos/allennlp/allennlp/modules/encoder_base.py", line 97, in sort_and_run_forward
    num_valid = torch.sum(mask[:, 0]).int().item()
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [16,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion repeated for threads [1,0,0] through [11,0,0]]
  0%|                                                                                         | 0/157 [00:00<?, ?it/s]
matt-gardner commented 4 years ago

Hmm, that one I don't know an answer to without a lot more digging. I don't know why that line that's just summing the mask would trigger that CUDA error, unless either (1) the mask is empty, or (2) the CUDA error is actually from a different line.

Sileadim commented 4 years ago

Probably related to this: https://github.com/pytorch/pytorch/issues/1204
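
(A note for anyone debugging this: device-side asserts are raised asynchronously, so the line in the Python traceback is often not the real culprit. Rerunning with synchronous kernel launches, e.g.

CUDA_LAUNCH_BLOCKING=1 allennlp train my_config.json -s /tmp/copynet_debug

where the config path and serialization directory are placeholders, should point closer to the actual failing op.)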

Sileadim commented 4 years ago

Update: I continued working with the v0.9.0 + hotfix branch, but on one test example I ran into the following issue (after disabling CUDA):

Traceback (most recent call last):
  File "/home/cehmann/allennlp/env/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/home/cehmann/allennlp/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/cehmann/allennlp/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/cehmann/allennlp/allennlp/commands/predict.py", line 227, in _predict
    manager.run()
  File "/home/cehmann/allennlp/allennlp/commands/predict.py", line 206, in run
    for model_input_json, result in zip(batch_json, self._predict_json(batch_json)):
  File "/home/cehmann/allennlp/allennlp/commands/predict.py", line 151, in _predict_json
    results = [self._predictor.predict_json(batch_data[0])]
  File "/home/cehmann/allennlp/allennlp/predictors/predictor.py", line 65, in predict_json
    return self.predict_instance(instance)
  File "/home/cehmann/allennlp/allennlp/predictors/predictor.py", line 181, in predict_instance
    outputs = self._model.forward_on_instance(instance)
  File "/home/cehmann/allennlp/allennlp/models/model.py", line 124, in forward_on_instance
    return self.forward_on_instances([instance])[0]
  File "/home/cehmann/allennlp/allennlp/models/model.py", line 153, in forward_on_instances
    outputs = self.decode(self(**model_input))
  File "/home/cehmann/allennlp/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 204, in forward
    predictions = self._forward_beam_search(state)
  File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 519, in _forward_beam_search
    start_predictions, state, self.take_search_step)
  File "/home/cehmann/allennlp/allennlp/nn/beam_search.py", line 111, in search
    start_class_log_probabilities, state = step(start_predictions, start_state)
  File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 745, in take_search_step
    input_choices, selective_weights = self._get_input_and_selective_weights(last_predictions, state)
  File "/home/cehmann/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 590, in _get_input_and_selective_weights
    adjusted_prediction_ids = source_token_ids.gather(-1, adjusted_predictions.unsqueeze(-1))
RuntimeError: Invalid index in gather at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657

After this setback, I tried to dig some more into why master was not working. After disabling CUDA there, I got:

2020-02-20 17:30:55,599 - INFO - allennlp.data.iterators.bucket_iterator - No sorting keys given; trying to guess a good one
  0%|          | 0/1250 [00:00<?, ?it/s]2020-02-20 17:30:56,662 - INFO - allennlp.data.iterators.bucket_iterator - Using [('source_tokens', 'token_characters___num_token_characters')] as the sorting keys
Traceback (most recent call last):
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 460, in _train_worker
    metrics = train_loop.run()
  File "/home/cehmann/repos/allennlp/allennlp/commands/train.py", line 521, in run
    return self.trainer.train()
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 612, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 391, in _train_epoch
    loss = self.batch_loss(batch, for_training=True)
  File "/home/cehmann/repos/allennlp/allennlp/training/trainer.py", line 302, in batch_loss
    output_dict = self._pytorch_model(**batch)
  File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 199, in forward
    state = self._encode(source_tokens)
  File "/home/cehmann/repos/allennlp/allennlp/models/encoder_decoders/copynet_seq2seq.py", line 317, in _encode
    embedded_input = self._source_embedder(source_tokens)
  File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 83, in forward
    token_vectors = embedder(list(tensors.values())[0], **forward_params_values)
  File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/token_embedders/token_characters_encoder.py", line 36, in forward
    embedded = self._embedding(token_characters)
  File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/time_distributed.py", line 52, in forward
    reshaped_outputs = self._module(*reshaped_inputs, **reshaped_kwargs)
  File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cehmann/repos/allennlp/allennlp/modules/token_embedders/embedding.py", line 186, in forward
    sparse=self.sparse,
  File "/home/cehmann/repos/allennlp/master/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 2 out of table with 1 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
  0%|          | 0/1250 [00:04<?, ?it/s]

Apparently some index is out of range in the embedding, so I disabled the character embeddings and the code ran. So I'm wondering:

What changed from 0.9.0 to master regarding the character embeddings? And is the problem with the test example on 0.9.0 related to the problem on master? If so, why would training run on 0.9.0 instead of failing immediately as on master?

matt-gardner commented 4 years ago

On the master question, yes, something changed. See our pre-release notes: https://github.com/allenai/allennlp/releases/tag/v1.0-prerelease, currently the third bullet point under "config file changes".

Sileadim commented 4 years ago

Thank you very much, adding the token_characters vocab_namespace fixed my issues. Regarding the problem with 0.9.0, it could very well be a problem on my side: I'm trying to predict on a JSON string, and even though I escaped the characters, putting a JSON string inside a JSON request for prediction might lead to issues. I assume this can be closed.
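
For anyone landing here later, the fix amounts to setting the namespace explicitly on the character embedding inside source_embedder, along these lines (a sketch against my config above; the namespace name itself is your choice, it just has to be set):

    "token_characters": {
      "type": "character_encoding",
      "embedding": {
        "embedding_dim": 10,
        "vocab_namespace": "token_characters"
      },
      "encoder": {
        "type": "lstm",
        "input_size": 10,
        "hidden_size": 10,
        "num_layers": 2,
        "dropout": 0,
        "bidirectional": true
      }
    }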

matt-gardner commented 4 years ago

Glad you got it working!