allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

[Suggestion] Add parameter `mini_width` or something else to `TokenCharactersIndexer`'s initializer. #1954

Closed WrRan closed 5 years ago

WrRan commented 5 years ago

Describe the bug I used a simple CNN on a text classification task. It worked well during training, but broke during prediction.

To Reproduce For convenience, I provide a test case at allennlp-demo-producing-bug. Steps to reproduce the behavior:

  1. Clone the repository to a local directory: git clone https://github.com/WrRan/allennlp-demo-producing-bug.git demo
  2. Go to demo: cd demo
  3. Begin training:
    export PYTHONPATH=`pwd` && export CUDA_VISIBLE_DEVICES=0
    allennlp train config/cnn_classifier.json -s data/model/cnn --include-package tc
  4. Begin predicting:
    allennlp predict data/model/cnn/model.tar.gz data/data.txt --include-package tc --predictor sentiment_classifier --use-dataset-reader
  5. See error:
    Traceback (most recent call last):
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/run.py", line 20, in <module>
    main(prog="allennlp")
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 70, in main
    args.func(args)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 192, in _predict
    manager.run()
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 168, in run
    for result in self._predict_instances(batch):
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 137, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 53, in predict_instance
    outputs = self._model.forward_on_instance(instance)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/models/model.py", line 124, in forward_on_instance
    return self.forward_on_instances([instance])[0]
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/models/model.py", line 155, in forward_on_instances
    outputs = self.decode(self(**model_input))
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
    File "/home/user_data/wangr/ws/demo/tc/cnn_classifier.py", line 58, in forward
    embedded_text = self._dropout(self.text_field_embedder(text))
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(*tensors)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/token_embedders/token_characters_encoder.py", line 36, in forward
    return self._dropout(self._encoder(self._embedding(token_characters), mask))
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 35, in forward
    reshaped_outputs = self._module(*reshaped_inputs)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/seq2vec_encoders/cnn_encoder.py", line 106, in forward
    self._activation(convolution_layer(tokens)).max(dim=2)[0]
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
    File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 176, in forward
    self.padding, self.dilation, self.groups)
    RuntimeError: Calculated padded input size per channel: (1 x 2). Kernel size: (1 x 3). Kernel size can't be greater than actual input size at ...

Expected behavior Because the test data is the same as the training data, it is unexpected to get a RuntimeError. (Moreover, the predictor used in allennlp predict is trivial.)


Additional context I think the error comes from TokenCharactersIndexer in an edge case under the default settings. The error is produced by the combination of the configuration and a particular datapoint. The CnnEncoder setting:

          "ngram_filter_sizes": [
            2,3,4
          ],

requires every word to consist of no fewer than 4 characters, since the widest filter needs at least 4 characters of input.

configs: https://github.com/WrRan/allennlp-demo-producing-bug/blob/685dcffd3755da60fe3fb360aa6f8338571ce86d/config/cnn_classifier.json#L49-L51

However, there is an unexpected datapoint: I I I I I I I., all of whose tokens are shorter than 4 characters. Thus, prediction breaks.

But the training process works fine. Why? Because batch_size is set to 64, the sentence I I I I I I I. is fed to the model in the same batch as another, longer sentence, The allennlp is awesome.. In that case, TokenCharactersIndexer produces a padding_lengths value of 7 (the length of awesome, the longest word in the batch), so every convolution has enough input.
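
To see the failure in isolation, here is a minimal sketch (not taken from the repo; the channel sizes are arbitrary) of what happens inside CnnEncoder's widest convolution:

import torch
from torch.nn import Conv1d

# Stand-in for the character CNN's widest filter: kernel_size = 4,
# the max of ngram_filter_sizes above. Channel sizes are arbitrary.
conv = Conv1d(in_channels=8, out_channels=16, kernel_size=4)

# During training, "I" is batched with "awesome", so its character
# sequence is padded to length 7 and the convolution succeeds.
conv(torch.randn(1, 8, 7))  # OK

# At prediction time, "I" is padded only to its own length of 1, which is
# smaller than the kernel, so PyTorch raises the RuntimeError above.
conv(torch.randn(1, 8, 1))  # RuntimeError: ... Kernel size can't be greater ...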

Suggestions I think the key is the design of TokenCharactersIndexer, which computes padding_lengths from a single datapoint in isolation (in the method get_padding_lengths). So it may be a good idea to add a parameter mini_width or something similar to TokenCharactersIndexer's initializer. We could then configure it like this:

    "token_indexers": {
      "token_characters": {
        "type": "characters",
        "mini_width": 4
      }
    }

and the class TokenCharactersIndexer may work like this:

from typing import Dict, List

from allennlp.data.token_indexers.token_indexer import TokenIndexer
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer


class TokenCharactersIndexer(TokenIndexer[List[int]]):
    def __init__(self,
                 mini_width: int,
                 namespace: str = 'token_characters',
                 character_tokenizer: CharacterTokenizer = CharacterTokenizer()) -> None:
        self._mini_width = mini_width
        self._namespace = namespace
        self._character_tokenizer = character_tokenizer

    def get_padding_lengths(self, token: List[int]) -> Dict[str, int]:
        # Never report a width smaller than mini_width, so every batch is
        # padded to at least the widest CNN filter.
        return {"num_token_characters": max(len(token), self._mini_width)}

# other stuff
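
For example (hypothetical usage of the sketch above), a one-character token would then be padded up to the minimum width:

indexer = TokenCharactersIndexer(mini_width=4)
# A single-character token such as "I" (here as a list with one character id)
# still reports a padding width of 4, enough for the widest CNN filter.
indexer.get_padding_lengths([42])  # -> {'num_token_characters': 4}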

The above solution resolves my issue. And I think it is better for mini_width to have no default value. Does this make sense?

Look forward to hearing from you. Thanks.

matt-gardner commented 5 years ago

Yes, we've hit this issue multiple times, and I think adding a min_width parameter where you suggest is a good idea. PR welcome. I'd call it min_width or min_num_characters, not mini_width, and give it a default value of 0.

WrRan commented 5 years ago

I am working on it. I think min_padding_length is a better name. And it may be better to provide no default value for this parameter: with no default, users are forced to check that the maximum value of ngram_filter_sizes (in cnn_encoder) matches the value of min_padding_length. Otherwise, this subtle issue may be hit again after an apparently "normal" training run. What do you think? @matt-gardner
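
For illustration, the consistency check I have in mind would look something like this (min_padding_length is the proposed name; nothing here exists in the library yet):

# The indexer's minimum padding must cover the widest character n-gram
# filter the CnnEncoder applies, or a short word can still break prediction.
ngram_filter_sizes = [2, 3, 4]  # from the cnn_encoder config above
min_padding_length = 4          # the proposed TokenCharactersIndexer parameter

assert min_padding_length >= max(ngram_filter_sizes), \
    "min_padding_length must be at least max(ngram_filter_sizes)"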