allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

[Bug] It fails when using wordsplit with pos_tag or ner_tag. #1929

Closed WrRan closed 6 years ago

WrRan commented 6 years ago

Describe the bug When I train bidaf with pos_tag or ner_tag, it fails to begin training.

To Reproduce Steps to reproduce the behavior:

  1. Copy training_config/bidaf.jsonnet to custom directory.
  2. Modify bidaf.jsonnet as follows: a) change the token_indexers of the dataset_reader:
    "token_indexers": {
      // config `tokens` and `token_characters` as before ...
      "pos_tag": {
        "type": "pos_tag"
      },
      "ner_tag": {
        "type": "ner_tag"
      }
    }

    b) change text_field_embedder of model:

    "text_field_embedder": {
       // config token_embedders here ...
      "allow_unmatched_keys": true
    },
  3. Run the command: allennlp train config/bidaf.jsonnet -s data/bidaf
  4. See one of the following errors:
    KeyError: '@@UNKNOWN@@'

    or

    RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Expected behavior No error and the same behavior as before.


Additional context I have found three code snippets that may be related to this bug, and I am trying to fix it.

matt-gardner commented 6 years ago

If you want to use POS and NER tags, you need to be sure you also modify the tokenizer to produce these tags. The default tokenizer for BiDAF only splits words; it doesn't run any tagger on them (that would dramatically slow down the data processing).

The particular errors that you give are hard to diagnose, however, as you're not giving a complete stack trace. If you give us more detail, we'd be able to help you better.

WrRan commented 6 years ago

Sorry for my carelessness. I forgot to provide the tokenizer config. For convenience, here is the full configuration:

{
  "dataset_reader": {
    "type": "squad",
    "tokenizer": {
      "type": "word",
      "word_splitter": {
        "type": "spacy",
        "pos_tags": true,
        "ner": true
      }
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "token_characters": {
        "type": "characters",
        "character_tokenizer": {
          "byte_encoding": "utf-8",
          "start_tokens": [259],
          "end_tokens": [260]
        }
      },
      "pos_tag": {
        "type": "pos_tag"
      },
      "ner_tag": {
        "type": "ner_tag"
      }
    }
  },
  // To speed up data processing, it is recommended to point these paths at a small test set instead of the default full SQuAD data.
  "train_data_path": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/squad/squad-train-v1.1.json",
  "validation_data_path": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/squad/squad-dev-v1.1.json",
  "model": {
    "type": "bidaf",
    "text_field_embedder": {
      "token_embedders": {
          "tokens": {
              "type": "embedding",
              "embedding_dim": 100,
              "trainable": false
          },
          "token_characters": {
              "type": "character_encoding",
              "embedding": {
                "num_embeddings": 262,
                "embedding_dim": 16
              },
              "encoder": {
                "type": "cnn",
                "embedding_dim": 16,
                "num_filters": 100,
                "ngram_filter_sizes": [5]
              },
              "dropout": 0.2
          }
      },
      "allow_unmatched_keys": true
    },
    "num_highway_layers": 2,
    "phrase_layer": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 200,
      "hidden_size": 100,
      "num_layers": 1,
      "dropout": 0.2
    },
    "similarity_function": {
      "type": "linear",
      "combination": "x,y,x*y",
      "tensor_1_dim": 200,
      "tensor_2_dim": 200
    },
    "modeling_layer": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 800,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.2
    },
    "span_end_encoder": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 1400,
      "hidden_size": 100,
      "num_layers": 1,
      "dropout": 0.2
    },
    "dropout": 0.2
  },
  "iterator": {
    "type": "bucket",
    "sorting_keys": [["passage", "num_tokens"], ["question", "num_tokens"]],
    "batch_size": 40
  },

  "trainer": {
    "num_epochs": 20,
    "grad_norm": 5.0,
    "patience": 10,
    "validation_metric": "+em",
    "cuda_device": -1,
    "learning_rate_scheduler": {
      "type": "reduce_on_plateau",
      "factor": 0.5,
      "mode": "max",
      "patience": 2
    },
    "optimizer": {
      "type": "adam",
      "betas": [0.9, 0.9]
    }
  }
}
matt-gardner commented 6 years ago

Ok, good to see that you updated the tokenizer. We're still going to need a complete stack trace to have any hope of helping you.

WrRan commented 6 years ago

Thanks for your reply and concern. :smiley: When running the command allennlp train config/bidaf.jsonnet -s data/bidaf, I got the following exception:

Traceback (most recent call last):
  File "/home/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/run.py", line 18, in <module>
    main(prog="allennlp")
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 70, in main
    args.func(args)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/train.py", line 101, in train_model_from_args
    args.recover)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/train.py", line 131, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/train.py", line 321, in train_model
    metrics = trainer.train()
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/training/trainer.py", line 749, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/training/trainer.py", line 482, in _train_epoch
    for batch in train_generator_tqdm:
  File "/home/root/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 937, in __iter__
    for obj in iterable:
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/iterators/data_iterator.py", line 143, in __call__
    for batch in batches:
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/iterators/bucket_iterator.py", line 116, in _create_batches
    self._padding_noise)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/iterators/bucket_iterator.py", line 28, in sort_by_padding
    instance.index_fields(vocab)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/instance.py", line 60, in index_fields
    field.index(vocab)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/fields/text_field.py", line 58, in index
    token_indices = indexer.tokens_to_indices(self.tokens, vocab, indexer_name)
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/token_indexers/pos_tag_indexer.py", line 64, in tokens_to_indices
    return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/token_indexers/pos_tag_indexer.py", line 64, in <listcomp>
    return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}
  File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/vocabulary.py", line 591, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

I found this exception to be caused by the following code snippet in allennlp/data/token_indexers/pos_tag_indexer.py:

class PosTagIndexer(TokenIndexer[int]):
    def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
        if self._coarse_tags:
            tag = token.pos_
        else:
            tag = token.tag_
        if not tag:
            if token.text not in self._logged_errors:
                logger.warning("Token had no POS tag: %s", token.text)
                self._logged_errors.add(token.text)
            tag = 'NONE'
        counter[self._namespace][tag] += 1

    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        tags: List[str] = []

        for token in tokens:
            if self._coarse_tags:
                tag = token.pos_
            else:
                tag = token.tag_
            # Note: unlike count_vocab_items above, this misses the empty string ''.
            if tag is None:
                tag = 'NONE'

            tags.append(tag)

        return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}

# other code omitted ...

In this code snippet, count_vocab_items treats any falsy tag as 'NONE' (the `if not tag:` check covers both None and the empty string ''); however, tokens_to_indices only treats `tag is None` as 'NONE'. An empty-string tag therefore reaches the vocabulary lookup without ever having been counted, and since tag namespaces are non-padded and have no @@UNKNOWN@@ token, the OOV fallback raises the KeyError above. Based on this observation, the fix is:

class PosTagIndexer(TokenIndexer[int]):
    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        tags: List[str] = []

        for token in tokens:
            if self._coarse_tags:
                tag = token.pos_
            else:
                tag = token.tag_
            # instead of `if tag is None:`
            if not tag:
                tag = 'NONE'

            tags.append(tag)

        return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}

# other code omitted ...

Similarly, the same modification should be applied to class NerTagIndexer:

class NerTagIndexer(TokenIndexer[int]):
    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        # instead of `tags = ['NONE' if token.ent_type_ is None else token.ent_type_ for token in tokens]`
        tags = ['NONE' if not token.ent_type_ else token.ent_type_ for token in tokens]

        return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}
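
For reference, spaCy returns the empty string '' (not None) for tokens that have no entity type, which is exactly why the `is None` checks above never fire. A quick way to see this, assuming spaCy and its small English model are installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The allennlp is awesome.")
for token in doc:
    # ent_type_ is '' (falsy, but not None) for tokens outside any entity.
    print(repr(token.text), repr(token.tag_), repr(token.ent_type_))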

However, there is still a subtle bug in the method allennlp/nn/util.py#get_text_field_mask. Taking the sentence "The allennlp is awesome." as an example, we get the following tokens:

# text/pos_tag/ner_tag
The/DET/NONE
allennlp/NOUN/NONE
is/VERB/NONE
awesome/ADJ/NONE
./PUNCT/NONE

In the method get_text_field_mask, the mask is computed from just one of the indexed tensors. When that tensor happens to be ner_tag, we get the wrong mask: tag namespaces are non-padded by default, so 'NONE' can legitimately map to index 0, and a sentence whose entity tags are all 'NONE' yields the all-zero mask [0, 0, 0, 0, 0].

I think the following code snippet could be helpful for this issue:

from typing import Dict

import torch


def get_text_field_mask(text_field_tensors: Dict[str, torch.Tensor],
                        num_wrapping_dims: int = 0) -> torch.LongTensor:
    if "mask" in text_field_tensors:
        return text_field_tensors["mask"]

    tensor_dims = [(tensor.dim(), tensor) for tensor in text_field_tensors.values()]
    tensor_dims.sort(key=lambda x: x[0])

    smallest_dim = tensor_dims[0][0] - num_wrapping_dims
    if smallest_dim == 2:
        # Sum every (batch_size, num_tokens) tensor, so a position counts as
        # padding only if *all* indexers produced a zero there.
        token_tensor = tensor_dims[0][1]
        for ix in range(1, len(tensor_dims)):
            if smallest_dim != tensor_dims[ix][0] - num_wrapping_dims:
                break
            # Plain `+` rather than `+=`, to avoid mutating the input tensors.
            token_tensor = token_tensor + tensor_dims[ix][1]
        return (token_tensor != 0).long()
    elif smallest_dim == 3:
        # Same idea for (batch_size, num_tokens, num_characters) tensors.
        character_tensor = tensor_dims[0][1]
        for ix in range(1, len(tensor_dims)):
            if smallest_dim != tensor_dims[ix][0] - num_wrapping_dims:
                break
            character_tensor = character_tensor + tensor_dims[ix][1]
        return ((character_tensor > 0).long().sum(dim=-1) > 0).long()
    else:
        raise ValueError("Expected a tensor with dimension 2 or 3, found {}".format(smallest_dim))
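
A quick sanity check of the snippet above, using hypothetical hand-built tensors for the example sentence (the token ids are made up; the ner_tag row is all zeros because every tag is 'NONE', which maps to index 0 in the non-padded tag namespace):

import torch

text_field_tensors = {
    "tokens": torch.tensor([[4, 7, 9, 12, 5]]),
    "ner_tag": torch.tensor([[0, 0, 0, 0, 0]]),
}
# Summing the tensors lets the token ids drive the mask, so we get all ones
# even though the ner_tag tensor alone would produce an all-zero mask.
print(get_text_field_mask(text_field_tensors))  # tensor([[1, 1, 1, 1, 1]])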
WrRan commented 6 years ago

By the way, running the tokenizer with taggers enabled makes data processing very slow. And since it is currently impossible in allennlp to cache pre-processed data (which may be related to #1887), I have to wait a long time every time I run allennlp train. So painful. :sob:

joelgrus commented 6 years ago

it's maybe not ideal, but it shouldn't be hard to write a script that preprocesses your data once into some simple format like

The###POS_TAG###NER_TAG cat###POS_TAG###NER_TAG ate###POS_TAG###NER_TAG
My###POS_TAG###NER_TAG 

(or whatever works for you) and then change your dataset reader to deal with that format, so that you only have to do the tagging once.
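
A minimal one-off tagging script along these lines, assuming spaCy with its small English model (the ### separator and the file names are just illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")

with open("train.txt") as fin, open("train.tagged.txt", "w") as fout:
    for line in fin:
        doc = nlp(line.strip())
        # One sentence per line: token###POS###NER, space-separated;
        # fall back to 'NONE' for spaCy's empty-string entity type.
        fout.write(" ".join("{}###{}###{}".format(t.text, t.tag_, t.ent_type_ or "NONE")
                            for t in doc))
        fout.write("\n")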

matt-gardner commented 6 years ago

Yes, we do this for a few of our dataset readers: https://github.com/allenai/allennlp/blob/3e2d7959efa1704f8b22ed0601b8d72eed52937d/allennlp/data/dataset_readers/semantic_parsing/wikitables/wikitables.py#L34-L41
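
On the reading side, a dataset reader can then rebuild pre-tagged Tokens without re-running spaCy. A rough sketch of the parsing step for the hypothetical format above (parse_tagged_line is illustrative, not part of allennlp):

from typing import List

from allennlp.data.tokenizers import Token

def parse_tagged_line(line: str) -> List[Token]:
    # Rebuild tokens carrying tag_ and ent_type_, which is what the
    # (fine-grained) PosTagIndexer and the NerTagIndexer read.
    tokens = []
    for piece in line.strip().split(" "):
        text, pos, ner = piece.split("###")
        tokens.append(Token(text=text, tag_=pos, ent_type_=ner))
    return tokens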

WrRan commented 6 years ago

Thanks to @matt-gardner and @joelgrus. This helps me a lot. :+1:

SparkJiao commented 5 years ago

Has this problem been solved? @WrRan Thank you

WrRan commented 5 years ago

Yes, I think so. @SparkJiao If you run into any problems, please let me know.

SparkJiao commented 5 years ago

@WrRan Well, I believe you have fixed the bug, but I found that the code in pos_tag_indexer.py is the same as before after 'pip install --upgrade allennlp'. So maybe the fix has not been released yet?

WrRan commented 5 years ago

I have no idea about the release schedule. @matt-gardner may offer some suggestions.

SparkJiao commented 5 years ago

@WrRan I have reinstalled allennlp from GitHub. I really appreciate your and AllenAI's contributions. I think the fix has solved my problem. Thank you very much.