If you want to use POS and NER tags, you need to be sure you also modify the tokenizer to produce these tags. The default tokenizer for BiDAF only splits words, it doesn't run any tagger on them (that would dramatically slow down the data processing).
The particular errors that you give are hard to diagnose, however, as you're not giving a complete stack trace. If you give us more detail, we'd be able to help you better.
Sorry for my carelessness. I forgot to provide the tokenizer config.
For convenience, the full configuration is as follows:
{
"dataset_reader": {
"type": "squad",
"tokenizer": {
"type": "word",
"word_splitter": {
"type": "spacy",
"pos_tags": true,
"ner": true
}
},
"token_indexers": {
"tokens": {
"type": "single_id",
"lowercase_tokens": true
},
"token_characters": {
"type": "characters",
"character_tokenizer": {
"byte_encoding": "utf-8",
"start_tokens": [259],
"end_tokens": [260]
}
},
"pos_tag": {
"type": "pos_tag"
},
"ner_tag": {
"type": "ner_tag"
}
}
},
// To speed up data processing while experimenting, you can point these paths at a small test dataset instead of the default SQuAD files.
"train_data_path": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/squad/squad-train-v1.1.json",
"validation_data_path": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/squad/squad-dev-v1.1.json",
"model": {
"type": "bidaf",
"text_field_embedder": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": 100,
"trainable": false
},
"token_characters": {
"type": "character_encoding",
"embedding": {
"num_embeddings": 262,
"embedding_dim": 16
},
"encoder": {
"type": "cnn",
"embedding_dim": 16,
"num_filters": 100,
"ngram_filter_sizes": [5]
},
"dropout": 0.2
}
},
"allow_unmatched_keys": true
},
"num_highway_layers": 2,
"phrase_layer": {
"type": "lstm",
"bidirectional": true,
"input_size": 200,
"hidden_size": 100,
"num_layers": 1,
"dropout": 0.2
},
"similarity_function": {
"type": "linear",
"combination": "x,y,x*y",
"tensor_1_dim": 200,
"tensor_2_dim": 200
},
"modeling_layer": {
"type": "lstm",
"bidirectional": true,
"input_size": 800,
"hidden_size": 100,
"num_layers": 2,
"dropout": 0.2
},
"span_end_encoder": {
"type": "lstm",
"bidirectional": true,
"input_size": 1400,
"hidden_size": 100,
"num_layers": 1,
"dropout": 0.2
},
"dropout": 0.2
},
"iterator": {
"type": "bucket",
"sorting_keys": [["passage", "num_tokens"], ["question", "num_tokens"]],
"batch_size": 40
},
"trainer": {
"num_epochs": 20,
"grad_norm": 5.0,
"patience": 10,
"validation_metric": "+em",
"cuda_device": -1,
"learning_rate_scheduler": {
"type": "reduce_on_plateau",
"factor": 0.5,
"mode": "max",
"patience": 2
},
"optimizer": {
"type": "adam",
"betas": [0.9, 0.9]
}
}
}
Ok, good to see that you updated the tokenizer. We're still going to need a complete stack trace to have any hope of helping you.
Thanks for your reply and concern. :smiley:
When running the command allennlp train config/bidaf.json -s data/bidaf, I got the following exception:
Traceback (most recent call last):
File "/home/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/run.py", line 18, in <module>
main(prog="allennlp")
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 70, in main
args.func(args)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/train.py", line 101, in train_model_from_args
args.recover)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/train.py", line 131, in train_model_from_file
return train_model(params, serialization_dir, file_friendly_logging, recover)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/commands/train.py", line 321, in train_model
metrics = trainer.train()
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/training/trainer.py", line 749, in train
train_metrics = self._train_epoch(epoch)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/training/trainer.py", line 482, in _train_epoch
for batch in train_generator_tqdm:
File "/home/root/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 937, in __iter__
for obj in iterable:
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/iterators/data_iterator.py", line 143, in __call__
for batch in batches:
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/iterators/bucket_iterator.py", line 116, in _create_batches
self._padding_noise)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/iterators/bucket_iterator.py", line 28, in sort_by_padding
instance.index_fields(vocab)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/instance.py", line 60, in index_fields
field.index(vocab)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/fields/text_field.py", line 58, in index
token_indices = indexer.tokens_to_indices(self.tokens, vocab, indexer_name)
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/token_indexers/pos_tag_indexer.py", line 64, in tokens_to_indices
return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/token_indexers/pos_tag_indexer.py", line 64, in <listcomp>
return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}
File "/home/root/anaconda3/lib/python3.6/site-packages/allennlp/data/vocabulary.py", line 591, in get_token_index
return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'
I found that this exception is caused by the following code snippet in allennlp/data/token_indexers/pos_tag_indexer.py:
class PosTagIndexer(TokenIndexer[int]):
    def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
        if self._coarse_tags:
            tag = token.pos_
        else:
            tag = token.tag_
        if not tag:
            if token.text not in self._logged_errors:
                logger.warning("Token had no POS tag: %s", token.text)
                self._logged_errors.add(token.text)
            tag = 'NONE'
        counter[self._namespace][tag] += 1

    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        tags: List[str] = []
        for token in tokens:
            if self._coarse_tags:
                tag = token.pos_
            else:
                tag = token.tag_
            if tag is None:
                tag = 'NONE'
            tags.append(tag)
        return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}

    # other code here...
In this code snippet, count_vocab_items treats any falsy tag (not tag, which covers both None and the empty string '') as 'NONE'; however, tokens_to_indices only treats tag is None as 'NONE', which does not cover the empty string ''.
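To make the mismatch concrete, here is a tiny sketch in plain Python (no AllenNLP needed) of why the two methods disagree when spaCy yields an empty tag string:

tag = ''            # what spaCy can yield for an untagged token

print(not tag)      # True  -> count_vocab_items replaces it with 'NONE'
print(tag is None)  # False -> tokens_to_indices keeps '' and looks it up;
                    # the lookup misses, falls back to the OOV token
                    # '@@UNKNOWN@@', which the tag namespace does not contain,
                    # hence the KeyError in the traceback above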
Based on the above observation, the fix for this bug is:
class PosTagIndexer(TokenIndexer[int]):
    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        tags: List[str] = []
        for token in tokens:
            if self._coarse_tags:
                tag = token.pos_
            else:
                tag = token.tag_
            # instead of `if tag is None:`
            if not tag:
                tag = 'NONE'
            tags.append(tag)
        return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}

    # other code here...
Similarly, the same modification should be applied to class NerTagIndexer:
class NerTagIndexer(TokenIndexer[int]):
    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        # instead of `tags = ['NONE' if token.ent_type_ is None else token.ent_type_ for token in tokens]`
        tags = ['NONE' if not token.ent_type_ else token.ent_type_ for token in tokens]
        return {index_name: [vocabulary.get_token_index(tag, self._namespace) for tag in tags]}
However, there is still a subtle bug in the method get_text_field_mask in allennlp/nn/util.py.
Taking the sentence "The allennlp is awesome." as an example, we get the following tokens:
# text/pos_tag/ner_tag
The/DET/NONE
allennlp/NOUN/NONE
is/VERB/NONE
awesome/ADJ/NONE
./PUNCT/NONE
In the method get_text_field_mask, the mask is computed from just one of the tensors in the text field. When that tensor happens to be ner_tag (all 'NONE' here), we get the wrong mask: [0.0, 0.0, 0.0, 0.0, 0.0].
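A minimal sketch of the failure mode, assuming the 'NONE' tag is indexed as 0 in its namespace (the word ids below are made up for illustration):

import torch

tokens  = torch.tensor([[12, 7, 4, 9, 3]])  # word ids for "The allennlp is awesome ."
ner_tag = torch.tensor([[0, 0, 0, 0, 0]])   # every token tagged 'NONE'

# Building the mask from the ner_tag tensor alone masks out the whole sentence:
print((ner_tag != 0).long())                # tensor([[0, 0, 0, 0, 0]])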
I think the following code snippet could be helpful for this issue:
def get_text_field_mask(text_field_tensors: Dict[str, torch.Tensor],
                        num_wrapping_dims: int = 0) -> torch.LongTensor:
    if "mask" in text_field_tensors:
        return text_field_tensors["mask"]
    tensor_dims = [(tensor.dim(), tensor) for tensor in text_field_tensors.values()]
    tensor_dims.sort(key=lambda x: x[0])
    smallest_dim = tensor_dims[0][0] - num_wrapping_dims
    if smallest_dim == 2:
        token_tensor = tensor_dims[0][1]
        for ix in range(1, len(tensor_dims)):
            if smallest_dim != tensor_dims[ix][0] - num_wrapping_dims:
                break
            token_tensor += tensor_dims[ix][1]
        return (token_tensor != 0).long()
    elif smallest_dim == 3:
        character_tensor = tensor_dims[0][1]
        for ix in range(1, len(tensor_dims)):
            if smallest_dim != tensor_dims[ix][0] - num_wrapping_dims:
                break
            character_tensor += tensor_dims[ix][1]
        return ((character_tensor > 0).long().sum(dim=-1) > 0).long()
    else:
        raise ValueError("Expected a tensor with dimension 2 or 3, found {}".format(smallest_dim))
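Continuing the toy example from above, summing the same-dimensional tensors before thresholding recovers the correct mask even when one of them is all zeros. This relies on the assumption that non-padding positions always have a nonzero word id, so only padding sums to zero:

import torch

tokens  = torch.tensor([[12, 7, 4, 9, 3]])
ner_tag = torch.tensor([[0, 0, 0, 0, 0]])

print(((tokens + ner_tag) != 0).long())     # tensor([[1, 1, 1, 1, 1]])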
By the way, tokenizing with the taggers enabled makes data processing very slow. And since it is currently impossible in allennlp to cache pre-processed data (which may be related to #1887), I have to wait a long time every time I run allennlp train.
So painful. :sob:
It's maybe not ideal, but it shouldn't be hard to write a script that preprocesses your data once into some simple format like
The###POS_TAG###NER_TAG cat###POS_TAG###NER_TAG ate###POS_TAG###NER_TAG
My###POS_TAG###NER_TAG
(or whatever works for you) and then change your dataset reader to handle that format, so that you only have to do the tagging once.
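A minimal sketch of such a one-time preprocessing script; the file names and the spaCy model name here are assumptions for illustration:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any tagger/NER pipeline works

with open("passages.txt") as fin, open("passages.tagged.txt", "w") as fout:
    for line in fin:
        doc = nlp(line.strip())
        tagged = " ".join(
            "{}###{}###{}".format(tok.text, tok.tag_ or "NONE", tok.ent_type_ or "NONE")
            for tok in doc
        )
        fout.write(tagged + "\n")

The dataset reader then only needs to split each token on "###" instead of re-running the tagger for every training run.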
Yes, we do this for a few of our dataset readers: https://github.com/allenai/allennlp/blob/3e2d7959efa1704f8b22ed0601b8d72eed52937d/allennlp/data/dataset_readers/semantic_parsing/wikitables/wikitables.py#L34-L41
Thanks to @matt-gardner and @joelgrus. This helps me a lot. :+1:
Has this problem been solved? @WrRan Thank you
Yes. I think so. @SparkJiao If you have problems, please let me know.
@WrRan Well, I believe you have fixed the bug, but I found that the code in pos_tag_indexer.py is the same as before after 'pip install --upgrade allennlp'. So maybe the fixed version has not been released yet?
I have no idea about this issue. @matt-gardner may offer some suggestions.
@WrRan I have reinstalled allennlp from GitHub. I really appreciate your and AllenAI's contributions. The fix has solved my problem. Thank you very much.
Describe the bug
When I train bidaf with pos_tag or ner_tag, it fails to begin training.
To Reproduce
Steps to reproduce the behavior:
1. Copy training_config/bidaf.jsonnet to a custom directory.
2. Modify bidaf.jsonnet as follows: a) change the token_indexers of the dataset_reader; b) change the text_field_embedder of the model.
3. Run allennlp train config/bidaf.jsonnet -s data/bidaf or ...
Expected behavior
No error and the same behaviour as before.
System (please complete the following information):
Additional context
I find there are three code snippets that may be related to this bug. I am trying to fix it.