coli-saar / am-parser

Modular implementation of an AM dependency parser in AllenNLP.
Apache License 2.0

Raw text AMR parser error #78

Closed · RikVN closed this 4 years ago

RikVN commented 4 years ago

Hi all, first of all, great repo with good documentation. However, I couldn't get the raw text parser for AMR to work.

The first bug I encountered was in spacy_interface.py, in the lemma_postprocess function: where it says "return lemma_dict[lemma]", it should (I think) be "return lemma_dict[token.lower()]".
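For context, a minimal sketch of what I believe the fix looks like; the actual lemma_postprocess in spacy_interface.py may have a different signature and build its lemma_dict elsewhere, so the entries below are placeholders only:

```python
# Sketch of the suggested fix; the real lemma_postprocess in spacy_interface.py
# may differ, and lemma_dict is shown with placeholder entries only.
lemma_dict = {"n't": "not", "ca": "can"}  # placeholder entries

def lemma_postprocess(token: str, lemma: str) -> str:
    """Overrule spaCy's lemma for hand-listed tokens, otherwise keep it."""
    if token.lower() in lemma_dict:
        # The original code indexed with `lemma` here; the key should be the
        # lowercased token, matching the condition above.
        return lemma_dict[token.lower()]
    return lemma
```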

Then I had a different error: the model errors out because it encounters tokens (NE labels such as CARDINAL, EVENT, PRODUCT, LAW) that are not in the vocab and are also not handled by ne_dict/ne_postprocess in spacy_interface.
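One way to make this more robust (just a sketch, not how spacy_interface is actually written; the mapping below is an illustrative subset) would be a catch-all default in the NE mapping, so labels the model never saw are collapsed to a neutral tag before they ever reach the vocabulary lookup:

```python
# Sketch of a defensive fallback; the real ne_dict / ne_postprocess in
# spacy_interface.py differ, and this mapping is an illustrative subset.
ne_dict = {"PERSON": "person", "ORG": "organization", "GPE": "country"}

def ne_postprocess(ner_label: str) -> str:
    """Map a spaCy NER label to a tag the model knows, with a catch-all default."""
    # CARDINAL, EVENT, PRODUCT, LAW etc. fall through to "O" instead of being
    # passed on as out-of-vocabulary NE labels.
    return ne_dict.get(ner_label, "O")
```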

Maybe this has to do with a different version of the models? I'm using the one from https://coli-saar-data.s3.eu-central-1.amazonaws.com/raw_text_model.tar.gz.

Thanks for looking into this.

EDIT: trace for parsing the AMR dev set (tokenized sentences only)

```
python3 parse_raw_text.py downloaded_models/raw_text_model.tar.gz AMR-2017 AMR/dev.tok example//AMR-2017.amconll --cuda-device 0
```

```
Either spacy pytorch transformers or cupy not available, so you cannot use spacy-tok2vec! This is only an issue, if you intend to use roberta or xlnet.
0it [00:00, ?it/s]Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for non_padded_namespaces parameter in Vocabulary.
Your label namespace was 'lemmas'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for non_padded_namespaces parameter in Vocabulary.
1368it [00:00, 4047.18it/s]
Namespace: ner_labels
Token: CARDINAL
Traceback (most recent call last):
  File "parse_raw_text.py", line 146, in <module>
    predictor.parse_and_save(args.formalism, temp_path, args.output_file)
  File "/project/rvannoord/am-parser/graph_dependency_parser/components/evaluation/predictors.py", line 132, in parse_and_save
    predictions = self.dataset_reader.restore_order(forward_on_instances(self.model, instances,self.data_iterator))
  File "/project/rvannoord/am-parser/graph_dependency_parser/components/evaluation/iterator.py", line 45, in forward_on_instances
    dataset.index_instances(model.vocab)
  File "/project/rvannoord/anaconda3/envs/saarland/lib/python3.7/site-packages/allennlp/data/dataset.py", line 155, in index_instances
    instance.index_fields(vocab)
  File "/project/rvannoord/anaconda3/envs/saarland/lib/python3.7/site-packages/allennlp/data/instance.py", line 72, in index_fields
    field.index(vocab)
  File "/project/rvannoord/anaconda3/envs/saarland/lib/python3.7/site-packages/allennlp/data/fields/sequence_label_field.py", line 98, in index
    for label in self.labels]
  File "/project/rvannoord/anaconda3/envs/saarland/lib/python3.7/site-packages/allennlp/data/fields/sequence_label_field.py", line 98, in <listcomp>
    for label in self.labels]
  File "/project/rvannoord/anaconda3/envs/saarland/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 630, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'
```
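If I read the traceback correctly, the KeyError comes from AllenNLP's vocabulary lookup: since ner_labels ends in "labels", it is a non-padded namespace with no @@UNKNOWN@@ fallback, so any NE label the model never saw crashes. A minimal sketch that reproduces this, assuming the AllenNLP 0.x API this repo uses:

```python
# Minimal sketch reproducing the KeyError, assuming the AllenNLP 0.x API.
from allennlp.data.vocabulary import Vocabulary

# '*labels' namespaces are non-padded by default, so 'ner_labels' gets no
# @@PADDING@@/@@UNKNOWN@@ entries when the vocabulary is built.
vocab = Vocabulary()
vocab.add_token_to_namespace("PERSON", namespace="ner_labels")

# A label the model never saw falls back to the OOV token, which does not
# exist in a non-padded namespace -> KeyError: '@@UNKNOWN@@'
vocab.get_token_index("CARDINAL", namespace="ner_labels")
```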

namednil commented 4 years ago

Hi Rik, sorry for the late response. I forgot to adjust my notifications. I fixed the first bug you mentioned, and the second one should not give you headaches anymore either. In fact, that was a half-heartedly implemented feature, and you would have needed to supply the --extend-vocab option.

Are you aware that there is scripts/predict_from_raw_text.sh, which basically executes this command and applies the necessary post-processing so you get the AMR graph in one go? That saves you from evaluating the AM dependency tree to the graph and then calling the post-processing scripts yourself.

RikVN commented 4 years ago

Thanks, it works now! (Yes, I knew about the raw-text script; I had just isolated the problem for myself a bit more and forgot that I started with that script.)