Here is the evaluation function I modified from `train_parser_and_tagger.py`:

```python
# Imports needed by this function (the cached_path import path may differ
# slightly in your checkout of the repo).
from timeit import default_timer as timer

from wasabi import Printer
from spacy import util
from spacy.gold import GoldCorpus
from scispacy.file_cache import cached_path


def evaluate_parser_and_tagger(train_json_path: str,
                               dev_json_path: str,
                               test_json_path: str,
                               model_path: str = None,
                               ontonotes_path: str = None):
    msg = Printer()
    train_json_path = cached_path(train_json_path)
    dev_json_path = cached_path(dev_json_path)
    test_json_path = cached_path(test_json_path)
    train_corpus = GoldCorpus(train_json_path, dev_json_path)
    test_corpus = GoldCorpus(train_json_path, test_json_path)

    nlp_loaded = util.load_model_from_path(model_path)

    # Evaluate the loaded model on the test portion of the corpus.
    start_time = timer()
    test_docs = test_corpus.dev_docs(nlp_loaded)
    test_docs = list(test_docs)
    nwords = sum(len(doc_gold[0]) for doc_gold in test_docs)
    scorer = nlp_loaded.evaluate(test_docs)
    end_time = timer()

    gpu_wps = None
    cpu_wps = nwords / (end_time - start_time)

    print("Retrained genia evaluation")
    print("Test results:")
    print("UAS:", scorer.uas)
    print("LAS:", scorer.las)
    print("Tag %:", scorer.tags_acc)
    print("Token acc:", scorer.token_acc)
```
Hello,
Sorry about this!
Looks like you can reproduce the POS and UAS scores, so that's good.
For LAS, this is definitely some kind of bug, because when I evaluate the trained models I get a LAS of 80.57 for both dev and test, which is different again from yours. Could you please try running `spacy evaluate en_core_sci_md /path/to/dev/test` and let me know what the results are? I want to know whether there is a difference between the script above and the official eval command, because that would point to a bug somewhere in the evaluation procedure in spacy.
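(For reference, the same official evaluation can also be run from Python; a minimal sketch assuming the spacy 2.1 `spacy.cli.evaluate` helper, with a placeholder data path:)

```python
# Minimal sketch: programmatic equivalent of the `spacy evaluate` CLI,
# assuming the spacy 2.1 cli helper; the data path is a placeholder.
from spacy.cli import evaluate

evaluate("en_core_sci_md", "/path/to/dev_or_test.json")
```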
I am currently at ACL, so I won't be able to investigate this today, but rest assured that the results in the paper are robust. (They don't match the ones in the repo because we are continuously updating them: training the models is a probabilistic process and also pulls in upstream improvements to the models from spacy, so they are not going to stay tied to the exact scores in the paper.) This is just some bug :)
Hello,
Thank you for your prompt reply! I tried running `spacy evaluate en_core_sci_sm/md /path/to/dev/test` and got the following results. It seems the LAS score is almost the same as the score reported in the GitHub repo, but the POS and UAS scores are much higher than those from the provided evaluation script.
```
------------en_core_sci_sm------------
Time      332.79 s
Words     446014
Words/s   1340
TOK       100.00
POS       99.47
UAS       92.51
LAS       87.33
NER P     0.00
NER R     0.00
NER F     0.00

------------en_core_sci_md------------
Time      78.16 s
Words     446014
Words/s   5706
TOK       100.00
POS       99.55
UAS       93.40
LAS       88.10
NER P     0.00
NER R     0.00
NER F     0.00
```
Hmm, this is quite strange, because I could actually reproduce your first bug - could you provide some information on exactly which spacy version etc. you have installed? I think it's likely that these numbers are incorrect somehow, but it's confusing that they are now higher when previously the scores were lower.
Ideally, if you could verify this in a completely fresh environment and provide me with the exact steps you ran to install everything, I can try to investigate why this is happening. Initially, I thought it was to do with spacy 2.0: in that version of spacy, parameters are shared between the NER and parsing models, because it assumes they are trained jointly. This means that our pipeline approach, where we first train the tagger and parser, and then the NER afterwards, might cause problems.
@yuhui-zh15 any chance you can let me know some more details about the evaluation you ran above? I think it would help me to figure this out :)
Hello,
I simply ran the following commands:

```
pip install spacy
spacy evaluate en_core_sci_sm /path/to/data
spacy evaluate en_core_sci_md /path/to/data
```

`en_core_sci_sm`, `en_core_sci_md`, and `/path/to/data` are all officially provided by your repo. The `spacy` version is 2.1.6.
Hmm ok thanks!
@honnibal - i'd love your opinion here.
Scispacy's NER data is separate from the tagging and parsing data. Because of this, we first train a pipeline with a tagger and parser, and then use this script to add in an NER pipe.
We didn't notice this until now, but when we upgraded from `spacy==2.0.18` to `spacy==2.1`, we observed that after training the NER model, the Labelled Attachment Score of the parser drops by about 4%. I think this might be related to the 2.1 release sharing more weights between the parser and NER models - is that right?
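Concretely, the setup looks roughly like this (a minimal sketch against the spacy 2.1 API; the model name and NER label are placeholders rather than our actual training script):

```python
import spacy

# Placeholder sketch of the two-stage pipeline: a model whose tagger and
# parser are already trained, to which an NER pipe is then added.
nlp = spacy.load("en_core_sci_sm")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
ner.add_label("ENTITY")  # placeholder label

# Calling begin_training() without disabling the other pipes appears to
# disturb the trained parser weights, which would be consistent with the
# LAS drop described above.
optimizer = nlp.begin_training()
nlp.to_disk("/path/to/model_with_ner")
```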
A couple of investigations we've done:

- Calling `nlp.begin_training()` and writing the new model to disk results in the same decrease in LAS performance.
- It's also a bit weird to me that it's only the LAS that's affected, rather than the whole parser.
It's reasonable if the answer to this is "they need to be trained together". We've been thinking of using a strong BERT model trained on med mentions to annotate generic mention spans in the GENIA corpus anyway, so it's possible that we can just accelerate this.
Seems that we found the issue - see #146 for more detail. @yuhui-zh15 Thank you for discovering this issue!
@honnibal I do still think that this only became a problem with spacy 2.1.x, and find it a bit odd that only LAS was affected, but the issue does seem to be resolved by disabling the parser/tagger pipes when calling `nlp.begin_training()`.
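For anyone running into the same thing, a minimal sketch of that workaround (spacy 2.1 API; the model name and label below are placeholders):

```python
import spacy

# Minimal sketch of the workaround: temporarily disable every pipe except
# NER so that begin_training() leaves the trained tagger/parser weights
# untouched.
nlp = spacy.load("en_core_sci_sm")  # placeholder model name
if "ner" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("ner"), last=True)
nlp.get_pipe("ner").add_label("ENTITY")  # placeholder label

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # ... run the NER training loop here ...
```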
Hello,
I'm trying to reproduce the tagging and parsing results on the GENIA corpus. I downloaded the officially released models (`en_core_sci_sm-0.2.0` and `en_core_sci_md-0.2.0`) and the officially released GENIA corpus (`train/dev/test.json`). I modified the scripts `parser.sh` and `train_parser_and_tagger.py`, and used them to evaluate the models. However, there seem to be large differences between the results reported in the paper, the results reported in the GitHub repo (`docs/index.md`), and my reproduced results. The numbers are POS, UAS, LAS, respectively.
Could you please check your results? Thanks a lot for your help!
Sincerely, Yuhui