allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Cannot reproduce tagging and parsing results #140

Closed · yuhui-zh15 closed 5 years ago

yuhui-zh15 commented 5 years ago

Hello,

I'm trying to reproduce the tagging and parsing results on the GENIA corpus. I downloaded the officially released models (en_core_sci_sm-0.2.0 and en_core_sci_md-0.2.0) and the officially released GENIA corpus (train/dev/test.json). I modified the scripts parser.sh and train_parser_and_tagger.py and used them to evaluate the models. However, there seem to be large differences between the results reported in the paper, the results reported in the GitHub repo, and my reproduced results.

The numbers are POS, UAS, LAS, respectively.

Could you please check your results? Thanks a lot for your help!

Sincerely, Yuhui

yuhui-zh15 commented 5 years ago

Here is the evaluation function I modified from train_parser_and_tagger.py:

from timeit import default_timer as timer

from spacy import util
from spacy.gold import GoldCorpus
from wasabi import Printer

# cached_path as used by the scispacy training scripts
from scispacy.file_cache import cached_path


def evaluate_parser_and_tagger(train_json_path: str,
                               dev_json_path: str,
                               test_json_path: str,
                               model_path: str = None,
                               ontonotes_path: str = None):
    msg = Printer()

    train_json_path = cached_path(train_json_path)
    dev_json_path = cached_path(dev_json_path)
    test_json_path = cached_path(test_json_path)

    train_corpus = GoldCorpus(train_json_path, dev_json_path)
    test_corpus = GoldCorpus(train_json_path, test_json_path)

    # Load the released model and score it on the GENIA test set.
    nlp_loaded = util.load_model_from_path(model_path)
    start_time = timer()
    test_docs = test_corpus.dev_docs(nlp_loaded)
    test_docs = list(test_docs)
    nwords = sum(len(doc_gold[0]) for doc_gold in test_docs)
    scorer = nlp_loaded.evaluate(test_docs)
    end_time = timer()
    gpu_wps = None
    cpu_wps = nwords / (end_time - start_time)

    print("Retrained genia evaluation")
    print("Test results:")
    print("UAS:", scorer.uas)
    print("LAS:", scorer.las)
    print("Tag %:", scorer.tags_acc)
    print("Token acc:", scorer.token_acc)
DeNeutoy commented 5 years ago

Hello,

Sorry about this!

Looks like you can reproduce the POS and UAS scores, so that's good.

For LAS, this is definitely some kind of bug, because when I evaluate the trained models I get an LAS of 80.57 for both dev and test, which is different again from your numbers. Could you please try running spacy evaluate en_core_sci_md /path/to/dev/test and let me know what the results are? I want to know whether there is a difference between the script above and the official eval command, because that would point to a bug somewhere in the evaluation procedure in spacy.

I am currently at ACL, so I won't be able to investigate this today, but rest assured that the results in the paper are robust. They don't match the ones in the repo because we are continuously updating the models: training is a probabilistic process and also pulls in upstream improvements from spacy itself, so the scores are not going to stay tied to the exact numbers in the paper. This is just some bug :)

yuhui-zh15 commented 5 years ago

Hello,

Thank you for your prompt reply! I ran spacy evaluate en_core_sci_sm/md /path/to/dev/test and got the following results. The LAS scores are almost the same as those reported in the GitHub repo, but the POS and UAS scores are much higher than those from the evaluation script above.

------------en_core_sci_sm------------
Time      332.79 s
Words     446014
Words/s   1340
TOK       100.00
POS       99.47
UAS       92.51
LAS       87.33
NER P     0.00
NER R     0.00
NER F     0.00

------------en_core_sci_md------------
Time      78.16 s
Words     446014
Words/s   5706
TOK       100.00
POS       99.55
UAS       93.40
LAS       88.10
NER P     0.00
NER R     0.00
NER F     0.00
DeNeutoy commented 5 years ago

Hmm this is quite strange, because I could actually reproduce your first bug - could you provide some information on exactly the spacy version etc you have installed? I think it's likely that these numbers are incorrect somehow, but it's confusing that they are now higher when previously the scores were lower.

Ideally, if you could verify this in a completely fresh environment and provide me with the exact steps you ran to install everything, I can try to investigate why this is happening. Initially, I thought it was to do with spacy 2.0 - in this version, spacy shares parameters between the NER and parsing models, because it assumes that they are trained jointly. This means that our pipeline approach, where we first train the tagger and parser and then the NER afterwards, might cause some problem.

DeNeutoy commented 5 years ago

@yuhui-zh15 any chance you can let me know some more details about the evaluation you ran above? I think it would help me to figure this out :)

yuhui-zh15 commented 5 years ago

Hello,

I simply ran the following commands:

pip install spacy
spacy evaluate en_core_sci_sm /path/to/data
spacy evaluate en_core_sci_md /path/to/data

en_core_sci_sm, en_core_sci_md and /path/to/data are all officially provided by your repo.

spacy version is 2.1.6
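
For reference, the same evaluation can also be run from Python. A sketch, assuming spaCy 2.1's spacy.cli.evaluate API and a placeholder data path:

from spacy.cli import evaluate

# Equivalent of `spacy evaluate en_core_sci_md /path/to/data`; prints the
# same table as above and (with return_scores=True) returns the scores dict.
scores = evaluate("en_core_sci_md", "/path/to/data", return_scores=True)
print(scores["tags_acc"], scores["uas"], scores["las"])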

DeNeutoy commented 5 years ago

Hmm ok thanks!

DeNeutoy commented 5 years ago

@honnibal - I'd love your opinion here.

Scispacy's NER data is separate from the tagging and parsing data. Because of this, we first train a pipeline with a tagger and parser, and then use this script to add in an NER pipe.
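
Roughly, that workflow has the following shape (a sketch assuming spaCy 2.1's pipe API, not the actual scispacy training scripts; paths are placeholders and the training loops are omitted):

import spacy

# Stage 1: train a tagger + parser on the GENIA data, then save the pipeline.
nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("tagger"))
nlp.add_pipe(nlp.create_pipe("parser"))
# ... add labels and run the tagger/parser training loop on GENIA (omitted) ...
nlp.to_disk("/path/to/parser_tagger_model")

# Stage 2: reload the trained pipeline and add an NER pipe, which is then
# trained on the separate NER data.
nlp = spacy.load("/path/to/parser_tagger_model")
nlp.add_pipe(nlp.create_pipe("ner"), last=True)
optimizer = nlp.begin_training()  # the call that later turned out to matter (see below)
# ... NER training loop over the separate NER data (omitted) ...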

We didn't notice this until now, but when we upgraded from spacy==2.0.18 to spacy==2.1, we observed that after training the NER model, the Labelled Attachment Score (LAS) of the parser dropped by about 4%. I think this might be related to the 2.1 release sharing more weights between the parser and NER models - is that right?

A couple of investigations we've done:

It's also a bit weird to me that it's only the LAS that's affected, rather than the whole parser.

It's reasonable if the answer to this is "they need to be trained together". We've been thinking of using a strong BERT model trained on MedMentions to annotate generic mention spans in the GENIA corpus anyway, so it's possible that we can just accelerate this.

dakinggg commented 5 years ago

Seems that we found the issue, see #146 for more detail. @yuhui-zh15 Thank you for discovering this issue!

@honnibal I do still think that this only became a problem with spacy 2.1.x, and find it a bit odd that only LAS was affected, but the issue does seem to be resolved by disabling the parser/tagger pipes when calling nlp.begin_training.
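
For anyone hitting this later, a minimal sketch of that fix, assuming the NER pipe is added on top of an already-trained en_core_sci model (the ENTITY label and the training loop are placeholders, not the exact scispacy script):

import spacy

nlp = spacy.load("en_core_sci_md")  # pretrained tagger + parser
ner = nlp.create_pipe("ner")
ner.add_label("ENTITY")             # placeholder label
nlp.add_pipe(ner, last=True)

# Disable everything except the new NER pipe so that begin_training only
# initialises the NER model and leaves the trained tagger/parser weights alone.
other_pipes = [name for name in nlp.pipe_names if name != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # ... NER training loop (omitted) ...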