Here is the evaluation function I modified from `train_parser_and_tagger.py`:

```python
# Imports needed by this function (the cached_path import path may differ
# slightly in your checkout of the repo).
from timeit import default_timer as timer

from wasabi import Printer
from spacy import util
from spacy.gold import GoldCorpus
from scispacy.file_cache import cached_path


def evaluate_parser_and_tagger(train_json_path: str,
                               dev_json_path: str,
                               test_json_path: str,
                               model_path: str = None,
                               ontonotes_path: str = None):
    msg = Printer()
    train_json_path = cached_path(train_json_path)
    dev_json_path = cached_path(dev_json_path)
    test_json_path = cached_path(test_json_path)
    train_corpus = GoldCorpus(train_json_path, dev_json_path)
    test_corpus = GoldCorpus(train_json_path, test_json_path)

    nlp_loaded = util.load_model_from_path(model_path)

    # Evaluate the loaded model on the test portion of the corpus.
    start_time = timer()
    test_docs = test_corpus.dev_docs(nlp_loaded)
    test_docs = list(test_docs)
    nwords = sum(len(doc_gold[0]) for doc_gold in test_docs)
    scorer = nlp_loaded.evaluate(test_docs)
    end_time = timer()

    gpu_wps = None
    cpu_wps = nwords / (end_time - start_time)

    print("Retrained genia evaluation")
    print("Test results:")
    print("UAS:", scorer.uas)
    print("LAS:", scorer.las)
    print("Tag %:", scorer.tags_acc)
    print("Token acc:", scorer.token_acc)
```
Hello,
Sorry about this!
Looks like you can reproduce the POS and UAS scores, so that's good.
For LAS, this is definitely some kind of bug, because when I evaluate the trained models I get a LAS of 80.57 for both dev and test, which is different again from yours. Could you please try running `spacy evaluate en_core_sci_md /path/to/dev/test` and let me know what the results are? I want to know whether there is a difference between the script above and the official eval command, because that would point to a bug somewhere in the evaluation procedure in spacy.
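(For reference, the same official evaluation can also be run from Python; a minimal sketch assuming the spacy 2.1 `spacy.cli.evaluate` helper, with a placeholder data path:)

```python
# Minimal sketch: programmatic equivalent of the `spacy evaluate` CLI,
# assuming the spacy 2.1 cli helper; the data path is a placeholder.
from spacy.cli import evaluate

evaluate("en_core_sci_md", "/path/to/dev_or_test.json")
```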
I am currently at ACL, so I won't be able to investigate this today, but rest assured that the results in the paper are robust. (They don't match the ones in the repo because we are continuously updating them: training the models is a probabilistic process and also pulls in upstream improvements to the models from spacy, so they are not going to stay tied to the exact scores in the paper.) This is just some bug :)
Hello,
Thank you for your prompt reply! I tried running `spacy evaluate en_core_sci_sm/md /path/to/dev/test` and got the following results. It seems the LAS score is almost the same as the score reported in the GitHub repo, but the POS and UAS scores are much higher than those from the provided evaluation script.
```
------------en_core_sci_sm------------
Time      332.79 s
Words     446014
Words/s   1340
TOK       100.00
POS       99.47
UAS       92.51
LAS       87.33
NER P     0.00
NER R     0.00
NER F     0.00

------------en_core_sci_md------------
Time      78.16 s
Words     446014
Words/s   5706
TOK       100.00
POS       99.55
UAS       93.40
LAS       88.10
NER P     0.00
NER R     0.00
NER F     0.00
```
Hmm, this is quite strange, because I could actually reproduce your first bug - could you provide some information on exactly which spacy version etc. you have installed? I think it's likely that these numbers are incorrect somehow, but it's confusing that they are now higher when previously the scores were lower.
Ideally, if you could verify this in a completely fresh environment and provide me with the exact steps you ran to install everything, I can try to investigate why this is happening. Initially, I thought it was to do with spacy 2.0: in that version of spacy, parameters are shared between the NER and parsing models, because it assumes they are trained jointly. This means that our pipeline approach, where we first train the tagger and parser, and then the NER afterwards, might cause problems.
@yuhui-zh15 any chance you can let me know some more details about the evaluation you ran above? I think it would help me to figure this out :)
Hello,
I simply ran the following commands:

```
pip install spacy
spacy evaluate en_core_sci_sm /path/to/data
spacy evaluate en_core_sci_md /path/to/data
```

`en_core_sci_sm`, `en_core_sci_md`, and `/path/to/data` are all officially provided by your repo. The `spacy` version is 2.1.6.
Hmm ok thanks!
@honnibal - i'd love your opinion here.
Scispacy's NER data is separate from the tagging and parsing data. Because of this, we first train a pipeline with a tagger and parser, and then use this script to add in an NER pipe.
We didn't notice this until now, but when we upgraded from `spacy==2.0.18` to `spacy==2.1`, we observed that after training the NER model, the Labelled Attachment Score of the parser drops by about 4%. I think this might be related to the 2.1 release sharing more weights between the parser and NER models - is that right?
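Concretely, the setup looks roughly like this (a minimal sketch against the spacy 2.1 API; the model name and NER label are placeholders rather than our actual training script):

```python
import spacy

# Placeholder sketch of the two-stage pipeline: a model whose tagger and
# parser are already trained, to which an NER pipe is then added.
nlp = spacy.load("en_core_sci_sm")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
ner.add_label("ENTITY")  # placeholder label

# Calling begin_training() without disabling the other pipes appears to
# disturb the trained parser weights, which would be consistent with the
# LAS drop described above.
optimizer = nlp.begin_training()
nlp.to_disk("/path/to/model_with_ner")
```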
A couple of investigations we've done:

- Calling `nlp.begin_training()` and writing the new model to disk results in the same decrease in LAS performance.
- It's also a bit weird to me that it's only the LAS that's affected, rather than the whole parser.
It's reasonable if the answer to this is "they need to be trained together". We've been thinking of using a strong BERT model trained on med mentions to annotate generic mention spans in the GENIA corpus anyway, so it's possible that we can just accelerate this.
Seems that we found the issue - see #146 for more detail. @yuhui-zh15 Thank you for discovering this issue!
@honnibal I do still think that this only became a problem with spacy 2.1.x, and find it a bit odd that only LAS was affected, but the issue does seem to be resolved by disabling the parser/tagger pipes when calling `nlp.begin_training()`.
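For anyone running into the same thing, a minimal sketch of that workaround (spacy 2.1 API; the model name and label below are placeholders):

```python
import spacy

# Minimal sketch of the workaround: temporarily disable every pipe except
# NER so that begin_training() leaves the trained tagger/parser weights
# untouched.
nlp = spacy.load("en_core_sci_sm")  # placeholder model name
if "ner" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("ner"), last=True)
nlp.get_pipe("ner").add_label("ENTITY")  # placeholder label

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # ... run the NER training loop here ...
```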
Hello,
I'm trying to reproduce the tagging and parsing results on the GENIA corpus. I downloaded the officially released models (`en_core_sci_sm-0.2.0` and `en_core_sci_md-0.2.0`) and the officially released GENIA corpus (`train/dev/test.json`). I modified the scripts `parser.sh` and `train_parser_and_tagger.py`, and used them to evaluate the models. However, there seem to be large differences between the results reported in the paper, the results reported in the GitHub repo (`docs/index.md`), and my reproduced results. The numbers are POS, UAS, LAS, respectively.
Could you please check your results? Thanks a lot for your help!
Sincerely, Yuhui