coli-saar / am-parser

Modular implementation of an AM dependency parser in AllenNLP.
Apache License 2.0

Improve recall of NER #62

namednil closed this issue 4 years ago

namednil commented 4 years ago

Currently, we use the 4-class NE tagset: https://github.com/coli-saar/am-parser/issues/3#issuecomment-511452032

Since we are now convinced that the companion data + new entity tagger didn't hurt us, we could switch back to the more fine-grained tagset; maybe this helps us a little.

alexanderkoller commented 4 years ago

All places in am-tools that use NER now accept a new command-line option --uiuc-ner-tagset that can take the values NER_CONLL (for the four-class tagset we have used so far; this is still the default) and NER_ONTONOTES. It is probably important that we rerun the entire preprocessing pipeline with consistent NER tags. Reassigning the issue back to you for this, @namednil.

alexanderkoller commented 4 years ago

See here for the NER tags supported by OntoNotes: https://spacy.io/api/annotation#named-entities

namednil commented 4 years ago

Do you have any idea what's going on here?

java -Xmx700G -cp am-tools-all.jar de.saar.coli.amrtagging.formalisms.amr.tools.datascript.MakeDevData -c newNER//data/alto/test/ -o newNER//data/nnData/test/ --companion /proj/irtg/sempardata/mrp/LDC2019E45/2019/companion/test_companion.conllu --uiuc-ner-tagset NER_ONTONOTES >>newNER//data/preprocessLog 2>&1
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Downloading the folder from datastore . . . 
                GroupId: readonly.org.cogcomp.gazetteers
                ArtifactId: 1.6/gazetteers.zip
The target /home/CE/mlinde/.cogcomp-datastore/readonly.org.cogcomp.gazetteers/1.6/gazetteers already exists. Skipping download from the datastore . . . 
Downloading the folder from datastore . . . 
                GroupId: readonly.org.cogcomp.brown-clusters
                ArtifactId: 1.5/brown-clusters.zip
The target /home/CE/mlinde/.cogcomp-datastore/readonly.org.cogcomp.brown-clusters/1.5/brown-clusters already exists. Skipping download from the datastore . . . 
Downloading the folder from datastore . . . 
                GroupId: readonly.edu.illinois.cs.cogcomp.ner
                ArtifactId: 4.0/ner-model-ontonotes-all-data.zip
augmentedGroupId: readonly.edu.illinois.cs.cogcomp.ner
versionedFileName: 4.0/ner-model-ontonotes-all-data.zip
zippedFileName: /home/CE/mlinde/.cogcomp-datastore/tmp/4.0/ner-model-ontonotes-all-data.zip
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level1.lex
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level2.lex
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level1
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level2
Done
zippedFileName: /home/CE/mlinde/.cogcomp-datastore/tmp/4.0/ner-model-ontonotes-all-data.zip
path: /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data
artifactId: ner-model-ontonotes-all-data
Model file read from /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level1
Model file read from /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level2
Exception in thread "main" java.lang.IllegalArgumentException: View NER_CONLL not found
        at edu.illinois.cs.cogcomp.core.datastructures.textannotation.AbstractTextAnnotation.getView(AbstractTextAnnotation.java:134)
        at de.saar.coli.amrtagging.formalisms.amr.tools.preproc.UiucNamedEntityRecognizer.tag(UiucNamedEntityRecognizer.java:65)
        at de.saar.coli.amrtagging.formalisms.amr.tools.datascript.MakeDevData.makeDevData(MakeDevData.java:143)
        at de.saar.coli.amrtagging.formalisms.amr.tools.datascript.MakeDevData.main(MakeDevData.java:100)

alexanderkoller commented 4 years ago

Can you try to delete (or move away) your ~/.cogcomp-datastore? This forces the UIUC tool to re-download everything.
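
For anyone who prefers moving the cache away over deleting it, here is a minimal Python sketch. It only renames the directory so it can be restored later. The helper name `move_datastore_aside` and the `.bak-<timestamp>` suffix are made up for illustration; only the `~/.cogcomp-datastore` location comes from the log above.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def move_datastore_aside(datastore: Path) -> Optional[Path]:
    """Rename a cache directory so that the next run cannot find it.

    Returns the backup path, or None if the directory did not exist.
    The helper name and the backup suffix are illustrative assumptions.
    """
    if not datastore.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = datastore.with_name(f"{datastore.name}.bak-{stamp}")
    shutil.move(str(datastore), str(backup))
    return backup


# Example call (not executed here, since it would touch the home directory):
# move_datastore_aside(Path.home() / ".cogcomp-datastore")
```

If the re-download does not change anything, the backup can simply be renamed back.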

namednil commented 4 years ago

No, it didn't help.

alexanderkoller commented 4 years ago

My mistake. I will fix this.

alexanderkoller commented 4 years ago

Should be fixed in am-tools now, can you test it?

namednil commented 4 years ago

I guess it works, but the recall is terrible:

  69813 O
      6 LOC
      2 ORGANIZATION

Before, we had:

 122491 O
    440 LOC
    143 MISC
     70 PERSON
     36 ORGANIZATION

I think it is very sensitive to capitalization and we feed it lowercased text. I'll see if I can fix that; it should improve our scores.
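
Since the two runs cover different numbers of tokens, the raw counts are easier to compare as rates. Below is a small Python sketch that computes the fraction of tokens that received a non-O tag, using the counts posted above. The names `entity_rate`, `new_run` and `previous_run` are only illustrative, and the rate is a rough proxy rather than true recall, because there are no gold labels involved.

```python
from collections import Counter


def entity_rate(tag_counts):
    """Fraction of tokens whose NER tag is anything other than 'O'."""
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    entities = sum(n for tag, n in tag_counts.items() if tag != "O")
    return entities / total


# Tag counts copied from the two listings above.
new_run = Counter({"O": 69813, "LOC": 6, "ORGANIZATION": 2})
previous_run = Counter(
    {"O": 122491, "LOC": 440, "MISC": 143, "PERSON": 70, "ORGANIZATION": 36}
)

for name, counts in [("new run", new_run), ("previous run", previous_run)]:
    print(f"{name}: {entity_rate(counts):.4%} of tokens tagged as entities")
```

That works out to about 0.01% of tokens in the new run versus about 0.56% before, so the drop is far larger than the difference in corpus size would explain.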

namednil commented 4 years ago

Looks like we already use the true-cased tokens for the NER. This means that either the recall really is that crappy or there's a bug in our interface. I'd say let's stick to the simpler tagset.

alexanderkoller commented 4 years ago

Agreed.