namednil closed this issue 5 years ago
All places in am-tools that use NER now accept a new command-line option --uiuc-ner-tagset that can take the values NER_CONLL (for the four-class tagset we have used so far; this is still the default) and NER_ONTONOTES. It is probably important that we rerun the entire preprocessing pipeline with consistent NER tags. Reassigning the issue back to you for this, @namednil.
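To make the option's behaviour concrete, here is a minimal sketch of how such a flag could be resolved, with NER_CONLL as the fallback. The class name TagsetOption, the enum NerTagset, and the method parseTagset are hypothetical and are not taken from am-tools; the assumption that the annotation view carries the same name as the tagset value is also mine.

```java
/** Hypothetical sketch; none of these names come from am-tools. */
class TagsetOption {
    enum NerTagset {
        NER_CONLL,     // four-class tagset, the default
        NER_ONTONOTES  // fine-grained OntoNotes tagset
    }

    /**
     * Returns the tagset named after --uiuc-ner-tagset,
     * or NER_CONLL if the option is not given.
     */
    static NerTagset parseTagset(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("--uiuc-ner-tagset")) {
                // valueOf throws IllegalArgumentException for unknown values
                return NerTagset.valueOf(args[i + 1]);
            }
        }
        return NerTagset.NER_CONLL;
    }

    public static void main(String[] args) {
        NerTagset tagset = parseTagset(args);
        // Assumption: the view to request is named after the chosen tagset,
        // so the lookup key has to follow the option rather than being fixed.
        System.out.println("Would request view: " + tagset.name());
    }
}
```

Deriving the lookup key from the parsed value in one place keeps the tagger and the code that reads its output from disagreeing about which tagset is in use.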
See here for the NER tags supported by OntoNotes: https://spacy.io/api/annotation#named-entities
Do you have any idea what's going on here?
java -Xmx700G -cp am-tools-all.jar de.saar.coli.amrtagging.formalisms.amr.tools.datascript.MakeDevData -c newNER//data/alto/test/ -o newNER//data/nnData/test/ --companion /proj/irtg/sempardata/mrp/LDC2019E45/2019/companion/test_companion.conllu --uiuc-ner-tagset NER_ONTONOTES >>newNER//data/preprocessLog 2>&1
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Downloading the folder from datastore . . .
GroupId: readonly.org.cogcomp.gazetteers
ArtifactId: 1.6/gazetteers.zip
The target /home/CE/mlinde/.cogcomp-datastore/readonly.org.cogcomp.gazetteers/1.6/gazetteers already exists. Skipping download from the datastore . . .
Downloading the folder from datastore . . .
GroupId: readonly.org.cogcomp.brown-clusters
ArtifactId: 1.5/brown-clusters.zip
The target /home/CE/mlinde/.cogcomp-datastore/readonly.org.cogcomp.brown-clusters/1.5/brown-clusters already exists. Skipping download from the datastore . . .
Downloading the folder from datastore . . .
GroupId: readonly.edu.illinois.cs.cogcomp.ner
ArtifactId: 4.0/ner-model-ontonotes-all-data.zip
augmentedGroupId: readonly.edu.illinois.cs.cogcomp.ner
versionedFileName: 4.0/ner-model-ontonotes-all-data.zip
zippedFileName: /home/CE/mlinde/.cogcomp-datastore/tmp/4.0/ner-model-ontonotes-all-data.zip
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level1.lex
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level2.lex
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level1
file unzip : /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level2
Done
zippedFileName: /home/CE/mlinde/.cogcomp-datastore/tmp/4.0/ner-model-ontonotes-all-data.zip
path: /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data
artifactId: ner-model-ontonotes-all-data
Model file read from /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level1
Model file read from /home/CE/mlinde/.cogcomp-datastore/readonly.edu.illinois.cs.cogcomp.ner/4.0/ner-model-ontonotes-all-data/model/OntoNotes.model.level2
Exception in thread "main" java.lang.IllegalArgumentException: View NER_CONLL not found
at edu.illinois.cs.cogcomp.core.datastructures.textannotation.AbstractTextAnnotation.getView(AbstractTextAnnotation.java:134)
at de.saar.coli.amrtagging.formalisms.amr.tools.preproc.UiucNamedEntityRecognizer.tag(UiucNamedEntityRecognizer.java:65)
at de.saar.coli.amrtagging.formalisms.amr.tools.datascript.MakeDevData.makeDevData(MakeDevData.java:143)
at de.saar.coli.amrtagging.formalisms.amr.tools.datascript.MakeDevData.main(MakeDevData.java:100)
Can you try to delete (or move away) your ~/.cogcomp-datastore? This forces the UIUC tool to re-download everything.
No, it didn't help.
My mistake. I will fix this.
This should be fixed in am-tools now. Can you test it?
I guess it works, but the recall is terrible:
69813 O
6 LOC
2 ORGANIZATION
Before, we had:
122491 O
440 LOC
143 MISC
70 PERSON
36 ORGANIZATION
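Since the two runs have different token totals, the raw counts are easier to compare as a rate. The sketch below computes the fraction of tokens that received any entity tag, using the counts quoted above; the class and method names are hypothetical. Note that this is only a proxy: actual recall would need gold NER labels, which this thread does not show.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for comparing tag distributions; not part of am-tools. */
class NerTagStats {
    /** Fraction of tokens whose tag is anything other than "O". */
    static double entityRate(Map<String, Integer> counts) {
        long total = 0;
        long entities = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            total += e.getValue();
            if (!e.getKey().equals("O")) {
                entities += e.getValue();
            }
        }
        return total == 0 ? 0.0 : (double) entities / total;
    }

    public static void main(String[] args) {
        // Counts from the run with --uiuc-ner-tagset NER_ONTONOTES
        Map<String, Integer> ontonotesRun = new LinkedHashMap<>();
        ontonotesRun.put("O", 69813);
        ontonotesRun.put("LOC", 6);
        ontonotesRun.put("ORGANIZATION", 2);

        // Counts from the earlier run
        Map<String, Integer> earlierRun = new LinkedHashMap<>();
        earlierRun.put("O", 122491);
        earlierRun.put("LOC", 440);
        earlierRun.put("MISC", 143);
        earlierRun.put("PERSON", 70);
        earlierRun.put("ORGANIZATION", 36);

        System.out.println("new run:     " + entityRate(ontonotesRun));
        System.out.println("earlier run: " + entityRate(earlierRun));
    }
}
```

With these numbers, 8 of 69821 tokens are tagged in the new run (about 0.01%), against 689 of 123180 in the earlier one (about 0.56%).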
I think it is very sensitive to capitalization, and we feed it lowercased text. I'll see if I can fix that; it should improve our scores.
Looks like we already use the truecased tokens for the NER. This means that either the recall really is that crappy, or there's a bug in our interface. I'd say let's stick to the simpler tagset.
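One cheap way to check the casing hypothesis is to measure what share of the tokens handed to the tagger start with an uppercase letter. This is a minimal sketch with hypothetical names and made-up example tokens; it does not reflect how am-tools actually passes tokens to the UIUC tagger.

```java
import java.util.List;

/** Hypothetical diagnostic; not part of am-tools. */
class CasingCheck {
    /** Fraction of tokens whose first character is an uppercase letter. */
    static double capitalizedFraction(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long capitalized = tokens.stream()
                .filter(t -> !t.isEmpty() && Character.isUpperCase(t.codePointAt(0)))
                .count();
        return (double) capitalized / tokens.size();
    }

    public static void main(String[] args) {
        // Illustrative tokens only, not taken from the data
        List<String> truecased = List.of("Angela", "Merkel", "visited", "Paris");
        List<String> lowercased = List.of("angela", "merkel", "visited", "paris");
        System.out.println(capitalizedFraction(truecased));
        System.out.println(capitalizedFraction(lowercased));
    }
}
```

A fraction near zero on real input would point to lowercased text reaching the tagger; a plausible fraction would instead point toward the model or the interface.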
Agreed.
Currently, we use the 4-class NE tagset: https://github.com/coli-saar/am-parser/issues/3#issuecomment-511452032
Since we are now convinced that the companion data + new entity tagger didn't hurt us, we could now switch back to the more fine-grained tagset. Maybe this helps us a little.