Replace NER in AMR pipeline

alexanderkoller commented 5 years ago

In the old AMR pipeline, we used the Stanford CoreNLP library for POS tagging and NER. For the shared task, we can use gold POS tags, but we need to use the UIUC NER tool.

Extend the AMR pipeline so it works with this new setup. Ideally, users should be able to choose a POS tagging and NER setup when they call the AMR preprocessor.

alexanderkoller commented 5 years ago

I factored out the NER functionality into a separate class and implemented both Stanford and UIUC. I still need to test whether they use the same tagsets, and whether the identity of the tags matters anywhere.
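A minimal sketch of what such a factored-out NER abstraction could look like. All names here (`NamedEntityTagger`, `NerSetup`, `PlaceholderTagger`) are hypothetical and are not the classes in the repository. The placeholder tagger returns `O` for every token instead of calling Stanford CoreNLP or the UIUC tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical demo: pick a NER backend by name, e.g. from a command-line flag.
class NerSetupDemo {
    public static void main(String[] args) {
        String choice = args.length > 0 ? args[0] : "uiuc";
        NamedEntityTagger tagger = NerSetup.fromName(choice).create();
        List<String> tokens = List.of("Angela", "Merkel", "visited", "Paris");
        System.out.println(tagger.tag(tokens));
    }
}

// Common interface: one tag per input token, "O" for tokens outside any entity.
interface NamedEntityTagger {
    String OUTSIDE = "O";

    List<String> tag(List<String> tokens);
}

// The selectable setups. Each constant knows how to build its tagger.
enum NerSetup {
    STANFORD(() -> new PlaceholderTagger("stanford")),
    UIUC(() -> new PlaceholderTagger("uiuc"));

    private final Supplier<NamedEntityTagger> factory;

    NerSetup(Supplier<NamedEntityTagger> factory) {
        this.factory = factory;
    }

    NamedEntityTagger create() {
        return factory.get();
    }

    // Case-insensitive lookup; throws IllegalArgumentException for unknown names.
    static NerSetup fromName(String name) {
        return valueOf(name.trim().toUpperCase(Locale.ROOT));
    }
}

// Stand-in for a real backend (assumption: a real one would wrap the library call).
final class PlaceholderTagger implements NamedEntityTagger {
    private final String backend;

    PlaceholderTagger(String backend) {
        this.backend = backend;
    }

    String backend() {
        return backend;
    }

    @Override
    public List<String> tag(List<String> tokens) {
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            tags.add(OUTSIDE);
        }
        return tags;
    }
}
```

With an interface like this, call sites would depend only on `NamedEntityTagger`, so switching between Stanford and UIUC would come down to which `NerSetup` value the preprocessor is started with.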

NERTest also uses CRFClassifier, but I think NERTest is not called from anywhere, so I will leave it alone.

However, there are many places where the Stanford NER tagger is called without going through this class; see the list below.

@namednil Many of those uses are for graphbanks other than AMR. Is this an issue? Do we need to replace Stanford NER with UIUC in all of those toolchains too?

src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/ToAMConll.java:            List<String> nerTags = stanfSent.nerTags();
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/ToAMConll.java:                ners.set(origPositions.get(j), nerTags.get(j));
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:                    List<String> origNE = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:                    List<String> nerTags = new ArrayList<>();
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:                        nerTags.add(origNE.get(span.start));
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:                    Future<ConllSentence> future = executor.submit(new Task(instance, dictionary, posTags, lemmas, nerTags, sent));
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:        private final List<String> nerTags;
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:        public Task(MRInstance inst,SupertagDictionary dict, List<String> posTags, List<String> lemmas, List<String> nerTags, List<String> repl){
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:            this.nerTags = nerTags;
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/DependencyExtractorCLITimeout.java:                     cs.addNEs(nerTags);
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/NERTagger.java:            List<String> nerTags = stanfSent.nerTags();
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/NERTagger.java:            for (int j = 0; j < nerTags.size(); j++){
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/NERTagger.java:                ners.set(origPositions.get(j), nerTags.get(j));
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/PrepareTestDataFromFiles.java:            List<String> nerTags = stanfSent.nerTags();
src/main/java//de/saar/coli/amrtagging/formalisms/amr/tools/PrepareTestDataFromFiles.java:                ners.set(origPositions.get(j), nerTags.get(j));
src/main/java//de/saar/coli/amrtagging/formalisms/eds/tools/CreateCorpusParallel.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/eds/tools/PrepareTestData.java:            ArrayList<String> neTags = new ArrayList<>(stanfordSent.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/eds/tools/CreateCorpus.java:                    //List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/ucca/tools/CreateCorpus.java:                    //sent.addNEs(stanfAn.nerTags()); //slow, only add this for final creation of training data
src/main/java//de/saar/coli/amrtagging/formalisms/ud/tools/Tagger.java:            sent.addNEs(stanfSent.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/tools/PrepareFinalTestData.java:            List<String> neTags = new ArrayList<>(stanfordSent.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/tools/PrepareDevData.java:            List<String> neTags = new ArrayList<>(stanfordSent.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/psd/tools/CreateCorpusSorted.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/psd/tools/CreateCorpusParallel.java:                        List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/psd/tools/CreateCorpus.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/dm/tools/CreateCorpusParallelRandom.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/dm/tools/CreateCorpusParallel.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/dm/tools/CreateCorpus.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/pas/tools/CreateCorpusParallel.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
src/main/java//de/saar/coli/amrtagging/formalisms/sdp/pas/tools/CreateCorpus.java:                    List<String> neTags = new ArrayList<>(stanfAn.nerTags());
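Several of the AMR hits above share the pattern `ners.set(origPositions.get(j), nerTags.get(j))`. The sketch below shows one plausible reading of it: tags are computed on a filtered token sequence and then written back to the positions those tokens had in the original sentence. The meaning of `origPositions`, the method name `projectTags`, and the default tag are assumptions; only the `set` line itself comes from the listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TagProjectionDemo {
    // Writes nerTags[j] to position origPositions[j] of a list that is
    // pre-filled with defaultTag and has one slot per original token.
    static List<String> projectTags(int originalLength,
                                    List<Integer> origPositions,
                                    List<String> nerTags,
                                    String defaultTag) {
        if (origPositions.size() != nerTags.size()) {
            throw new IllegalArgumentException("need exactly one position per tag");
        }
        List<String> ners = new ArrayList<>(Collections.nCopies(originalLength, defaultTag));
        for (int j = 0; j < nerTags.size(); j++) {
            ners.set(origPositions.get(j), nerTags.get(j));
        }
        return ners;
    }

    public static void main(String[] args) {
        // Five original tokens; only tokens 0, 2 and 4 were passed to the tagger.
        List<String> projected = projectTags(
                5, List.of(0, 2, 4), List.of("PERSON", "O", "LOCATION"), "O");
        System.out.println(projected); // prints [PERSON, O, O, O, LOCATION]
    }
}
```

If this reading is right, a replacement tagger only has to produce `nerTags` in the same order as the filtered tokens; the projection step itself does not depend on which NER tool is used.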

alexanderkoller commented 5 years ago

Digging through the preproc script some more, is it possible that NER is never used on the training data? If I read the code correctly, we use MakeDevData on the dev and test set to perform NER there. ToAMConll also does NER, although I don't see where it gets called for AMR. But are the named entities perhaps detected at training time through the alignments to the AMR graph?

namednil commented 5 years ago

> Digging through the preproc script some more, is it possible that NER is never used on the training data? [...] But are the named entities perhaps detected at training time through the alignments to the AMR graph?

Interesting... I mean, it makes sense.

> ToAMConll also does NER, although I don't see where it gets called for AMR.

It's the line stanfSent.nerTags() that calls the NER tagger.

We don't use most of the classes you listed for the shared task. We definitely need it in ToAMConll and in MakeDevData. Also, we may want to use it in de.saar.coli.amrtagging.mrp.tools.CreateCorpus* to have a cleaner system description that we use NE tags for all graphbanks (maybe it also helps a little).

Which tagset do we use for NER? Since AMR distinguishes between countries, cities etc. we might want to use a fine-grained version.

alexanderkoller commented 5 years ago

> Interesting... I mean, it makes sense.

... if you have perfect alignments. I wonder if the aligner could benefit from NER at training time. Certainly when we look into unsupervised methods, it should.

> > ToAMConll also does NER, although I don't see where it gets called for AMR.
>
> It's the line stanfSent.nerTags() that calls the NER tagger.

Yes, I know, and I already replaced it. My question was where ToAMConll gets called in the AMR pipeline. I can't find a call to it there.

> We don't use most of the classes you listed for the shared task. We definitely need it in ToAMConll and in MakeDevData. Also, we may want to use it in

Okay, these are the two places where I made UIUC easier to access today. I can't test ToAMConll, though, because I don't know where it is called.

> de.saar.coli.amrtagging.mrp.tools.CreateCorpus* to have a cleaner system description that we use NE tags for all graphbanks (maybe it also helps a little).

Maybe explain that to me tomorrow. It is probably not hard to do.

> Which tagset do we use for NER? Since AMR distinguishes between countries, cities etc. we might want to use a fine-grained version.

I used the same four-class CoNLL tagset that the Stanford tagger used (person/organization/location/misc). This seemed safer than changing two things at once. But as far as I can tell, the actual NER tag is never used, so a more fine-grained tagset would change nothing.
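Whether the two tools spell these four classes identically is left untested earlier in the thread. If the spellings did differ, a small normalization step along the following lines could map both onto one set of labels and fail loudly on anything unexpected. The label spellings (`PERSON` vs. `PER`, and so on), the optional `B-`/`I-` prefixes, and the class name `TagsetDemo` are assumptions for illustration, not verified output of either tool.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class TagsetDemo {
    // Assumed spellings of the four CoNLL classes, long and short form.
    private static final Map<String, String> TO_CONLL = Map.of(
            "PERSON", "PER", "PER", "PER",
            "ORGANIZATION", "ORG", "ORG", "ORG",
            "LOCATION", "LOC", "LOC", "LOC",
            "MISC", "MISC");

    // Maps one tag to its short four-class form; "O" stays "O".
    static String normalize(String tag) {
        // Drop a chunk prefix such as "B-" or "I-" if one is present.
        int dash = tag.indexOf('-');
        String bare = (dash >= 0 ? tag.substring(dash + 1) : tag).toUpperCase(Locale.ROOT);
        if (bare.equals("O")) {
            return "O";
        }
        String mapped = TO_CONLL.get(bare);
        if (mapped == null) {
            // Surfacing unknown labels makes tagset mismatches visible early.
            throw new IllegalArgumentException("unknown NER tag: " + tag);
        }
        return mapped;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("PERSON", "O", "B-LOC");
        List<String> normalized = raw.stream()
                .map(TagsetDemo::normalize)
                .collect(Collectors.toList());
        System.out.println(normalized); // prints [PER, O, LOC]
    }
}
```

If the tag identity really is never read downstream, a step like this would be unnecessary; it mainly serves as a cheap check that both taggers stay within the same four classes.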

namednil commented 5 years ago

> My question was where ToAMConll gets called in the AMR pipeline. I can't find a call to it there.

I have been wanting to add this for a few days. I just added it to the bash scripts and pushed it.

alexanderkoller commented 5 years ago

I just pushed a commit that adds NER to de.saar.coli.amrtagging.mrp.tools.CreateCorpus* in the place you requested. Please test it.

We don't need to touch the CreateCorpus methods in the individual formalisms (see above), right?

Can we close this issue?

namednil commented 5 years ago

No, we don't. Yes, I think we can close the issue.

alexanderkoller commented 5 years ago

yayyyyyyyy

coli-saar / am-parser

Replace NER in AMR pipeline #3