megodoonch closed this issue 5 years ago
From the supplementary material of ACL 2018 paper:
"For unseen words, if the lexicalized node has outgoing ARGx edges, we first try to find a verb lemma for the word in WordNet (Miller, 1995) (we use version 3.0). If that fails, we try, again in WordNet, to find the closest verb derivationally related to any lemma of the word. If that also fails, we take the word literally. In any case, we add "-01" to the label."
Wordnet is used in two places in am-parser:
We should proceed as follows:
@namednil and I just looked at ConceptNet, and it looks as if this might actually be good enough as a Wordnet replacement:
ConceptNet can be downloaded from their website (500MB gzipped CSV file). It can be accessed using their Python library or this experimental Java library. The source code of the Java library has a very permissive license, so we could just copy it into our own source tree if that turns out to be convenient, provided we preserve the copyright notice.
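If we end up not using either library, streaming the dump directly is also an option. Here is a minimal sketch in plain Java, assuming the dump's tab-separated layout (edge URI, relation, start concept, end concept, JSON metadata); class and method names are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Illustrative reader for the gzipped ConceptNet CSV dump. */
class ConceptNetCsvReader {
    /** Holds one edge from the dump. */
    static final class Edge {
        final String relation, start, end;
        Edge(String relation, String start, String end) {
            this.relation = relation; this.start = start; this.end = end;
        }
    }

    /** Parses one tab-separated dump line; returns null if malformed. */
    static Edge parseLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 4) return null;
        return new Edge(fields[1], fields[2], fields[3]);
    }

    /** Streams the dump and prints English edges with the given relation. */
    static void printEdges(String gzPath, String relation) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gzPath)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                Edge e = parseLine(line);
                if (e != null && e.relation.equals(relation) && e.start.startsWith("/c/en/")) {
                    System.out.println(e.start + " -> " + e.end);
                }
            }
        }
    }
}
```

A single linear pass over the 500MB file like this is cheap enough that we could also pre-filter the dump once to the relations we need and load only that subset at alignment time.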
I'm now factoring out all uses of WordNet, so we can replace them with something else if need be.
All access to WordNet now goes through the class WordnetEnumerator, which implements a new interface, IWordnet. Code outside of WordnetEnumerator only ever uses the methods declared in IWordnet, so we can replace WordnetEnumerator with some other implementation simply by implementing IWordnet differently.
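For anyone unfamiliar with the pattern, the shape is roughly the following; the interface and method names below are simplified stand-ins for illustration, not the actual IWordnet signatures:

```java
import java.util.Collections;
import java.util.Set;

/** Cut-down stand-in for the IWordnet idea: callers depend only on the
 *  interface, so the backend can be swapped without touching them. */
interface ILexicalResource {
    /** Returns candidate verb lemmas for a word, or an empty set if none. */
    Set<String> verbLemmas(String word);
}

/** Stand-in for WordnetEnumerator: a WordNet-backed implementation. */
class WordnetBackend implements ILexicalResource {
    @Override
    public Set<String> verbLemmas(String word) {
        // Real code would query WordNet here; this stub just lowercases.
        return Collections.singleton(word.toLowerCase());
    }
}

/** A drop-in alternative backed by ConceptNet edges (stubbed here). */
class ConceptNetBackend implements ILexicalResource {
    @Override
    public Set<String> verbLemmas(String word) {
        // Real code would look up derivational edges in ConceptNet instead.
        return Collections.singleton(word.toLowerCase());
    }
}

class LemmaLookupDemo {
    /** Client code only sees the interface, so either backend works. */
    static String firstLemmaOrLiteral(ILexicalResource res, String word) {
        Set<String> lemmas = res.verbLemmas(word);
        return lemmas.isEmpty() ? word : lemmas.iterator().next();
    }
}
```

The point is that client code like firstLemmaOrLiteral never names a concrete backend, so swapping WordNet for ConceptNet is a one-line change at the construction site.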
Most methods in IWordnet are straightforward; I would argue that findNounStem and findVerbStem can simply stay as they are, since they are used like that in the JAMR aligner. The other methods could probably be reimplemented in terms of ConceptNet. One method that is a bit mysterious to me is getWNCandidates; it would be good if someone could look at it to find out what exactly it does.
A similar refactoring could probably be done for #21 and #22.
To run the preprocessing on a minicorpus and then get the conll file with the AM dependency trees:
Put your minicorpus directory here: /proj/irtg/amrtagging/amr-dependency-july2019/amr-dependency/ (i.e. your directory should contain train/, dev/, and test/)
cd to /proj/irtg/amrtagging/amr-dependency-july2019/amr-dependency/scripts/
run

```
bash preprocess-no-baseline.sh -m <yourcorpusdirectoryname>
```
Follow the AMR instructions at the bottom of https://github.com/coli-saar/am-parser/wiki/Converting-individual-formalisms-to-AM-CoNLL#amr
where \<nnData> is yourcorpusdirectoryname/data/nnData/
HOWEVER, you should probably wait, as we are currently updating the Alto version being called, and therefore also the preprocessing script. The preprocessor calls the Alto jar file that's in amr-dependency, which is not the current version and is not what you would be changing.
@alexanderkoller asked me to find out when DependencyExtractorCLI changed. It seems to have changed between BitBucket and GitHub: the version I have is still on BitBucket, and the initial commit on GitHub is pretty much the one you have.
Ah! But some default argument values point to Matthias. And the files he writes seem to contain all the same info as in sentences.txt etc, so maybe he can help extract this information efficiently.
Okay, a new am-tools is pushed that runs the preprocessor on the minicorpus. I also pushed the bash script to am-parser.
WAIT no, I think something's wrong...
Nope, we're good, just NetBeans being weird.
I have now pushed a version of am-tools which can use ConceptNet for semantic relations in the aligner (by passing "--conceptnet x" on the command line) and a version of am-parser with a preprocessing script that uses the --conceptnet flag.
A (hopefully) suitable version of ConceptNet is in /proj/irtg/amrtagging/amr-dependency-july2019/amr-dependency/data/
I am closing this issue because the basic functionality seems to work (tested on the "mini" corpus). This should now be run on the real data to see how it performs.
The preprocessing is running.
We forgot to have WordNet put on the MRP whitelist, so we now shouldn't use it beyond what the JAMR aligner does (i.e. stemming).