megodoonch closed this issue 5 years ago
From the supplementary material of ACL 2018 paper:
"For unseen words, if the lexicalized node has outgoing ARGx edges, we first try to find a verb lemma for the word in WordNet (Miller, 1995) (we use version 3.0). If that fails, we try, again in WordNet, to find the closest verb derivationally related to any lemma of the word. If that also fails, we take the word literally. In any case, we add "-01" to the label."
Wordnet is used in two places in am-parser:
We should proceed as follows:
@namednil and I just looked at ConceptNet, and it looks as if this might actually be good enough as a Wordnet replacement:
ConceptNet can be downloaded from their website (500MB gzipped CSV file). It can be accessed using their Python library or this experimental Java library. The source code of the Java library has a very permissive license, so we could just copy it into our own source tree if that turns out to be convenient, provided we preserve the copyright notice.
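If we end up not using either library, streaming the dump directly is also an option. Here is a minimal sketch in plain Java, assuming the dump's tab-separated layout (edge URI, relation, start concept, end concept, JSON metadata); class and method names are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Illustrative reader for the gzipped ConceptNet CSV dump. */
class ConceptNetCsvReader {
    /** Holds one edge from the dump. */
    static final class Edge {
        final String relation, start, end;
        Edge(String relation, String start, String end) {
            this.relation = relation; this.start = start; this.end = end;
        }
    }

    /** Parses one tab-separated dump line; returns null if malformed. */
    static Edge parseLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 4) return null;
        return new Edge(fields[1], fields[2], fields[3]);
    }

    /** Streams the dump and prints English edges with the given relation. */
    static void printEdges(String gzPath, String relation) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gzPath)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                Edge e = parseLine(line);
                if (e != null && e.relation.equals(relation) && e.start.startsWith("/c/en/")) {
                    System.out.println(e.start + " -> " + e.end);
                }
            }
        }
    }
}
```

A single linear pass over the 500MB file like this is cheap enough that we could also pre-filter the dump once to the relations we need and load only that subset at alignment time.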
I'm now factoring out all uses of WordNet, so we can replace them with something else if need be.
All access to WordNet now goes through the class WordnetEnumerator, which implements a new interface, IWordnet. Code outside of WordnetEnumerator only ever uses the methods declared in IWordnet, so we can replace WordnetEnumerator with some other implementation simply by implementing IWordnet differently.
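For anyone unfamiliar with the pattern, the shape is roughly the following; the interface and method names below are simplified stand-ins for illustration, not the actual IWordnet signatures:

```java
import java.util.Collections;
import java.util.Set;

/** Cut-down stand-in for the IWordnet idea: callers depend only on the
 *  interface, so the backend can be swapped without touching them. */
interface ILexicalResource {
    /** Returns candidate verb lemmas for a word, or an empty set if none. */
    Set<String> verbLemmas(String word);
}

/** Stand-in for WordnetEnumerator: a WordNet-backed implementation. */
class WordnetBackend implements ILexicalResource {
    @Override
    public Set<String> verbLemmas(String word) {
        // Real code would query WordNet here; this stub just lowercases.
        return Collections.singleton(word.toLowerCase());
    }
}

/** A drop-in alternative backed by ConceptNet edges (stubbed here). */
class ConceptNetBackend implements ILexicalResource {
    @Override
    public Set<String> verbLemmas(String word) {
        // Real code would look up derivational edges in ConceptNet instead.
        return Collections.singleton(word.toLowerCase());
    }
}

class LemmaLookupDemo {
    /** Client code only sees the interface, so either backend works. */
    static String firstLemmaOrLiteral(ILexicalResource res, String word) {
        Set<String> lemmas = res.verbLemmas(word);
        return lemmas.isEmpty() ? word : lemmas.iterator().next();
    }
}
```

The point is that client code like firstLemmaOrLiteral never names a concrete backend, so swapping WordNet for ConceptNet is a one-line change at the construction site.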
Most methods in IWordnet are straightforward; I would argue that findNounStem and findVerbStem can simply stay as they are, since they are used like that in the JAMR aligner. The other methods could probably be reimplemented in terms of ConceptNet. One method that is a bit mysterious to me is getWNCandidates; it would be good if someone could look at it to find out what exactly it does.
A similar refactoring could probably be done for #21 and #22.
To run the preprocessing on a minicorpus and then get the conll file with the AM dependency trees:
Put your minicorpus directory here: /proj/irtg/amrtagging/amr-dependency-july2019/amr-dependency/ (i.e. your directory should contain train/, dev/, and test/)
cd to /proj/irtg/amrtagging/amr-dependency-july2019/amr-dependency/scripts/
run

```
bash preprocess-no-baseline.sh -m <yourcorpusdirectoryname>
```
Follow the AMR instructions at the bottom of https://github.com/coli-saar/am-parser/wiki/Converting-individual-formalisms-to-AM-CoNLL#amr
where \<nnData> is yourcorpusdirectoryname/data/nnData/
HOWEVER, you should probably wait, as we are currently updating the Alto version being called, and therefore also the preprocessing script. The preprocessor calls the Alto jar file that's in amr-dependency, which is not the current version and is not what you would be changing.
@alexanderkoller asked me to find out when DependencyExtractorCLI changed. It seems to have changed between BitBucket and GitHub: the version I have is still on BitBucket, and the initial commit on GitHub is pretty much the one you have.
Ah! But some default argument values point to Matthias. And the files he writes seem to contain all the same info as in sentences.txt etc, so maybe he can help extract this information efficiently.
Okay, a new am-tools is pushed that runs the preprocessor on the minicorpus. I also pushed the bash script to am-parser.
WAIT no, I think something's wrong...
Nope, we're good, just NetBeans being weird.
I have now pushed a version of am-tools which can use ConceptNet for semantic relations in the aligner (by passing "--conceptnet x" on the command line) and a version of am-parser with a preprocessing script that uses the --conceptnet flag.
A (hopefully) suitable version of ConceptNet is in /proj/irtg/amrtagging/amr-dependency-july2019/amr-dependency/data/
I am closing this issue because the basic functionality seems to work (tested on the "mini" corpus). This should now be run on the real data to see how it performs.
The preprocessing is running.
We forgot to have WordNet put on the MRP whitelist, so we now shouldn't use it beyond what the JAMR aligner does (i.e. stemming).