amir-zeldes / gum

Repository for the Georgetown University Multilayer Corpus (GUM)
https://gucorpling.org/gum/

Better constituent parsing #63

Closed amir-zeldes closed 3 years ago

amir-zeldes commented 3 years ago

Replace CoreNLP with a SOTA neural constituent parser.

nitinvwaran commented 3 years ago

I can take a look at this (I found the Windows dependency). Starting by searching for a SOTA parser.

amir-zeldes commented 3 years ago

Thanks @nitinvwaran , another possible avenue if you're researching this is dependency to constituent conversion, since we have gold dependency parses. I'm not sure what would be more accurate ATM: SOTA constituent parses from gold tags (but the parser is then still trained on different domains than our corpus), or conversions (which are probably not 100% correct, but I don't know current numbers)

nitinvwaran commented 3 years ago

The current SOTA in constituent parsing is listed and collated here by Sebastian Ruder et al., ranked by F1 scores.

In the list (focusing on self-attention-based architectures only):

I could find very little on dependency-to-constituent conversion. The latest paper I could find is here, with no codebase support. Another relatively recent paper is here, with its codebase here. The former paper only treats conversion from UD to constituent trees. The best F1 reached by this paper is 95.6 on the Penn Treebank using Stanford Dependencies, and 90.48 using UD on EWT.
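For intuition, a deliberately naive dependency-to-constituent conversion (every head projects one phrase over itself and its dependents) can be sketched as below. This is an illustration only, nothing like the converters in the papers above, which must also recover real phrase labels:

```python
def dep_to_const(words, tags, heads):
    """Toy head-projection conversion: every head projects a single 'X'
    phrase over itself and its dependents (illustration only; a real
    converter must recover genuine phrase labels like NP/VP)."""
    kids = {i: [j for j, h in enumerate(heads) if h == i]
            for i in range(len(words))}

    def build(i):
        parts = [build(j) for j in kids[i] if j < i]      # left dependents
        parts.append(f"({tags[i]} {words[i]})")           # the head itself
        parts += [build(j) for j in kids[i] if j > i]     # right dependents
        return parts[0] if len(parts) == 1 else "(X " + " ".join(parts) + ")"

    return build(heads.index(-1))  # -1 marks the root token

print(dep_to_const(["I", "ate", "pizza"], ["PRP", "VBD", "NN"], [1, -1, 1]))
# → (X (PRP I) (VBD ate) (NN pizza))
```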

Based on these findings, would you recommend any next steps, @amir-zeldes? (Currently thinking of starting with constituent parsers that take POS inputs.)

amir-zeldes commented 3 years ago

Thanks for putting together this overview! Assuming the top parser outputs standard PTB constituent trees (I see the best one also uses HPSG-inspired representations internally, but not as output, it seems?), then it would be a good starting place indeed, and seems much better than the Stanford Parser we're using right now. Removing non-Python dependencies is also a plus, and not adding spaCy as a preprocessor is probably good, since its models are large. Does the best parser have a way of fetching its XLNet embeddings by itself?

From a format perspective, the output should be serialized here and look like the files in there to be compatible with the rest of the pipeline.

nitinvwaran commented 3 years ago

Took a closer look at the code for the best model: the XLNet model used is initialized and loaded directly from the transformers library. The HPSG span representations are decoded in memory, but the spans are then immediately converted into constituent trees, so the HPSG spans aren't saved. I think it's PTB trees that are being saved; I'll need to debug and check.
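The span-to-tree step mentioned above (labeled spans rendered as a bracket string) can be sketched as follows. This is only an illustration of the idea, not the LAL/HPSG decoder itself; spans are assumed to be properly nested with no duplicates:

```python
def spans_to_tree(spans, words, tags):
    """Render labeled spans (label, start, end) over a tagged sentence as a
    PTB-style bracket string. Assumes spans nest properly, no duplicates."""
    def build(label, start, end, children):
        parts, i = [], start
        while i < end:
            # pick the widest child starting at token i, if any
            starting = [c for c in children if c[1] == i]
            if starting:
                child = max(starting, key=lambda c: c[2] - c[1])
                inside = [c for c in children if c != child
                          and child[1] <= c[1] and c[2] <= child[2]]
                parts.append(build(child[0], child[1], child[2], inside))
                i = child[2]
            else:
                parts.append(f"({tags[i]} {words[i]})")  # preterminal leaf
                i += 1
        return "(" + label + " " + " ".join(parts) + ")"

    top = max(spans, key=lambda s: s[2] - s[1])  # widest span is the root
    return build(top[0], top[1], top[2], [s for s in spans if s != top])

tree = spans_to_tree([("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)],
                     ["I", "ate", "pizza"], ["PRP", "VBD", "NN"])
print(tree)
# → (S (NP (PRP I)) (VP (VBD ate) (NP (NN pizza))))
```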

amir-zeldes commented 3 years ago

Sounds perfect, thanks!

nitinvwaran commented 3 years ago

Have been able to generate constituent parses using the LabelAttention + HPSG + XLNet model, using the PTB POS tags from the XML as input. Also tested the build bot; the pipeline runs to completion and I get 'Conversion ended successfully' at the end of the pepper module.
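Pulling the tagged tokens out of the XML for the parser can be boiled down to something like the sketch below. It assumes token lines are tab-separated word/POS/lemma triples and markup lines start with '<' (an assumption about the _build XML format, not the actual pipeline code):

```python
def read_tagged_tokens(xml_path):
    """Extract (word, POS) pairs from a GUM _build-style XML file, assuming
    tab-separated word/POS/lemma token lines interleaved with markup lines
    that start with '<'."""
    pairs = []
    with open(xml_path, encoding="utf8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("<"):
                continue  # skip SGML/XML markup lines
            fields = line.split("\t")
            if len(fields) >= 2:
                pairs.append((fields[0], fields[1]))
    return pairs
```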

It ran to completion using Python 3.7 and the latest torch (1.7.0) and transformers (4.0.0). I could only test this on the CPU version of PyTorch because my GPU (8GB RAM) ran out of memory. There is also a Cython (0.29) dependency which needs an extra one-off setup step, for which there is now a setup script. I've attached the full environment.yml that was used to run this on Ubuntu. Next step is to also test on Windows/Mac.
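For reference, the pinned versions above would correspond to an environment.yml roughly along these lines (a sketch only; the actual file is in the attached environment.zip, and the name and channel layout here are assumptions):

```yaml
# Sketch of the environment described above; see the attached
# environment.yml for the real, complete specification.
name: gum-const
channels:
  - defaults
dependencies:
  - python=3.7
  - cython=0.29
  - pip
  - pip:
      - torch==1.7.0
      - transformers==4.0.0
```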

Should I check in the new .ptb files in the folder, or just the code changes? environment.zip

amir-zeldes commented 3 years ago

Very cool, thanks for this great work! A few thoughts on this:

If you score this new parser on the set above, could you also run the scorer against the current cached .ptb files? Then we would have an idea of the level of improvement over CoreNLP.

amir-zeldes commented 3 years ago

One caveat I just remembered: the gold trees in the sample linked above are from a slightly older version of the corpus, in which the preposition "to" was still tagged as TO (old PTB guidelines). Now they are tagged as IN, and TO is only used for the infinitive "to". You can either substitute the POS tags in the gold trees using current tags from the GUM repo (based on token indices), or else you can score the trees while ignoring POS tags (esp. since we assume gold tag input, so accuracy on tags is irrelevant).
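The first option (substituting current POS tags into the gold trees by token index) amounts to rewriting each leaf's tag in order; a minimal sketch, assuming leaves look like `(TAG word)` and the replacement tags are supplied in token order:

```python
import re

# matches a PTB leaf of the form (TAG word); words containing parentheses
# or spaces would need extra handling (PTB escapes these as -LRB-/-RRB-)
LEAF = re.compile(r"\((\S+) ([^()\s]+)\)")

def retag(tree: str, new_tags: list) -> str:
    """Replace each leaf's POS tag with the tag at the same token index."""
    it = iter(new_tags)
    return LEAF.sub(lambda m: f"({next(it)} {m.group(2)})", tree)

old = "(S (NP (PRP I)) (VP (VBD went) (PP (TO to) (NP (NN town)))))"
print(retag(old, ["PRP", "VBD", "IN", "NN"]))
# → (S (NP (PRP I)) (VP (VBD went) (PP (IN to) (NP (NN town)))))
```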

nitinvwaran commented 3 years ago

OK, I think this might be a popular tool for evaluating constituent trees; I'll give it a go: https://nlp.cs.nyu.edu/evalb/
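EVALB scores labeled bracketing precision/recall/F1. As a rough illustration of the core metric (not EVALB itself, which also handles parameter files, deleted labels, length cutoffs, etc.), the computation amounts to:

```python
from collections import Counter

def spans(tree):
    """Collect labeled (label, start, end) spans from a bracketed tree,
    skipping POS preterminals (EVALB likewise scores phrase brackets,
    not tags)."""
    toks = tree.replace("(", " ( ").replace(")", " ) ").split()
    stack, out, i, k = [], Counter(), 0, 0
    while k < len(toks):
        if toks[k] == "(":
            stack.append([toks[k + 1], i, False])  # label, start, saw_word
            k += 2
        elif toks[k] == ")":
            label, start, saw_word = stack.pop()
            if not saw_word:                       # skip preterminal nodes
                out[(label, start, i)] += 1
            k += 1
        else:                                      # a terminal token
            i += 1
            stack[-1][2] = True
            k += 1
    return out

def bracket_f1(gold, pred):
    g, p = spans(gold), spans(pred)
    match = sum((g & p).values())                  # multiset intersection
    return 2 * match / (sum(g.values()) + sum(p.values()))
```

For example, `bracket_f1("(S (NP (PRP I)) (VP (VBD ran)))", "(S (NP (PRP I) (VBD ran)))")` credits only the matching S bracket.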

As an incidental note, would it be OK if I removed everything in the folder below from the repository? Something like:

git rm -r --cached _build/utils/pepper/tmp/

I'm guessing these are copies of the various annotation formats used by the pepper merge process. Because they are already tracked, I need to keep reverting changes to them after the build process (which is slightly inconvenient), and I think .gitignore won't ignore them until they are removed.

amir-zeldes commented 3 years ago

Yes, I went ahead and removed them in dev, thanks for pointing it out. I would leave the committed .ptb files in _build/target/const/, which makes it possible to rebuild the corpus even without the requirements for the constituent parsing (which is also slow to run, and not everyone needs it).

nitinvwaran commented 3 years ago

A few updates:

If this looks OK, I'll raise a PR to merge.

amir-zeldes commented 3 years ago

OK, I've had a chance to look at this now, here are some thoughts:

amir-zeldes commented 3 years ago

But looking at it qualitatively, the parser is quite good, except maybe on difficult constructions. For example, here's the LAL prediction:

[image: LAL predicted parse tree]

amir-zeldes commented 3 years ago

And here's gold:

[image: gold parse tree]

nitinvwaran commented 3 years ago

I took a closer look at EVALB's README file; there are a few things I didn't clarify in my earlier comment:

[screenshots: crossing_error, crossing_error_2]

nitinvwaran commented 3 years ago

Just corrected the lower quote glyph error. This was caused by the parser, which maintained a mapping that converted " to the lower glyph, for some reason. I removed the mapping from the parser, reran the GUM pipeline to rebuild the trees, and redid the EVALB evaluation. The evaluation results after this change and run are identical to the previous ones; I've attached them to this comment for an audit. gumsample_2.zip
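The actual fix was removing the mapping inside the parser, but an equivalent post-hoc normalization over the output trees could look like the sketch below (the specific characters involved are assumptions for illustration):

```python
# Undo a parser-side remapping of straight quotes to "lower" quotation
# marks; the exact glyphs here are illustrative assumptions.
QUOTE_FIX = {
    "\u201e": '"',  # „ DOUBLE LOW-9 QUOTATION MARK -> straight double quote
    "\u201a": "'",  # ‚ SINGLE LOW-9 QUOTATION MARK -> straight single quote
}

def restore_quotes(tree: str) -> str:
    """Map low quotation glyphs back to straight ASCII quotes."""
    for low, straight in QUOTE_FIX.items():
        tree = tree.replace(low, straight)
    return tree
```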

Also added a new Python script that checks the tokens between the CoreNLP parse and the LAL parse. An example invocation:

gum/_build/utils$ python checkptbtrees.py -o <path to directory with CoreNLP trees> -n <path to dir with new LAL trees; default is ../target/const/>

The token parity test was successful, suggesting the only issue was the lower quote glyph.
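The parity check itself boils down to comparing the terminal yields of the two tree sets; a minimal version (a sketch, not the actual checkptbtrees.py) might be:

```python
import re

def leaves(tree):
    """Terminal tokens of a bracketed PTB tree, in surface order."""
    return [m.group(1) for m in re.finditer(r"\(\S+ ([^()\s]+)\)", tree)]

def tokens_match(old_tree, new_tree):
    """True if the two parses cover the identical token sequence."""
    return leaves(old_tree) == leaves(new_tree)
```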

amir-zeldes commented 3 years ago

OK, fully tested, works like a charm, thanks!