I can take a look at this (I found the Windows dependency). Starting by searching for a SOTA parser.
Thanks @nitinvwaran , another possible avenue if you're researching this is dependency to constituent conversion, since we have gold dependency parses. I'm not sure what would be more accurate ATM: SOTA constituent parses from gold tags (but the parser is then still trained on different domains than our corpus), or conversions (which are probably not 100% correct, but I don't know current numbers)
The current SOTA in constituent parsing is listed and collated here by Sebastian Ruder et al., ranked by F1 score.
In the list, I focused on self-attention-based architectures only.
I could find very little for dependency-to-constituent conversion. The latest paper I could find is here, with no codebase support. Another relatively recent paper is here, with codebase here. The former paper only treats conversion from UD to constituent trees. The best F1 reached by this paper is 95.6 on the Penn Treebank using Stanford Dependencies, and 90.48 using UD on EWT.
Based on these findings, would you recommend any next steps, @amir-zeldes? (Currently thinking of starting with constituent parsers that take POS inputs.)
Thanks for putting together this overview! Assuming the top parser outputs standard PTB constituent trees (I see the best one also uses HPSG inspired representations internally, but not as output it seems?), then it would be a good starting place indeed, and seems much better than the Stanford Parser we're using right now. Removing non-Python dependencies is also a plus, and not adding Spacy as a preprocessor is maybe good, since it has large models. Does the best parser have a way of fetching its XLNet embeddings by itself?
From a format perspective, the output should be serialized here and look like the files in there to be compatible with the rest of the pipeline.
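For reference, a minimal sketch of what a serialized PTB-style bracketed tree looks like; the exact root label and file-naming conventions should be copied from the existing files in that directory, not from this example:

```python
# Hypothetical example: write one bracketed tree per line in PTB style.
from nltk import Tree

tree = Tree.fromstring("(ROOT (S (NP (DT The) (NN pipeline)) (VP (VBZ works))))")
with open("example_document.ptb", "w", encoding="utf8") as f:  # file name is illustrative
    f.write(tree.pformat(margin=60) + "\n")
```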
Took a closer look at the code for the best model; the XLNet model used is initialized and loaded directly from the transformers library. The HPSG span representations are decoded in memory, but the spans are immediately converted into constituent trees, so the HPSG spans aren't saved down. I think it's PTB trees being saved down; will need to debug and check.
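As an illustration of that point, the XLNet backbone can be fetched directly through the transformers library; this is a minimal sketch, not the LAL parser's actual loading code:

```python
from transformers import XLNetModel, XLNetTokenizer

# Downloads the pretrained weights on first use; no separate embedding files needed.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetModel.from_pretrained("xlnet-large-cased")

inputs = tokenizer("The parser reads gold POS tags .", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual token embeddings
```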
Sounds perfect, thanks!
Have been able to generate constituent parses using the Label Attention + HPSG + XLNet model, using the PTB POS tags from the XML as input. Also tested the build bot; the pipeline runs to completion and I get the 'Conversion ended successfully' at the end of the pepper module.
It ran to completion using Python 3.7 and the latest torch (1.7.0) and transformers (4.0.0). I could only test this on the PyTorch CPU version because my GPU (8 GB RAM) ran out of memory. There is also a cython (0.29) dependency which needs an extra one-off setup step, for which there is now a setup script. I've attached the full environment.yml that was used to run this on Ubuntu. Need to also test on Windows / Mac as the next step.
Should I check in the new .ptb files in the folder, or just the code changes? environment.zip
Very cool, thanks for this great work! A few thoughts on this:
- Regarding checking in the .ptb files: yes, go ahead; there is a -p option, which works fine using the cached parses, so I think that's a good plan.
- If you score this new parser on the set above, could you also run the scorer against the current cached .ptb files? Then we would have an idea of the level of improvement over CoreNLP.
One caveat I just remembered: the gold trees in the sample linked above are from a slightly older version of the corpus, in which the preposition "to" was still tagged as TO (old PTB guidelines). Now they are tagged as IN, and TO is only used for the infinitive "to". You can either substitute the POS tags in the gold trees using current tags from the GUM repo (based on token indices), or else you can score the trees while ignoring POS tags (esp. since we assume gold tag input, so accuracy on tags is irrelevant).
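If the substitution route is taken, a helper along these lines would do it; this is only a sketch (the function name and the source of the current tags are hypothetical), assuming tags are matched to gold preterminals by token index:

```python
from nltk import Tree

def retag(gold_tree_str, current_tags):
    """Replace the POS tags in a gold bracketed tree with current GUM tags,
    matched by token index (left-to-right order of the preterminals)."""
    tree = Tree.fromstring(gold_tree_str)
    preterminals = list(tree.subtrees(lambda t: t.height() == 2))
    assert len(preterminals) == len(current_tags), "token count mismatch"
    for node, new_tag in zip(preterminals, current_tags):
        node.set_label(new_tag)  # e.g. prepositional "to": old TO -> current IN
    return tree.pformat(margin=100000)  # keep the tree on one line

# Usage sketch:
print(retag('(ROOT (S (PP (TO to) (NP (NNP Rome)))))', ["IN", "NNP"]))
```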
OK, I think this might be a popular tool for evaluating constituent trees; I'll give it a go: https://nlp.cs.nyu.edu/evalb/
As an incidental note, would it be OK if I removed everything in the folder below from the repository? Something like
git rm -r --cached _build/utils/pepper/tmp/
I'm guessing these are copies of the various annotation formats used by the pepper merge process. Because they are already tracked, I need to keep reverting changes to these after the build process (which is slightly inconvenient), and I think .gitignore won't ignore them until they are removed.
Yes, I went ahead and removed them in dev, thanks for pointing it out. I would leave the committed .ptb files in _build/target/const/, which makes it possible to rebuild the corpus even without the requirements for the constituent parsing (which is also slow to run, and not everyone needs it).
A few updates:
Tested the pipeline with the new constituency parser on macOS and Windows 10, successfully, including Reddit. Checked in the new .ptb files in target/const. I added a new README.md in the _build folder with instructions on how to set up a conda environment and compile the cython dependencies to run the parser.
The original model 'best_parser.pt' is downloaded from the author's Google Drive location. Would this need to be saved to the on-premise file server, maybe the one currently used for amalgum?
Ran the EVALB tool, comparing the gold parses against the CoreNLP parses and the new parses (a.k.a. the 'LAL parser'). I didn't correct the tags in the gold parses, and I didn't include the Reddit gold parse because I couldn't retrieve the tokens. The F-measure is based on counting the number of correct constituents in the parse, which I think is agnostic of the label? There is also a separate measure, 'label accuracy', which happens to be identical between the old and new parser (98.97%). In summary, the F-measure for all sentences is 88.37% (LAL) vs. 85.56% (CoreNLP). The LAL parser produces far fewer overlapping / crossing constituents and also has higher precision than the CoreNLP parser. Interestingly, though, the CoreNLP parser produces more 'Complete Matches' with perfect precision and recall.
Have attached the evaluation results and raw data to this comment. The LAL results / test data are in 'lal.rslt' / 'lal.tst', and the CoreNLP results / data are in 'stanford.rslt' / 'stanford.tst'. Some parses errored during evaluation, but the errors were exactly consistent between the LAL and CoreNLP parses (see 'evaluation_errors.txt'). gumsample.zip
If this looks OK to go, I'll raise a PR to merge.
OK, I've had a chance to look at this now, here are some thoughts:
- The new parses don't include functional tags such as NP-TMP for temporally used NPs (see sent 2 of bio_emperor for example).
- I don't follow some entries in the error list: an entry like "135 : Length unmatch (8|7)" would mean that sent 135 in the input has 8 tokens in gold but 7 tokens in GUM input; but sent 135 is longer than that. Can you tell me the sentence triggering one of the errors? Then I can look into the reason some more.
- The new parses contain the low double quote glyph „ where gold has straight double quotes. Is this something that happens in the parser, or elsewhere in the pipeline?

But looking at it qualitatively, the parser is quite good, except maybe on difficult constructions; for example, here's the LAL pred:
And here's gold:
I took a closer look at EVALB's README file; there are a few things I didn't clarify in my earlier comment:
I missed this earlier, but it seems that the precision / recall figures also take the label of the constituent into account. There is a flag in the default parameter file of the tool, which is turned on by default, and if turned on it means:
To give labelled precision/recall figures, i.e. a constituent must have the same span and label as a constituent in the goldfile.
The new parser does not generate the TOP root node, nor does it generate a ROOT root node by default. I added a post-processing step in the build pipeline to add a new ROOT root node. For evaluation purposes, I then changed ROOT to TOP for all 3 parses: gold, LAL, and CoreNLP.
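For illustration, the post-processing amounts to something like the following (a sketch with hypothetical helper names, not the exact code in the build pipeline):

```python
import re

def add_root(tree_str):
    # The parser output has no root label, so wrap the whole tree in ROOT.
    return "(ROOT " + tree_str.strip() + ")"

def root_to_top(tree_str):
    # Relabel only the outermost node for EVALB-style comparison.
    return re.sub(r"^\(ROOT\b", "(TOP", tree_str.strip(), count=1)

parse = "(S (NP (PRP It)) (VP (VBZ works)))"
print(root_to_top(add_root(parse)))  # (TOP (S (NP (PRP It)) (VP (VBZ works))))
```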
The label accuracy excludes some specified tags from the counts; the tags used for this evaluation are listed below. In addition, as TOP is a non-terminal label, the brackets for TOP are removed while its child nodes are kept, as part of the evaluation pre-processing step.
However, unlike TOP, as you mention there would probably be changes from the non-usage of functional tags in the new parses. The scoring treatment from the README below might also mask the extent of the changes:
The scorer also removes all functional tags attached to non-terminals (functional tags are prefixed with "-" or "=" in the treebank). For example "NP-SBJ" is processed to give "NP", "NP=2" is changed to "NP".
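Outside the tool, that normalization can be approximated with a simple substitution; this is just a rough sketch of the behaviour the README describes, not EVALB's own implementation:

```python
import re

def strip_function_tags(tree_str):
    # Turn non-terminal labels like NP-SBJ, NP-TMP or NP=2 into bare NP.
    # Labels such as -LRB- or -NONE- start with "-" and are left alone,
    # since the pattern requires the label to begin with a word character.
    return re.sub(r"\((\w+)[-=][^\s()]+", r"(\1", tree_str)

print(strip_function_tags("(S (NP-SBJ (PRP I)) (VP (VBD left) (NP-TMP (NN today))))"))
# (S (NP (PRP I)) (VP (VBD left) (NP (NN today))))
```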
'Crossing constituents' seems to be an error metric computed by comparing the candidate parse with the gold parse. I found an explanation of crossing constituents in some lecture slides; according to the slides, there are two flavours of this error.
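Whatever the exact breakdown in the slides, the underlying notion is standard: a predicted constituent crosses a gold constituent when the two spans overlap but neither contains the other. A minimal sketch of counting such spans (the function and span values are hypothetical):

```python
def crossing_count(pred_spans, gold_spans):
    """Count predicted (start, end) spans that cross some gold span:
    the spans overlap, but neither one contains the other."""
    def crosses(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]
    return sum(any(crosses(p, g) for g in gold_spans) for p in pred_spans)

# (1, 4) and (2, 6) overlap on tokens 2-3, but neither contains the other.
print(crossing_count([(1, 4)], [(2, 6)]))  # 1
```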
The sentence indices in the error list are correct; the evaluation pre-processing excludes tokens whose POS tags are in the list above, which is reflected in the sentence length. Given that, the remaining length differences seem to be due to a tag in the parses that differs from the gold output (the differing tag then being deleted); since the errors are consistent, the same wrong tag may be used in both the CoreNLP and LAL parses. The other error type is due to the tokenization of 'cannot' in interview_peres.xml: it is tokenized as two tokens, 'can not', in the XML, and as one word, 'cannot', in the gold parse. I've attached some more detail on the errors: errors_detail.docx
I will follow up with a script that compares the tokens between the CoreNLP and LAL parses, and check the reason for the quotation mark glyph change.
Just corrected the low quote glyph error. This was done by the parser, which maintains a mapping that converted the straight " to the lower glyph, for some reason. I removed the mapping in the parser, then reran the GUM pipeline to rebuild the trees, and redid the EVALB evaluation. The evaluation results after this change and run are identical to the previous ones; have attached them in this comment for an audit. gumsample_2.zip
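For clarity, the glyph in question is U+201E („). The actual fix was removing the parser's internal mapping; the equivalent normalization, shown only as an illustration:

```python
def normalize_low_quotes(tree_str):
    # Map the low double quote glyph back to a straight double quote.
    return tree_str.replace("\u201e", '"')

print(normalize_low_quotes("(NP (`` „) (NN test) ('' „))"))
# (NP (`` ") (NN test) ('' "))
```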
Also added a new Python script that checks the tokens between the CoreNLP parse and the LAL parse. An example invocation is:
gum/_build/utils$ python checkptbtrees.py -o <<path to directory with CoreNLP trees>> -n <<path to dir with new LAL trees, default is ../target/const/>>
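The core of such a comparison might look like the sketch below (illustrative only; checkptbtrees.py itself may be structured differently, and the layout is assumed to be one .ptb file per document with matching names in both directories):

```python
import os
import re
from glob import glob

def leaves(tree_str):
    # Terminals are the second element of each (TAG token) pair.
    return [tok for _, tok in re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree_str)]

def compare_dirs(old_dir, new_dir):
    for old_path in sorted(glob(os.path.join(old_dir, "*.ptb"))):
        name = os.path.basename(old_path)
        old_toks = leaves(open(old_path, encoding="utf8").read())
        new_toks = leaves(open(os.path.join(new_dir, name), encoding="utf8").read())
        if len(old_toks) != len(new_toks):
            print(f"{name}: {len(old_toks)} vs {len(new_toks)} tokens")
        for i, (o, n) in enumerate(zip(old_toks, new_toks)):
            if o != n:
                print(f"{name}: token {i}: {o!r} != {n!r}")
```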
The token parity test was successful, suggesting the only issue was the lower quote glyph.
OK, fully tested, works like a charm, thanks!
Replace CoreNLP with SOTA neural constituent parser.
(lexparser_eng_const_plus.bat)