I can take a look at this (I found the Windows dependency). Starting by searching for a SOTA parser.
Thanks @nitinvwaran , another possible avenue if you're researching this is dependency to constituent conversion, since we have gold dependency parses. I'm not sure what would be more accurate ATM: SOTA constituent parses from gold tags (but the parser is then still trained on different domains than our corpus), or conversions (which are probably not 100% correct, but I don't know current numbers)
The current SOTA in constituent parsing is listed and collated here by Sebastian Ruder et al., ranked by F1 score.
In the list, I focused on self-attention-based architectures only.
I could find very little for dependency-to-constituent conversion. The latest paper I could find is here, with no codebase support. Another relatively recent paper is here, with codebase here. The former paper only treats conversion from UD to constituent trees. The best F1 reached by this paper is 95.6 on the Penn Treebank using Stanford Dependencies, and 90.48 using UD on EWT.
Based on these findings, would you recommend any next steps, @amir-zeldes? (Currently thinking of starting with constituent parsers that take POS inputs.)
Thanks for putting together this overview! Assuming the top parser outputs standard PTB constituent trees (I see the best one also uses HPSG inspired representations internally, but not as output it seems?), then it would be a good starting place indeed, and seems much better than the Stanford Parser we're using right now. Removing non-Python dependencies is also a plus, and not adding Spacy as a preprocessor is maybe good, since it has large models. Does the best parser have a way of fetching its XLNet embeddings by itself?
From a format perspective, the output should be serialized here and look like the files in there to be compatible with the rest of the pipeline.
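For reference, a minimal sketch of what a serialized PTB-style bracketed tree looks like; the exact root label and file-naming conventions should be copied from the existing files in that directory, not from this example:

```python
# Hypothetical example: write one bracketed tree per line in PTB style.
from nltk import Tree

tree = Tree.fromstring("(ROOT (S (NP (DT The) (NN pipeline)) (VP (VBZ works))))")
with open("example_document.ptb", "w", encoding="utf8") as f:  # file name is illustrative
    f.write(tree.pformat(margin=60) + "\n")
```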
Took a closer look at the code for the best model; the XLNet model used is initialized and loaded directly from the transformers library. The HPSG span representations are decoded in memory, but the spans are immediately converted into constituent trees, so the HPSG spans aren't saved down. I think it's PTB trees being saved down; will need to debug and check.
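As an illustration of that point, the XLNet backbone can be fetched directly through the transformers library; this is a minimal sketch, not the LAL parser's actual loading code:

```python
from transformers import XLNetModel, XLNetTokenizer

# Downloads the pretrained weights on first use; no separate embedding files needed.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetModel.from_pretrained("xlnet-large-cased")

inputs = tokenizer("The parser reads gold POS tags .", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual token embeddings
```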
Sounds perfect, thanks!
Have been able to generate constituent parses using the Label Attention + HPSG + XLNet model, using the PTB POS tags from the XML as input. Also tested the build bot; the pipeline runs to completion and I get the 'Conversion ended successfully' at the end of the pepper module.
It ran to completion using Python 3.7 and the latest torch (1.7.0) and transformers (4.0.0). I could only test this on the PyTorch CPU version because my GPU (8 GB RAM) ran out of memory. There is also a cython (0.29) dependency which needs an extra one-off setup step, for which there is now a setup script. I've attached the full environment.yml that was used to run this on Ubuntu. Need to also test on Windows / Mac as the next step.
Should I check in the new .ptb files in the folder, or just the code changes? environment.zip
Very cool, thanks for this great work! A few thoughts on this:
- Regarding checking in the .ptb files: yes, go ahead; there is a -p option, which works fine using the cached parses, so I think that's a good plan.
- If you score this new parser on the set above, could you also run the scorer against the current cached .ptb files? Then we would have an idea of the level of improvement over CoreNLP.
One caveat I just remembered: the gold trees in the sample linked above are from a slightly older version of the corpus, in which the preposition "to" was still tagged as TO (old PTB guidelines). Now they are tagged as IN, and TO is only used for the infinitive "to". You can either substitute the POS tags in the gold trees using current tags from the GUM repo (based on token indices), or else you can score the trees while ignoring POS tags (esp. since we assume gold tag input, so accuracy on tags is irrelevant).
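If the substitution route is taken, a helper along these lines would do it; this is only a sketch (the function name and the source of the current tags are hypothetical), assuming tags are matched to gold preterminals by token index:

```python
from nltk import Tree

def retag(gold_tree_str, current_tags):
    """Replace the POS tags in a gold bracketed tree with current GUM tags,
    matched by token index (left-to-right order of the preterminals)."""
    tree = Tree.fromstring(gold_tree_str)
    preterminals = list(tree.subtrees(lambda t: t.height() == 2))
    assert len(preterminals) == len(current_tags), "token count mismatch"
    for node, new_tag in zip(preterminals, current_tags):
        node.set_label(new_tag)  # e.g. prepositional "to": old TO -> current IN
    return tree.pformat(margin=100000)  # keep the tree on one line

# Usage sketch:
print(retag('(ROOT (S (PP (TO to) (NP (NNP Rome)))))', ["IN", "NNP"]))
```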
OK, I think this might be a popular tool for evaluating constituent trees; I'll give it a go: https://nlp.cs.nyu.edu/evalb/
As an incidental note, would it be OK if I removed everything in the folder below from the repository? Something like
git rm -r --cached _build/utils/pepper/tmp/
I'm guessing these are copies of the various annotation formats used by the pepper merge process. Because they are already tracked, I need to keep reverting changes to these after the build process (which is slightly inconvenient), and I think .gitignore won't ignore them until they are removed.
Yes, I went ahead and removed them in dev, thanks for pointing it out. I would leave the committed .ptb files in _build/target/const/, which makes it possible to rebuild the corpus even without the requirements for the constituent parsing (which is also slow to run, and not everyone needs it).
A few updates:
Tested the pipeline with the new constituency parser on macOS and Windows 10, successfully, including Reddit. Checked in the new .ptb files in target/const. I added a new README.md in the _build folder with instructions on how to set up a conda environment and compile the cython dependencies to run the parser.
The original model 'best_parser.pt' is downloaded from the author's Google Drive location. Would this need to be saved to the on-premise file server, maybe the one currently used for amalgum?
Ran the EVALB tool, comparing the gold parses against the CoreNLP parses and the new parses (a.k.a. the 'LAL parser'). I didn't correct the tags in the gold parses, and I didn't include the Reddit gold parse because I couldn't retrieve the tokens. The F-measure is based on counting the number of correct constituents in the parse, which I think is agnostic of the label? There is also a separate measure, 'label accuracy', which happens to be identical between the old and new parser (98.97%). In summary, the F-measure for all sentences is 88.37% (LAL) vs. 85.56% (CoreNLP). The LAL parser produces far fewer overlapping / crossing constituents and also has higher precision than the CoreNLP parser. Interestingly, though, the CoreNLP parser produces more 'Complete Matches' with perfect precision and recall.
Have attached the evaluation results and raw data to this comment. The LAL results / test data are in 'lal.rslt' / 'lal.tst', and the CoreNLP results / data are in 'stanford.rslt' / 'stanford.tst'. Some parses errored during evaluation, but the errors were exactly consistent between the LAL and CoreNLP parses (see 'evaluation_errors.txt'). gumsample.zip
If this looks OK to go, I'll raise a PR to merge.
OK, I've had a chance to look at this now, here are some thoughts:
- The new parses don't include functional tags such as NP-TMP for temporally used NPs (see sent 2 of bio_emperor for example).
- I don't follow some entries in the error list: an entry like "135 : Length unmatch (8|7)" would mean that sent 135 in the input has 8 tokens in gold but 7 tokens in GUM input; but sent 135 is longer than that. Can you tell me the sentence triggering one of the errors? Then I can look into the reason some more.
- The new parses contain the low double quote glyph „ where gold has straight double quotes. Is this something that happens in the parser, or elsewhere in the pipeline?

But looking at it qualitatively, the parser is quite good, except maybe on difficult constructions; for example, here's the LAL pred:
And here's gold:
I took a closer look at EVALB's README file; there are a few things I didn't clarify in my earlier comment:
I missed this earlier, but it seems that the precision / recall figures also take the label of the constituent into account. There is a flag in the default parameter file of the tool, which is turned on by default, and if turned on it means:
To give labelled precision/recall figures, i.e. a constituent must have the same span and label as a constituent in the goldfile.
The new parser does not generate the TOP root node, nor does it generate a ROOT root node by default. I added a post-processing step in the build pipeline to add a new ROOT root node. For evaluation purposes, I then changed ROOT to TOP for all 3 parses: gold, LAL, and CoreNLP.
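For illustration, the post-processing amounts to something like the following (a sketch with hypothetical helper names, not the exact code in the build pipeline):

```python
import re

def add_root(tree_str):
    # The parser output has no root label, so wrap the whole tree in ROOT.
    return "(ROOT " + tree_str.strip() + ")"

def root_to_top(tree_str):
    # Relabel only the outermost node for EVALB-style comparison.
    return re.sub(r"^\(ROOT\b", "(TOP", tree_str.strip(), count=1)

parse = "(S (NP (PRP It)) (VP (VBZ works)))"
print(root_to_top(add_root(parse)))  # (TOP (S (NP (PRP It)) (VP (VBZ works))))
```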
The label accuracy excludes some specified tags from the counts; the tags used for this evaluation are listed below. In addition, as TOP is a non-terminal label, the brackets for TOP are removed while its child nodes are kept, as part of the evaluation pre-processing step.
However, unlike TOP, as you mention there would probably be changes from the non-usage of functional tags in the new parses. The scoring treatment from the README below might also mask the extent of the changes:
The scorer also removes all functional tags attached to non-terminals (functional tags are prefixed with "-" or "=" in the treebank). For example "NP-SBJ" is processed to give "NP", "NP=2" is changed to "NP".
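Outside the tool, that normalization can be approximated with a simple substitution; this is just a rough sketch of the behaviour the README describes, not EVALB's own implementation:

```python
import re

def strip_function_tags(tree_str):
    # Turn non-terminal labels like NP-SBJ, NP-TMP or NP=2 into bare NP.
    # Labels such as -LRB- or -NONE- start with "-" and are left alone,
    # since the pattern requires the label to begin with a word character.
    return re.sub(r"\((\w+)[-=][^\s()]+", r"(\1", tree_str)

print(strip_function_tags("(S (NP-SBJ (PRP I)) (VP (VBD left) (NP-TMP (NN today))))"))
# (S (NP (PRP I)) (VP (VBD left) (NP (NN today))))
```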
'Crossing constituents' seems to be an error metric computed by comparing the candidate parse with the gold parse. I found an explanation of crossing constituents in some lecture slides; according to the slides, there are two flavours of this error.
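Whatever the exact breakdown in the slides, the underlying notion is standard: a predicted constituent crosses a gold constituent when the two spans overlap but neither contains the other. A minimal sketch of counting such spans (the function and span values are hypothetical):

```python
def crossing_count(pred_spans, gold_spans):
    """Count predicted (start, end) spans that cross some gold span:
    the spans overlap, but neither one contains the other."""
    def crosses(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]
    return sum(any(crosses(p, g) for g in gold_spans) for p in pred_spans)

# (1, 4) and (2, 6) overlap on tokens 2-3, but neither contains the other.
print(crossing_count([(1, 4)], [(2, 6)]))  # 1
```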
The sentence indices in the error list are correct; the evaluation pre-processing excludes tokens whose POS tags are in the list above, which is reflected in the sentence length. Given that, the remaining length differences seem to be due to a tag in the parses that differs from the gold output (the differing tag then being deleted); since the errors are consistent, the same wrong tag may be used in both the CoreNLP and LAL parses. The other error type is due to the tokenization of 'cannot' in interview_peres.xml: it is tokenized as two tokens, 'can not', in the XML, and as one word, 'cannot', in the gold parse. I've attached some more detail on the errors: errors_detail.docx
I will follow up with a script that compares the tokens between the CoreNLP and LAL parses, and check the reason for the quotation mark glyph change.
Just corrected the low quote glyph error. This was done by the parser, which maintains a mapping that converted the straight " to the lower glyph, for some reason. I removed the mapping in the parser, then reran the GUM pipeline to rebuild the trees, and redid the EVALB evaluation. The evaluation results after this change and run are identical to the previous ones; have attached them in this comment for an audit. gumsample_2.zip
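For clarity, the glyph in question is U+201E („). The actual fix was removing the parser's internal mapping; the equivalent normalization, shown only as an illustration:

```python
def normalize_low_quotes(tree_str):
    # Map the low double quote glyph back to a straight double quote.
    return tree_str.replace("\u201e", '"')

print(normalize_low_quotes("(NP (`` „) (NN test) ('' „))"))
# (NP (`` ") (NN test) ('' "))
```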
Also added a new Python script that checks the tokens between the CoreNLP parse and the LAL parse. An example invocation is:
gum/_build/utils$ python checkptbtrees.py -o <<path to directory with CoreNLP trees>> -n <<path to dir with new LAL trees, default is ../target/const/>>
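The core of such a comparison might look like the sketch below (illustrative only; checkptbtrees.py itself may be structured differently, and the layout is assumed to be one .ptb file per document with matching names in both directories):

```python
import os
import re
from glob import glob

def leaves(tree_str):
    # Terminals are the second element of each (TAG token) pair.
    return [tok for _, tok in re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree_str)]

def compare_dirs(old_dir, new_dir):
    for old_path in sorted(glob(os.path.join(old_dir, "*.ptb"))):
        name = os.path.basename(old_path)
        old_toks = leaves(open(old_path, encoding="utf8").read())
        new_toks = leaves(open(os.path.join(new_dir, name), encoding="utf8").read())
        if len(old_toks) != len(new_toks):
            print(f"{name}: {len(old_toks)} vs {len(new_toks)} tokens")
        for i, (o, n) in enumerate(zip(old_toks, new_toks)):
            if o != n:
                print(f"{name}: token {i}: {o!r} != {n!r}")
```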
The token parity test was successful, suggesting the only issue was the lower quote glyph.
OK, fully tested, works like a charm, thanks!
Replace CoreNLP with SOTA neural constituent parser.
(lexparser_eng_const_plus.bat)