DoodleJZ / HPSG-Neural-Parser

Source code for "Head-Driven Phrase Structure Grammar Parsing on Penn Treebank" published at ACL 2019
https://arxiv.org/abs/1907.02684
MIT License

HPSG Neural Parser

This is a Python implementation of the parsers described in "Head-Driven Phrase Structure Grammar Parsing on Penn Treebank" from ACL 2019.

Contents

  1. Requirements
  2. Training
  3. Citation
  4. Credits

Requirements

Pre-trained Models (PyTorch)

The following pre-trained parser models are available for download:

The pre-trained model with GloVe embeddings achieves an F-score of 93.78 on constituent parsing, and 96.09 UAS and 94.68 LAS on dependency parsing, on the test set.

The pre-trained model with BERT achieves an F-score of 95.84 on constituent parsing, and 97.00 UAS and 95.43 LAS on dependency parsing, on the test set.

The pre-trained model with XLNet achieves an F-score of 96.33 on constituent parsing, and 97.20 UAS and 95.72 LAS on dependency parsing, on the test set.
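
For reference, UAS is the percentage of words assigned the correct head, and LAS is the percentage assigned both the correct head and the correct dependency label. The following is a minimal sketch of those definitions. The function name and input layout are illustrative and do not come from this repository, and the sketch ignores the punctuation handling that evaluation scripts often apply.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold and pred are lists of (head, label) pairs, one per word,
    aligned by position. Hypothetical helper, not part of this repository.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0, 0.0
    # UAS counts words whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    label_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * label_hits / total
```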

To use ELMo embeddings, download the following files into the data/ folder (preserving their names):

There is currently no command-line option for configuring the locations/names of the ELMo files.

Pre-trained BERT and XLNet weights will be automatically downloaded as needed by the pytorch-transformers package.

Training

Download the three PTB data files from https://github.com/nikitakit/self-attentive-parser/tree/master/data and place them in the data/ folder. The dependency structures are mainly obtained by converting the constituent structures with version 3.3.0 of the Stanford Parser, run from the data/ folder:

java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 02-21.10way.clean > ptb_train_3.3.0.sd
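
The -conllx option writes the dependencies in CoNLL-X format: one token per line with ten tab-separated columns, and a blank line between sentences. To inspect the converted file, a reader along the following lines may help. This is a sketch that assumes the standard CoNLL-X column order; read_conllx is a hypothetical helper, not the loader used by this repository.

```python
def read_conllx(lines):
    """Yield one sentence at a time as a list of (form, head, deprel) tuples.

    Assumes the standard 10-column, tab-separated CoNLL-X layout with
    blank lines between sentences. Illustrative helper only.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        sentence.append((cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence
```

For example, `with open("data/ptb_train_3.3.0.sd") as f: sentences = list(read_conllx(f))` would load the file produced by the command above.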

For CTB, we use the same datasets and preprocessing as the Distance Parser. For PTB, we use the same datasets and preprocessing as the self-attentive-parser. GloVe embeddings are optional.

Training Instructions

Some of the available arguments are:

Argument | Description | Default
--model-path-base | Path base to use for saving models | N/A
--evalb-dir | Path to EVALB directory | EVALB/
--train-ptb-path | Path to constituent parsing training data | data/02-21.10way.clean
--dev-ptb-path | Path to constituent parsing development data | data/22.auto.clean
--dep-train-ptb-path | Path to dependency parsing training data | data/ptb_train_3.3.0.sd
--dep-dev-ptb-path | Path to dependency parsing development data | data/ptb_dev_3.3.0.sd
--batch-size | Number of examples per training update | 250
--checks-per-epoch | Number of development evaluations per epoch | 4
--subbatch-max-tokens | Maximum number of words to process in parallel while training (a full batch may not fit in GPU memory) | 2000
--eval-batch-size | Number of examples to process in parallel when evaluating on the development set | 30
--numpy-seed | NumPy random seed | Random
--use-words | Use learned word embeddings | Do not use word embeddings
--use-tags | Use predicted part-of-speech tags as input | Do not use predicted tags
--use-chars-lstm | Use learned CharLSTM word representations | Do not use CharLSTM
--use-elmo | Use pre-trained ELMo word representations | Do not use ELMo
--use-bert | Use pre-trained BERT word representations | Do not use BERT
--use-xlnet | Use pre-trained XLNet word representations | Do not use XLNet
--pad-left | Pad on the left when using pre-trained XLNet | Do not pad on left
--bert-model | Pre-trained BERT model to use if --use-bert is passed | bert-large-uncased
--no-bert-do-lower-case | Instructs the BERT tokenizer to retain case information (setting should match the BERT model in use) | Perform lowercasing
--xlnet-model | Pre-trained XLNet model to use if --use-xlnet is passed | xlnet-large-cased
--no-xlnet-do-lower-case | Instructs the XLNet tokenizer to retain case information (setting should match the XLNet model in use) | Perform lowercasing
--const-lada | Lambda weight | 0.5
--model-name | Name of model | test
--embedding-path | Path to pre-trained embedding | N/A
--embedding-type | Pre-trained embedding type | glove
--dataset | Dataset type | ptb
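
The --subbatch-max-tokens option exists because a full batch may not fit in GPU memory, so each batch is processed in smaller pieces. One simple way to form such pieces is sketched below. This is an assumption-laden illustration, not the repository's batching code: split_into_subbatches is a hypothetical name, and the greedy grouping is only one possible strategy.

```python
def split_into_subbatches(sentences, max_tokens):
    """Greedily group sentences so each group holds at most max_tokens words.

    A sentence longer than max_tokens is placed in a group of its own.
    Illustrative only; the repository's batching code may differ.
    """
    subbatches, current, current_tokens = [], [], 0
    for sent in sentences:
        n = len(sent)
        # Start a new group when adding this sentence would exceed the cap.
        if current and current_tokens + n > max_tokens:
            subbatches.append(current)
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += n
    if current:
        subbatches.append(current)
    return subbatches
```

Gradients from the groups of one batch can then be accumulated before a single parameter update, so the effective batch size stays at --batch-size.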

Additional arguments are available for other hyperparameters; see make_hparams() in src/main.py. These can be specified on the command line, such as --num-layers 2 (for numerical parameters), --use-tags (for boolean parameters that default to False), or --no-partitioned (for boolean parameters that default to True).
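
The flag convention described above can be illustrated with argparse. The sketch below generates one option per hyperparameter from a dictionary of defaults. It is not the code in src/main.py: add_hparam_flags and the example default values are hypothetical, chosen only to show how the three kinds of flags behave.

```python
import argparse


def add_hparam_flags(parser, defaults):
    """Add one command-line option per hyperparameter in `defaults`.

    Booleans defaulting to False get --name, booleans defaulting to True
    get --no-name, and other values get --name VALUE parsed with the
    type of the default. Illustrative only.
    """
    for name, default in defaults.items():
        flag = name.replace("_", "-")
        if isinstance(default, bool):
            if default:
                parser.add_argument("--no-" + flag, dest=name, action="store_false")
            else:
                parser.add_argument("--" + flag, dest=name, action="store_true")
        else:
            parser.add_argument("--" + flag, dest=name, type=type(default), default=default)
    return parser


# Hypothetical defaults, for illustration only.
example_defaults = {"num_layers": 8, "use_tags": False, "partitioned": True}
parser = add_hparam_flags(argparse.ArgumentParser(), example_defaults)
args = parser.parse_args(["--num-layers", "2", "--use-tags", "--no-partitioned"])
```

With these arguments, args.num_layers is 2, args.use_tags is True, and args.partitioned is False; omitting a flag leaves the corresponding default in place.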

At each development evaluation, best_dev_score is computed as the sum of the F-score and LAS on the development set and compared with the previous best. If the current model is better, the previous model is deleted and the current model is saved. The new filename is derived from the provided model path base and the development best_dev_score.
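
The checkpointing rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's training loop: maybe_save_checkpoint and save_fn are hypothetical names, and the filename pattern is an assumption about how the path base and score might be combined.

```python
import os


def maybe_save_checkpoint(model_path_base, fscore, las, best_score, best_path, save_fn):
    """Save the model if F-score + LAS beats the best score seen so far.

    Returns the (possibly updated) best score and checkpoint path.
    `save_fn(path)` writes the model; the filename pattern is an assumption.
    """
    dev_score = fscore + las
    if best_score is not None and dev_score <= best_score:
        # Not an improvement: keep the existing checkpoint.
        return best_score, best_path
    if best_path is not None and os.path.exists(best_path):
        os.remove(best_path)  # drop the previous best checkpoint
    new_path = "{}_best_dev={:.2f}.pt".format(model_path_base, dev_score)
    save_fn(new_path)
    return dev_score, new_path
```

The caller keeps best_score and best_path between evaluations, starting from None for both.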

As an example, after setting the paths for data and embeddings, train a Joint-Span parser by running:

sh run_single.sh

To train a Joint-Span parser with BERT, run:

sh run_bert.sh

To train a Joint-Span parser with XLNet, run:

sh run_xlnet.sh

Evaluation Instructions

A saved model can be evaluated on a test corpus using the command python src/main.py test ... with the following arguments:

Argument | Description | Default
--model-path-base | Path base of saved model | N/A
--evalb-dir | Path to EVALB directory | EVALB/
--test-ptb-path | Path to constituent parsing test data | data/23.auto.clean
--dep-test-ptb-path | Path to dependency parsing test data | data/ptb_test_3.3.0.sd
--embedding-path | Path to pre-trained embedding | data/glove.6B.100d.txt.gz
--eval-batch-size | Number of examples to process in parallel when evaluating on the test set | 100
--dataset | Dataset type | ptb

As an example, after extracting the pre-trained model, you can evaluate it on the test set using the following command:

sh test.sh

To parse sentences, set the input file and pre-trained model, then run the following command:

sh parse.sh

Citation

If you use this software for research, please cite our paper as follows:

@inproceedings{zhou-zhao-2019-head,
    title = "Head-Driven Phrase Structure Grammar Parsing on {P}enn Treebank",
    author = "Zhou, Junru  and Zhao, Hai",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
}

Credits

The code in this repository and portions of this README are based on https://github.com/nikitakit/self-attentive-parser