ant-research / StructuredLM_RTDT

A library for building hierarchical text representations and corresponding downstream applications.
Apache License 2.0

Composition Model

This library aims to construct syntactic compositional representations for text in an unsupervised manner. Areas covered include interpretability, text encoders, and generative language models.

Milestones

"R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling" (ACL 2021), R2D2

We propose an unsupervised structured encoder that composes low-level constituents into high-level constituents without gold trees. The learned trees are highly consistent with human-annotated ones. The backbone of the encoder is a neural inside algorithm with heuristic pruning, so both time and space complexity are linear.
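The idea of a neural inside pass can be sketched in a few lines. The following is a toy, unpruned version, not the R2D2 implementation: the composition function (`compose`), its weights `W` and `v`, and the softmax weighting over split points are all assumptions made for illustration. Without pruning, this version calls `compose` a cubic number of times; the pruning described above is what brings the cost down.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
W = rng.normal(size=(DIM, 2 * DIM))  # hypothetical composition weights
v = rng.normal(size=DIM)             # hypothetical scoring vector


def compose(left, right):
    """Toy composition: merge two child vectors and score the merge."""
    cand = np.tanh(W @ np.concatenate([left, right]))
    return cand, float(v @ cand)


def inside_pass(leaves):
    """Fill a chart bottom-up; each span is a weighted mix over its split points."""
    n = len(leaves)
    chart = {(i, i + 1): leaves[i] for i in range(n)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cands, scores = zip(
                *(compose(chart[(i, k)], chart[(k, j)]) for k in range(i + 1, j))
            )
            # Softmax over split scores (shifted by the max for stability).
            weights = np.exp(np.array(scores) - max(scores))
            weights /= weights.sum()
            chart[(i, j)] = weights @ np.stack(cands)
    return chart


leaves = rng.normal(size=(5, DIM))  # stand-in token embeddings
chart = inside_pass(leaves)
root = chart[(0, 5)]                # representation of the whole sequence
```

Every span `(i, j)` ends up with one vector, so the chart for 5 tokens holds 15 cells, and the cell for the full span serves as the sentence representation.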

"Fast-R2D2: A Pretrained Recursive Neural Network based on Pruned CKY for Grammar Induction and Text Representation" (EMNLP 2022), Fast_r2d2

We replace the heuristic pruning module used in R2D2 with model-based pruning.

"A Multi-Grained Self-Interpretable Symbolic-Neural Model For Single/Multi-Labeled Text Classification" (ICLR 2023), self-interpretable classification

We explore the interpretability of the structured encoder and find that the induced alignment between labels and spans is highly consistent with human rationality.

"Augmenting Transformers with Recursively Composed Multi-Grained Representations" (ICLR 2024), ReCAT

We reduce the space complexity of the deep inside-outside algorithm from cubic to linear and, thanks to the new pruning algorithm proposed in this paper, further reduce the parallel time complexity to approximately log N. Furthermore, we find that joint pre-training of Transformers and composition models can enhance a variety of NLP downstream tasks. We push unsupervised constituency parsing performance to 65% and demonstrate that our model can outperform vanilla Transformers by around 5% on span-level tasks.

"Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale" (ACL 2024) (current main branch)

We propose GPST, a syntactic language model that can be pre-trained efficiently on raw text without any human-annotated trees. When GPST and GPT-2 are both pre-trained on OpenWebText from scratch, GPST can outperform GPT-2 on various downstream tasks. Moreover, it significantly surpasses previous methods on generative grammar induction tasks, exhibiting a high degree of consistency with human syntax.

Overview

Trees learned without supervision

Illustration of GPST generation process

Here is an illustration of the syntactic generation process for the sentence "fruit flies like a banana".
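As a rough textual stand-in for the illustration, the sketch below linearizes one possible binary parse of the sentence into a sequence of token-generation and composition steps, then replays that sequence to rebuild the tree. The action names `GEN` and `COMP`, the chosen parse, and the stack-based replay are assumptions for illustration; GPST's actual generation procedure may differ in its details.

```python
def linearize(tree):
    """Post-order traversal: one GEN per token, one COMP per binary node."""
    if isinstance(tree, str):
        return [("GEN", tree)]
    left, right = tree
    return linearize(left) + linearize(right) + [("COMP", None)]


def replay(actions):
    """Rebuild the tree from the action sequence using a stack."""
    stack = []
    for name, token in actions:
        if name == "GEN":
            stack.append(token)          # emit the next token
        else:
            right = stack.pop()          # merge the two topmost constituents
            left = stack.pop()
            stack.append((left, right))
    assert len(stack) == 1, "a complete sequence leaves exactly one tree"
    return stack[0]


# One assumed parse: ((fruit flies) (like (a banana)))
tree = (("fruit", "flies"), ("like", ("a", "banana")))
actions = linearize(tree)
rebuilt = replay(actions)
```

For a sentence of n tokens this produces n `GEN` steps and n - 1 `COMP` steps, so the five-word example yields nine actions.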

Illustration of how the neural inside pass works

Illustration of pruned neural inside pass

Illustration of parallel training of GPST

README

Setup

Compile the C++ code:

python setup.py build_ext --inplace

Corpus preprocessing

Dataset: WikiText-103 and OpenWebText.

Before pre-training, we preprocess the corpus by splitting raw text into sentences, tokenizing them, and converting them into NumPy memory-mapped format.
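The three steps can be sketched as follows. This is a minimal illustration, not the logic of scripts/preprocess_corpus.sh: the regex sentence splitter, the whitespace vocabulary, the int32 dtype, the file name corpus.bin, and the offsets array are all assumptions; a real pipeline would use a trained tokenizer and the repository's own file layout.

```python
import re
import tempfile
from pathlib import Path

import numpy as np


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_vocab(sentences):
    """Hypothetical whitespace vocabulary mapping each token to an integer id."""
    vocab = {}
    for sent in sentences:
        for tok in sent.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def write_memmap(sentences, vocab, path):
    """Write all token ids into one flat memory-mapped array; return sentence offsets."""
    ids = [vocab[tok] for sent in sentences for tok in sent.split()]
    offsets = np.cumsum([0] + [len(s.split()) for s in sentences])
    arr = np.memmap(path, dtype=np.int32, mode="w+", shape=(len(ids),))
    arr[:] = ids
    arr.flush()
    return offsets


text = "Fruit flies like a banana. Time flies like an arrow."
sentences = split_sentences(text)
vocab = build_vocab(sentences)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.bin"
    offsets = write_memmap(sentences, vocab, path)
    # Reopen read-only: slices are loaded lazily from disk, not all at once.
    data = np.memmap(path, dtype=np.int32, mode="r")
    second = data[offsets[1]:offsets[2]].tolist()
    del data  # release the mapping before the temporary directory is removed
```

Storing one flat id array plus sentence offsets lets a training loop read any sentence by slicing, without loading the whole corpus into memory.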

Raw text acquisition:

WikiText-103: https://huggingface.co/datasets/wikitext Download link reference: https://developer.ibm.com/exchanges/data/all/wikitext-103/

OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext Download link reference: https://zenodo.org/records/3834942

Raw text preprocessing: sh scripts/preprocess_corpus.sh

Pre-training

To pretrain GPST_medium: sh scripts/pretrain_GPST_medium.sh

To pretrain GPST_small: sh scripts/pretrain_GPST_small.sh

Downstream Tasks

GLUE

Data Acquisition

GLUE: https://huggingface.co/datasets/nyu-mll/glue Download link reference: https://gluebenchmark.com/tasks/

Scripts

To finetune GPST_medium on GLUE: sh scripts/finetune_glue_GPST_medium.sh

To finetune GPST_small on GLUE: sh scripts/finetune_glue_GPST_small.sh

Summary Tasks

Data Acquisition and Preprocessing

We acquire the datasets in Parquet format from Hugging Face and preprocess them.

XSum: https://huggingface.co/datasets/EdinburghNLP/xsum

CNN-DailyMail: https://huggingface.co/datasets/abisee/cnn_dailymail

Gigaword: https://huggingface.co/datasets/Harvard/gigaword

Summary dataset preprocessing: sh scripts/preprocess_summary_dataset.sh

Scripts

To finetune GPST_medium on Summary Tasks: sh scripts/finetune_summary_GPST_medium.sh

To evaluate finetuned GPST_medium checkpoints: sh scripts/evaluate_summary_GPST_medium.sh

To finetune GPST_small on Summary Tasks: sh scripts/finetune_summary_GPST_small.sh

To evaluate finetuned GPST_small checkpoints: sh scripts/evaluate_summary_GPST_small.sh

Grammar Induction

Data Acquisition

WSJ: https://paperswithcode.com/dataset/penn-treebank Download link reference: https://drive.google.com/file/d/1m4ssitfkWcDSxAE6UYidrP6TlUctSG2D/view

We further convert the training data to raw text.

Scripts

To finetune GPST_medium on Grammar Induction: sh scripts/finetune_grammar_induction_GPST_medium.sh

To evaluate finetuned GPST_medium checkpoints: sh scripts/evaluate_grammar_induction_GPST_medium.sh then sh scripts/compare_trees.sh

To finetune GPST_small on Grammar Induction: sh scripts/finetune_grammar_induction_GPST_small.sh

To evaluate finetuned GPST_small checkpoints: sh scripts/evaluate_grammar_induction_GPST_small.sh then sh scripts/compare_trees.sh

For evaluating F1 score on constituency trees, please refer to https://github.com/harvardnlp/compound-pcfg/blob/master/compare_trees.py
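For intuition about what such an evaluation measures, here is a minimal sketch of unlabeled bracket F1 between a predicted and a gold binary tree. It is not the referenced compare_trees.py and may differ from it in conventions (for example, this version counts the whole-sentence span and applies no punctuation handling). The nested-tuple tree encoding and the function names are assumptions for illustration.

```python
def spans(tree, start=0):
    """Return (end position, set of (start, end) spans) for a nested-tuple tree."""
    if isinstance(tree, str):
        return start + 1, set()          # a single token contributes no bracket
    pos = start
    collected = set()
    for child in tree:
        pos, child_spans = spans(child, pos)
        collected |= child_spans
    collected.add((start, pos))          # bracket covering this constituent
    return pos, collected


def bracket_f1(pred, gold):
    """Harmonic mean of span precision and recall."""
    _, p = spans(pred)
    _, g = spans(gold)
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


gold = (("fruit", "flies"), ("like", ("a", "banana")))
pred = ("fruit", ("flies", ("like", ("a", "banana"))))  # right-branching guess
score = bracket_f1(pred, gold)
```

Here the prediction shares three of its four brackets with the gold tree, so precision and recall are both 0.75.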

Syntactic Generalization

Data Acquisition and Preprocessing

We acquire the datasets in JSON format from GitHub and preprocess them.

Syntactic Generalization test suites: https://github.com/cpllab/syntactic-generalization/tree/master/test_suites/json

Syntactic Generalization test suites preprocessing: sh scripts/preprocess_sg_dataset.sh

Scripts

To evaluate GPST_medium: sh scripts/evaluate_sg_GPST_medium.sh

To evaluate GPST_small: sh scripts/evaluate_sg_GPST_small.sh

Contact

aaron.hx@antgroup.com