Apache License 2.0

Science-result-extractor

Introduction

This repository contains code and a few datasets to extract TDMS (Task, Dataset, Metric, Score) tuples from scientific papers in the NLP domain. We envision three primary uses for this repository: (1) to extract table content from PDF files, (2) to replicate the paper's results or run experiments based on a textual entailment system, and (3) to train a model to extract TDM mentions. Please refer to the following paper for the full details:

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly. Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 27 July - 2 August 2019

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly. TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. In Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), Online, 19-23 April 2021

Extract table content from PDF files

We developed a deterministic PDF table parser based on GROBID. To use our parser, follow the steps below:

1) Fork and clone this repository, e.g.,

> git clone https://github.com/IBM/science-result-extractor.git

2) Download and install GROBID 0.5.3, following the installation instructions, e.g.,

> wget https://github.com/kermitt2/grobid/archive/0.5.3.zip
> unzip 0.5.3.zip
> cd grobid-0.5.3/
> ./gradlew clean install

(note that gradlew, the Gradle wrapper script shipped with GROBID, needs a Java JDK to be installed beforehand)

3) Configure pGrobidHome and pGrobidProperties in config.properties. The default configuration assumes that the GROBID directory grobid-0.5.3 is a sibling of the science-result-extractor directory.

pGrobidHome=../../grobid-0.5.3/grobid-home
pGrobidProperties=../../grobid-0.5.3/grobid-home/config/grobid.properties

4) PdfInforExtractor provides methods to extract section content and table content from a given PDF file.
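
A wrong relative path in step 3 is an easy mistake to make, so it can help to check the two entries before running the parser. The following is a minimal sketch in Python, not part of this repository: the helper names read_properties and missing_grobid_paths are hypothetical, the parser only handles plain key=value lines, and resolving the paths against a caller-supplied base directory is an assumption about how the Java code interprets them.

```python
from pathlib import Path


def read_properties(text):
    """Parse simple `key=value` lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def missing_grobid_paths(props, base_dir):
    """Return the GROBID keys that are absent or whose path does not exist.

    Assumption: relative paths are resolved against base_dir, which should be
    the directory the extractor is started from.
    """
    keys = ("pGrobidHome", "pGrobidProperties")
    return [
        key
        for key in keys
        if key not in props or not (Path(base_dir) / props[key]).exists()
    ]
```

For example, calling missing_grobid_paths(read_properties(Path("config.properties").read_text()), ".") from the directory where the extractor is run returns an empty list when both paths resolve.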

Run experiments based on textual entailment system

We release the training/testing datasets for all experiments described in the paper. You can find them under the data/exp directory. The results reported in the paper are based on the datasets under the data/exp/few-shot-setup/NLP-TDMS/paperVersion directory. We later cleaned the datasets further (e.g., we removed five PDF files from the testing datasets that also appear in the training datasets under a different name); the cleaned version is under the data/exp/few-shot-setup/NLP-TDMS folder. Below we illustrate how to run experiments on the NLP-TDMS dataset in the few-shot setup to extract TDM pairs.

1) Fork and clone this repository.

2) Download or clone BERT.

3) Copy run_classifier_sci.py into the BERT directory.

4) Download BERT embeddings. We use the base uncased models.

5) Set BERT_DIR to the directory with the embeddings and DATA_DIR to the directory with our train and test data, then run the textual entailment system with run_classifier_sci.py. For example:

> DATA_DIR=../data/exp/few-shot-setup/NLP-TDMS/
> BERT_DIR=./model/uncased_L-12_H-768_A-12/
> python3 run_classifier_sci.py --do_train=true --do_eval=false --do_predict=true --data_dir=${DATA_DIR} --task_name=sci --vocab_file=${BERT_DIR}/vocab.txt --bert_config_file=${BERT_DIR}/bert_config.json --init_checkpoint=${BERT_DIR}/bert_model.ckpt --output_dir=bert_tdms --max_seq_length=512 --train_batch_size=6 --predict_batch_size=6

6) TEModelEvalOnNLPTDMS provides methods to evaluate TDMS tuple extraction.

7) GenerateTestDataOnPDFPapers provides methods to generate a testing dataset for arbitrary PDF papers.
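
To make the evaluation step more concrete, here is a minimal Python sketch of set-based precision, recall, and F1 over the tuples of a single paper. It is only an illustration of the general idea and is not the logic of TEModelEvalOnNLPTDMS; the function name precision_recall_f1 is hypothetical, and the tuples in the example are placeholders rather than real results.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of tuples as sets and return (P, R, F1)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder (task, dataset, metric) tuples, invented for illustration only.
gold_tuples = {("task A", "dataset X", "metric M"), ("task A", "dataset Y", "metric M")}
predicted_tuples = {("task A", "dataset X", "metric M"), ("task B", "dataset X", "metric M")}
scores = precision_recall_f1(gold_tuples, predicted_tuples)
```

Treating each paper's tuples as a set means a prediction only counts when every element of the tuple matches, which is the strictest way to score an extraction.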

Read NLP-TDMS and ARC-PDN corpora

1) Follow the instructions in the README in data/NLP-TDMS/downloader/ to download the entire collection of raw PDFs of the NLP-TDMS dataset. The downloaded PDFs can be moved to data/NLP-TDMS/pdfFile (i.e., mv *.pdf ../pdfFile/.).

2) For the ARC-PDN corpus, the original PDF files can be downloaded from the ACL Anthology Reference Corpus (Version 20160301). We use papers from ACL (P), EMNLP (D), and NAACL (N) published between 2010 and 2015. After uncompressing the downloaded PDF files, put them into the corresponding directories under the /data/ARC-PDN/ folder, e.g., copy D10 to /data/ARC-PDN/D/D10.

3) We release the parsed NLP-TDMS and ARC-PDN corpora. NlpTDMSReader and ArcPDNReader in the corpus package illustrate how to read section and table contents from PDF files in these two corpora.
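
The directory layout described in step 2 can be sketched as a small Python helper that decides where an uncompressed ARC directory belongs. This is an illustrative sketch, not code from the repository: the function name arc_pdn_target and the relative default root are assumptions, and the directory-name pattern (venue letter plus two-digit year, as in D10) is inferred from the example above.

```python
import re

# Venue letters and year range come from the text: P (ACL), D (EMNLP),
# N (NAACL), years 2010 to 2015.
_DIR_PATTERN = re.compile(r"^([PDN])(\d{2})$")


def arc_pdn_target(dir_name, root="data/ARC-PDN"):
    """Return the target path for a directory such as 'D10', or None if out of scope."""
    match = _DIR_PATTERN.match(dir_name)
    if not match:
        return None  # not one of the three venues, or not a venue/year name
    venue, year = match.group(1), int(match.group(2))
    if not 10 <= year <= 15:
        return None  # outside the 2010-2015 range
    return f"{root}/{venue}/{dir_name}"
```

With this helper, arc_pdn_target("D10") yields data/ARC-PDN/D/D10, while directories from other venues or years are skipped.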

Train a model to extract TDM mentions

We release the TDMSci corpus (under the data folder). The dataset is in the standard CoNLL format.
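
As a starting point for loading the corpus, the Python sketch below groups CoNLL-style lines into sentences. It rests on assumptions that should be checked against the actual files: one token per line, whitespace-separated columns with the token first and the tag last, and blank lines between sentences. The function name read_conll and the tag names in the sample are hypothetical.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))  # first column: token, last: tag
    if current:
        sentences.append(current)
    return sentences


# Invented sample lines; the tag inventory here is an assumption.
sample = ["We O", "evaluate O", "on O", "SQuAD B-DATASET", "", "F1 B-METRIC"]
parsed = read_conll(sample)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, e.g. read_conll(open(path, encoding="utf-8")).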

Citing science-result-extractor

Please cite the following papers when using science-result-extractor:

@inproceedings{houyufang2019acl,
  title={Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction},
  author={Hou, Yufang and Jochim, Charles and Gleize, Martin and Bonin, Francesca and Ganguly, Debasis},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, {\em Florence, Italy, 27 July -- 2 August 2019}},
  year      = {2019}
}

@inproceedings{houyufang2021eacl,
  title={TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics},
  author={Hou, Yufang and Jochim, Charles and Gleize, Martin and Bonin, Francesca and Ganguly, Debasis},
  booktitle = {Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics, {\em Online, 19--23 April 2021}},
  year      = {2021}
}