D61-IA / stellar-gnosis

Gnosis paper management and collaboration tool
Apache License 2.0

Prepare NeurIPS data on Kaggle for Gnosis DB import #39

Closed: PantelisElinas closed this issue 4 years ago

PantelisElinas commented 4 years ago

The paper data for the NeurIPS conference from 1987 to 2017 has been scraped and uploaded to Kaggle at https://www.kaggle.com/benhamner/nips-papers

The data includes the paper title, abstract and authors as well as the full paper text.

About half the papers are missing the abstract, but the abstract appears to be included in the extracted paper text.

This ticket is about preparing the data for import into Gnosis DB by extracting the abstract from the paper text for those papers that are tagged with "Missing Abstract".

We need a Python script that will read the Kaggle data and output a CSV file with one column for each necessary field in the Paper model: title, abstract, download_link, source_link (what is this used for?) and authors (a comma-separated list of author names). The authors are not part of the Paper model, but we need to create Person models to associate with each paper.
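
A minimal sketch of the CSV-writing part of such a script, not a final implementation. It assumes the Kaggle data has already been loaded into lists of dicts, that papers carry `id`, `title` and `abstract` keys, that authors carry `id` and `name`, and that a separate link table maps `paper_id` to `author_id`. These key names and the helper names `build_rows` and `write_rows` are assumptions for illustration. The ticket does not say where `download_link` and `source_link` should come from, so the sketch leaves them empty.

```python
import csv
from collections import defaultdict

# Output columns, in the order listed in this ticket.
FIELDS = ["title", "abstract", "download_link", "source_link", "authors"]


def build_rows(papers, authors, paper_authors):
    """Join each paper to its author names and shape it for the output CSV.

    All three arguments are lists of dicts; the key names used here
    (id, title, abstract, name, paper_id, author_id) are assumptions
    about the Kaggle tables, not confirmed by the ticket.
    """
    name_by_id = {a["id"]: a["name"] for a in authors}

    # Collect author names per paper, preserving link-table order.
    names_by_paper = defaultdict(list)
    for link in paper_authors:
        names_by_paper[link["paper_id"]].append(name_by_id[link["author_id"]])

    rows = []
    for paper in papers:
        rows.append({
            "title": paper["title"].strip(),
            "abstract": paper["abstract"].strip(),
            # Left empty: the ticket does not define how these are derived.
            "download_link": "",
            "source_link": "",
            "authors": ", ".join(names_by_paper.get(paper["id"], [])),
        })
    return rows


def write_rows(rows, out):
    """Write the rows to an open text file object as CSV with a header."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

The input lists could be produced with `csv.DictReader` over the downloaded files. Because the authors column is itself comma-separated, the `csv` module quotes that field on output, so the importer should parse the file with a real CSV reader rather than splitting lines on commas.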

For papers whose abstract is marked as missing, we should try to extract the abstract from the paper text included in the Kaggle dataset.
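
One possible heuristic for the extraction, sketched under assumptions: the paper text contains a line starting with the word "Abstract", and the abstract ends where an "Introduction" heading begins. The marker string `"Abstract Missing"`, the function names `extract_abstract` and `fill_abstract`, and the `max_chars` fallback are all hypothetical. The exact placeholder used in the dataset should be checked before relying on this.

```python
import re

# Assumed placeholder for a missing abstract; verify against the dataset.
MISSING_MARKER = "Abstract Missing"

# A line that starts with "Abstract", optionally followed by punctuation.
_START = re.compile(r"^\s*abstract\b[\s:.\-]*", re.IGNORECASE | re.MULTILINE)

# A line that starts with "Introduction", optionally numbered ("1", "1.", "I.").
_END = re.compile(
    r"^\s*(?:\d+\.?\s+|I\.?\s+)?introduction\b",
    re.IGNORECASE | re.MULTILINE,
)


def extract_abstract(paper_text, max_chars=3000):
    """Return the text between the Abstract and Introduction headings.

    Returns None when no Abstract heading is found. If no Introduction
    heading follows, the result is cut off after max_chars characters.
    """
    start = _START.search(paper_text)
    if start is None:
        return None
    rest = paper_text[start.end():]
    end = _END.search(rest)
    body = rest[:end.start()] if end else rest[:max_chars]
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    body = " ".join(body.split())
    return body or None


def fill_abstract(abstract, paper_text):
    """Keep an existing abstract; otherwise try to extract one from the text."""
    if abstract.strip() and abstract.strip() != MISSING_MARKER:
        return abstract
    return extract_abstract(paper_text) or abstract
```

This is deliberately conservative: when nothing can be extracted, the original value is kept, so papers that still lack an abstract remain easy to find and review by hand. Older papers with unusual layouts, or extracted text with a garbled heading, will likely need a different rule.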