earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020
MIT License

Merge dataframe with second data (authors, institutions, countries) #3

Open paulsedille opened 3 years ago

paulsedille commented 3 years ago

Is your feature request related to a problem? Please describe.

The PDF formatting makes it difficult to scrape the authors and their institutions from the XML. Fortunately, there is another repository of the articles that makes this easier. Even more fortunately, someone has already done the hard work of scraping it with Python and has added the country of affiliation for many institutions: https://github.com/nd7141/icml2020

Describe the solution you'd like

Can the authors, institutions, and countries data scraped by the above GitHub user be merged into our dataframe and output as a single CSV file?
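A minimal sketch of what such a merge could look like with pandas, assuming both tables share a paper title that can serve as the join key. The column names (`title`, `abstract`, `authors`, `institutions`, `countries`), the helper names, and the sample rows are all hypothetical; neither repository's actual schema has been checked here.

```python
import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so titles from both sources can match."""
    return " ".join(str(title).lower().split())


def merge_paper_metadata(papers: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    """Left-join author/institution/country metadata onto the scraped papers.

    Both frames are assumed (hypothetically) to have a 'title' column.
    """
    papers = papers.assign(title_key=papers["title"].map(normalize_title))
    metadata = metadata.assign(title_key=metadata["title"].map(normalize_title))
    # Keep one metadata row per title so the join cannot duplicate papers.
    metadata = metadata.drop(columns=["title"]).drop_duplicates(subset="title_key")
    merged = papers.merge(metadata, on="title_key", how="left", validate="many_to_one")
    return merged.drop(columns=["title_key"])


# Made-up sample rows, only to show the shape of the result.
papers = pd.DataFrame(
    {
        "title": ["A Study of  Widgets", "Learning Gadgets"],
        "abstract": ["First abstract.", "Second abstract."],
    }
)
metadata = pd.DataFrame(
    {
        "title": ["a study of widgets"],
        "authors": ["A. Author; B. Author"],
        "institutions": ["Example University"],
        "countries": ["Exampleland"],
    }
)

merged = merge_paper_metadata(papers, metadata)
# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = merged.to_csv(index=False)
```

A left join keeps every scraped paper even when no metadata row matches, so unmatched papers show up with empty author, institution, and country fields and can be reviewed by hand. Passing a file name to `to_csv` would write the single CSV file asked for above.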

Describe alternatives you've considered

Will need to look into this!