D61-IA / stellar-gnosis

Gnosis paper management and collaboration tool
Apache License 2.0

Feature/kaggle import #51

Closed Zhenghao-Zhao closed 4 years ago

Zhenghao-Zhao commented 4 years ago

Features: processes the CSV data from https://www.kaggle.com/benhamner/nips-papers and uploads it to the designated Postgres DB.

Zhenghao-Zhao commented 4 years ago

Note that the order of authors is taken from the top-to-bottom row order in paper_authors.csv. I found an instance where this order is not accurate: (Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

Zhenghao-Zhao commented 4 years ago

I have no more updates in this branch.

Zhenghao-Zhao commented 4 years ago

Script updated according to changes in Person and Paper models.

Zhenghao-Zhao commented 4 years ago

I propose an alternative solution for importing NIPS papers from NIPS'87 to NIPS'18.

The newly published ACM website has all NIPS papers. We could run a script to scrape them directly from the website.

https://dl.acm.org/conference/nips/proceedings

PantelisElinas commented 4 years ago

Hi @Zhenghao-Zhao,

I just had a look at this, and your script does not work as the original ticket #39 requires. In that ticket I said,

We need a Python script that will read the Kaggle data and output a csv file with one column for each necessary field in Paper model: title, abstract, download_link, source_link (what is this used for?) and authors (comma separated list of author names).

Your solution does not create this file. I need the data in the above format so that I can combine it with the arXiv data and do a single upload to the database. I have created the script scripts/import_from_csv.py to load the data from a correctly formatted csv file.

Please update your script to output the data in the required format. You can keep the code that uploads the data into the database, but please make this step optional: the import_from_csv.py script makes an effort to find the "unique" author names before creating the Person models, and it won't work if some data have already been loaded into the db.
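One way to make the upload step optional is a command-line flag that is off by default. The following is only a minimal sketch: the flag names `--upload` and `--output`, the default file name, and the step names are hypothetical and are not taken from the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; the upload step is disabled unless requested."""
    parser = argparse.ArgumentParser(
        description="Convert the Kaggle NIPS data to a single CSV file."
    )
    # Hypothetical option: where to write the converted CSV.
    parser.add_argument("--output", default="nips_papers.csv",
                        help="path of the CSV file to write")
    # Hypothetical option: upload to the database only when explicitly asked.
    parser.add_argument("--upload", action="store_true",
                        help="also upload the data to the database")
    return parser.parse_args(argv)


def planned_steps(argv=None):
    """Return the names of the steps that would run for the given arguments."""
    args = parse_args(argv)
    steps = ["write_csv"]
    if args.upload:
        steps.append("upload_to_db")
    return steps
```

With this layout, running the script without arguments only writes the CSV file, so the database stays untouched until import_from_csv.py is run.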

Regards,

P.

Zhenghao-Zhao commented 4 years ago

Changes made: the script now strictly outputs a CSV file with the required columns. The output file is now named 'csv_result'.
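For reference, a conversion of this kind can be sketched as below. This is not the script from this branch. It assumes the Kaggle tables have the columns `id`, `title`, `abstract` (papers.csv), `id`, `name` (authors.csv) and `paper_id`, `author_id` (paper_authors.csv). The helper names are hypothetical, and `download_link` and `source_link` are left empty because how they should be filled is not settled in this thread.

```python
import csv
from collections import defaultdict

# Columns required by ticket #39.
FIELDS = ["title", "abstract", "download_link", "source_link", "authors"]


def build_rows(papers, authors, paper_authors):
    """Join the three tables (lists of dicts) into one row per paper."""
    names = {a["id"]: a["name"] for a in authors}
    by_paper = defaultdict(list)
    # Author order follows the top-to-bottom row order of paper_authors.csv.
    for link in paper_authors:
        by_paper[link["paper_id"]].append(names[link["author_id"]])
    rows = []
    for paper in papers:
        rows.append({
            "title": paper["title"],
            "abstract": paper["abstract"],
            "download_link": "",  # assumption: left empty in this sketch
            "source_link": "",    # assumption: left empty in this sketch
            "authors": ", ".join(by_paper.get(paper["id"], [])),
        })
    return rows


def write_rows(rows, handle):
    """Write the rows to an open text file handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_table(path):
    """Load one CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

A caller would load the three files with `read_table`, pass them to `build_rows`, and write the result with `write_rows` to a file opened with `newline=""`.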

PantelisElinas commented 4 years ago

Hi @Zhenghao-Zhao

Thank you for updating. I will merge your code and close the ticket.

P.