ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus
GNU General Public License v2.0

Enhance metadata.csv to include fields such as DOI, URL, and abstract #96

Open ericleasemorgan opened 4 years ago

ericleasemorgan commented 4 years ago

Study carrels are initialized with sets of JSON files and an accompanying metadata file called "metadata.csv". This work is done with a script, cord:/bin/initialize-carrel.sh. Presently the metadata file only contains fields for author, title, date, and file name (key). Yet our CORD data includes additional metadata which can be useful. At the very least, this metadata includes DOI, URL, and abstract.

Edit initialize-carrel.sh so the resulting metadata file includes columns for DOI, URL, and abstract. This data comes from the CORD database (cord:/etc/schema.sql and cord:/etc/cord.db). Extracting DOI and abstract ought to be trivial. Since each CORD record may include multiple URLs and some of them are bogus, you will have to join the documents table with the urls table to get values. Something like this:

SELECT d.document_id, u.url FROM documents AS d, urls AS u WHERE d.document_id=u.document_id AND d.document_id='96' AND u.url LIKE 'http%' LIMIT 1;

Make your life easy; start with the extraction of DOI and abstract values. Once that works, extract a single URL per document, even if the record has several.
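The two steps above can be sketched in Python as a rough illustration, not as the contents of initialize-carrel.sh. Only the documents and urls tables and the document_id and url columns appear in this issue; the doi and abstract column names, the helper function names, and the demo rows are assumptions made for the example, so check cord:/etc/schema.sql before borrowing anything.

```python
import csv
import io
import sqlite3

# Assumption: documents has doi and abstract columns; only document_id
# and urls.url are named in the issue itself.
QUERY = """
SELECT d.document_id,
       d.doi,
       d.abstract,
       (SELECT u.url
          FROM urls AS u
         WHERE u.document_id = d.document_id
           AND u.url LIKE 'http%'
         LIMIT 1) AS url
  FROM documents AS d
 ORDER BY d.document_id
"""


def extra_metadata(connection):
    """Return one (document_id, doi, abstract, url) tuple per document."""
    return connection.execute(QUERY).fetchall()


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["document_id", "doi", "abstract", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


def demo_database():
    """Build a tiny in-memory stand-in for cord.db (hypothetical data)."""
    connection = sqlite3.connect(":memory:")
    connection.executescript("""
        CREATE TABLE documents (document_id INTEGER PRIMARY KEY, doi TEXT, abstract TEXT);
        CREATE TABLE urls (document_id INTEGER, url TEXT);
        INSERT INTO documents VALUES (96, '10.1000/example', 'An example abstract.');
        INSERT INTO documents VALUES (97, NULL, NULL);
        INSERT INTO urls VALUES (96, 'bogus');
        INSERT INTO urls VALUES (96, 'https://example.org/96');
    """)
    return connection


if __name__ == "__main__":
    print(to_csv(extra_metadata(demo_database())), end="")
```

The scalar subquery is a deliberate swap for the plain join: it keeps documents that have no usable URL (their url column comes back empty), whereas the join would drop those rows from the metadata file.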

Once this task is completed, we will be able to pass the metadata values along to individual carrels, their search pages, their topic modeling tools, and their bibliographic interfaces. "Fun with bibliographics?"