ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus
GNU General Public License v2.0

Amalgamate cord:bin/json2txt-pdf.sh and cord:/bin/json2txt-pmc.sh #79

Closed ericleasemorgan closed 4 years ago

ericleasemorgan commented 4 years ago

json2txt-pdf.sh and json2txt-pmc.sh each take a specifically shaped JSON file as input, and each outputs a pseudo-article which is amenable to natural language processing. The scripts are located here:

Your mission, if you choose to accept it, is to combine these two scripts into a single script and optionally enhance the output.

All the JSON files are in /export/cord/json. The files beginning with "P" are "PMC" files. Everything else is a "PDF" file. Be forewarned: the json directory contains 128,000 files, so listing it is impractical.

The difference between the two scripts is trivial. json2txt-pdf.sh uses a sha value as a key. json2txt-pmc.sh uses a pmc_id as a key. This difference is manifested in the constant called "TEMPLATE". Create a new script, json2txt.sh, and allow it to take two command line arguments. The first is the path to a JSON file, and the second is the type of JSON file it is ("pdf" or "pmc"). I suppose one could look at the shape of the .paper_id value and branch accordingly, and thus eliminate the need for the second argument. Heck, one could examine the JSON file's name and branch accordingly as well. The choice is yours.
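The branching described above might be sketched as follows. This is only an illustration, not the script that was eventually merged: the function names infer_type, key_for_type, and resolve_type are hypothetical, and the actual contents of TEMPLATE are not reproduced here. The sketch assumes only what is stated above, namely that file names beginning with "P" are PMC files and that the two types are keyed by sha and pmc_id respectively.

```shell
#!/bin/sh
# Sketch only; function names are hypothetical, not from the repository.

# Guess the JSON type from the file name: names starting with "P" are PMC,
# everything else is treated as PDF.
infer_type() {
  case "$(basename "$1")" in
    P*) echo "pmc" ;;
    *)  echo "pdf" ;;
  esac
}

# Map a type to the key that its TEMPLATE would use.
key_for_type() {
  case "$1" in
    pdf) echo "sha" ;;
    pmc) echo "pmc_id" ;;
    *)   echo "unknown type: $1" >&2; return 1 ;;
  esac
}

# Use the second argument as the type when it is given; otherwise infer it.
resolve_type() {
  if [ -n "${2:-}" ]; then
    echo "$2"
  else
    infer_type "$1"
  fi
}
```

With these in place, a combined json2txt.sh could call resolve_type "$1" "$2" once near the top and pick its TEMPLATE from the result, which keeps the second argument optional.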

For extra credit, try to improve the plain text output so it includes tables, section headings, references, etc. These types of content are denoted in the JSON files with sets of labels. I'm not sure, but I believe the difference between the two kinds of JSON file is minor. Their structures are documented here, sort of:

https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94
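As one possible starting point for the section headings, here is a rough sketch using jq. It assumes jq is installed, and it assumes the JSON has a body_text array whose items carry "section" and "text" fields; check that against the documentation linked above before relying on it. The function name body_with_headings is hypothetical. It prints each paragraph as plain text and emits an upper-cased heading, set off by blank lines, whenever the section label changes.

```shell
#!/bin/sh
# Sketch only. Assumes jq is available and that the input has the shape
# { "body_text": [ { "section": "...", "text": "..." }, ... ] }.

body_with_headings() {
  jq -r '
    reduce .body_text[] as $p (
      {last: "", out: []};
      ($p.section // "") as $s
      # Emit a heading only when a non-empty section label changes.
      | if $s != "" and $s != .last
        then .out += ["", ($s | ascii_upcase), ""]
        else .
        end
      | .out += [$p.text]
      | .last = $s
    )
    | .out[]
  ' "$1"
}
```

Tables and references would need similar treatment of whatever fields hold them; those are left out here rather than guessed at.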

Some of the JSON files include HTML mark-up. :-( Consider exploiting that also, but don't fret about it too much.

To do this work you will need to either ssh to our cluster, or you will need to download the SQLite database file cord:/etc/cord.db. The database file is about a GB in size.

'Make sense?

Good luck. And this message will self-destruct in five seconds..... [Puff]

rdoughty commented 4 years ago

@ericleasemorgan for the "extra credit" what kind of markup language do you want? markdown? html? something else?

Amalgamating the scripts is complete - https://github.com/ericleasemorgan/cord-19/compare/json2txt

ericleasemorgan commented 4 years ago

I have not been ignoring you, I promise.

Markup? None, just plain text.

Soon, I will look at the pull request more closely.

ericleasemorgan commented 4 years ago

I have merged the json2txt.sh script into the repository, and I think we can call this "done".