analysis: publisher classifier

We need a way to analyze the "quality" of the more popular journal article publishers. In other words, we need a classifier, keyed on the publisher (ScienceDirect, PubMed, OUP, etc.), that estimates how likely the entries in publication partitions are to "survive" all the way through our workflow to successful PDF parsing.

Methodology:

Analyze the distribution of publishers among the entries in the partitions in bucket_final: resolve each doi field to a URL, then fetch that URL to determine its DNS domain.
Per publisher, how many entries have no open access PDF? Use a title match against errors/misses_final.txt.
Per publisher, how many PDFs fail to download? See rclc/errors.txt.
Per publisher, how many PDFs fail to parse for text extraction?
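
The steps above could be sketched roughly as follows. This is only a minimal illustration, not the workflow's actual code: the function names, the `STAGES` tuple, and the record layout (one dict per entry with a `domain` key and one boolean per stage) are all assumptions made for the example, and the formats of `errors/misses_final.txt` and `rclc/errors.txt` are not modeled here. The domain is taken as the hostname with a leading `www.` removed, so subdomains such as `academic.oup.com` are kept as-is; grouping by registered domain would need a public suffix list.

```python
from collections import Counter
from urllib.parse import urlparse
import urllib.request

# Hypothetical stage names, in workflow order; not taken from the repo.
STAGES = ("open_access", "downloaded", "parsed")


def resolve_doi(doi, timeout=10):
    """Follow the doi.org redirects and return the final URL.

    Uses a HEAD request; some publisher sites may reject HEAD or
    block scripted clients, so a real run would need fallbacks.
    """
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        method="HEAD",
        headers={"User-Agent": "publisher-classifier-sketch"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl()


def publisher_domain(url):
    """Return the hostname of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def survival_by_publisher(records):
    """Fraction of entries per domain that passed every stage.

    Each record is assumed to be a dict with a 'domain' key and one
    boolean per name in STAGES.
    """
    totals = Counter()
    survived = Counter()
    for rec in records:
        totals[rec["domain"]] += 1
        if all(rec[stage] for stage in STAGES):
            survived[rec["domain"]] += 1
    return {domain: survived[domain] / totals[domain] for domain in totals}
```

The per-stage booleans would be filled in beforehand from the error files named above, and the resulting per-domain fractions are what a notebook could then plot or feed to a classifier.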

Delivery:

Results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo.
Later we'll move the analysis into an additional workflow step.
