We need a way to assess the "quality" of the more popular journal article publishers. In other words, we need a classifier, keyed on the publisher (ScienceDirect, PubMed, OUP, etc.), that estimates how likely the entries in the publication partitions are to "survive" all the way through our workflow to successful PDF parsing.
Methodology:
Analyze the distribution of publishers among the entries in the partitions in bucket_final by resolving the doi fields to URLs, then fetching those URLs to determine their DNS domains.
Per publisher, what is the distribution of entries that lack open access PDFs? Use a title match on errors/misses_final.txt.
Per publisher, what is the distribution of PDFs that fail to download? See rclc/errors.txt.
Per publisher, what is the distribution of PDFs that fail to parse during text extraction?
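The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the function names (`resolve_doi`, `publisher_domain`, `survival_by_publisher`, `survival_rate`) are hypothetical, and because the formats of errors/misses_final.txt and rclc/errors.txt are not described here, the sketch assumes the failing titles at each stage have already been loaded into sets and that titles match exactly. It also assumes the doi.org resolver redirects to the publisher's landing page, which some publishers may block for automated clients.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def resolve_doi(doi, timeout=10):
    """Follow the doi.org redirects and return the final landing URL.

    Hypothetical helper: needs network access and is not called below.
    """
    req = Request(f"https://doi.org/{doi}",
                  headers={"User-Agent": "publisher-survey-sketch"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.geturl()


def publisher_domain(url):
    """Return the host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def survival_by_publisher(entries, no_open_pdf, download_failed, parse_failed):
    """Tally, per domain, the first stage at which each entry dropped out.

    entries: iterable of (title, landing_url) pairs
    no_open_pdf, download_failed, parse_failed: sets of titles that
    failed at the corresponding stage (assumed to be loaded elsewhere)
    """
    tallies = defaultdict(Counter)
    for title, url in entries:
        if title in no_open_pdf:
            stage = "no_open_pdf"
        elif title in download_failed:
            stage = "download_failed"
        elif title in parse_failed:
            stage = "parse_failed"
        else:
            stage = "survived"
        tallies[publisher_domain(url)][stage] += 1
    return tallies


def survival_rate(tally):
    """Fraction of a domain's entries that reached successful parsing."""
    total = sum(tally.values())
    return tally["survived"] / total if total else 0.0
```

Keeping the network step (`resolve_doi`) separate from the tallying makes it possible to cache the resolved URLs once and rerun the per-publisher counts as the error files change.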
Delivery: