Closed petermr closed 8 years ago
Confirmed that this is a bug:
~/c/g/t/ursus master ❯ jq '.[].fullTextUrlList[].fullTextUrl[] | select(.documentStyle[] == "html" and .availabilityCode[] == "OA").url[]' test_out/ursus/eupmc_results.json | wc -l
100
~/c/g/t/ursus master ❯ cat fulltext_html_urls.txt
http://europepmc.org/articles/PMC4459689
http://europepmc.org/articles/PMC4413493
http://europepmc.org/articles/PMC4289921
Only three URLs appear in the URL file, but the results JSON contains 100 OA HTML fulltext URLs. Investigating now. Note that the new version of getpapers has an option to output the HTML files directly, as already happens with XML and PDF.
Just to clarify - by 'new version' I mean the one I am working on that is not yet released. It will be a v1 alpha.
Fixed in version 0.4.1
[Not sure if this is related to #57]
When running without -x or -p, the default is HTML. A typical search of OA EPMC announces many open access papers but only "downloads" a trivial fraction (5%). There are only 3 URLs in junk3/fulltext_html_urls.txt. These do indeed have fulltext HTML, but manual inspection of the others shows that they also have fulltext HTML. This may be an EPMC bug/feature, but it means that using fulltext_html_urls.txt in quickscrape misses almost all the papers.
In contrast, the options -x and -p seem to give the expected 100 (deduplicated) papers.
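
The mismatch described in this thread can be checked with a short script. This is only a minimal sketch: it assumes the results JSON is a list of records with the nested list structure implied by the jq query in this thread (fullTextUrlList, fullTextUrl, documentStyle, availabilityCode, url). The function names oa_html_urls, missing_urls and check are made up for illustration and are not part of getpapers.

```python
import json


def oa_html_urls(results):
    """Collect URLs of open-access HTML full texts from the result records.

    Assumes each field is a list, as the jq query in this thread implies.
    """
    urls = []
    for record in results:
        for url_list in record.get("fullTextUrlList", []):
            for entry in url_list.get("fullTextUrl", []):
                is_html = "html" in entry.get("documentStyle", [])
                is_oa = "OA" in entry.get("availabilityCode", [])
                if is_html and is_oa:
                    urls.extend(entry.get("url", []))
    return urls


def missing_urls(results, url_file_lines):
    """Return OA HTML URLs present in the results but absent from the URL file."""
    listed = {line.strip() for line in url_file_lines if line.strip()}
    return [url for url in oa_html_urls(results) if url not in listed]


def check(results_path, urls_path):
    """Load both files and return the URLs that the URL file is missing."""
    with open(results_path) as results_file:
        results = json.load(results_file)
    with open(urls_path) as urls_file:
        return missing_urls(results, urls_file.readlines())
```

Calling check("test_out/ursus/eupmc_results.json", "fulltext_html_urls.txt") on the files from this thread would list the OA HTML URLs that never made it into the URL file. An empty list would mean the two outputs agree.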