What is the purpose of the file fulltext_html_urls.txt

anusharanganathan commented 9 years ago

What is the purpose of the file fulltext_html_urls.txt that is produced as part of the output?

Purpose of the run: search open access papers in eupmc for the query dinosaurs and download fulltext XMLs, supplementary files and fulltext PDFs where available

Query used

$ getpapers -q 'dinosaurs' -x -s -p -o dinosaursOutput2 >> dinosaursOutput2.log

This generated a fulltext_html_urls.txt file with 22 urls

Not all pmcids listed in fulltext_html_urls.txt had a corresponding fulltext.xml or fulltext.html file downloaded. Of the 22 urls with pmcids listed in the file, the breakdown of what I found was as follows:

20 of the pmcids had an empty dir
2 of the pmcids had a dir with a fulltext.xml file but an empty fulltext.html file

For each of the pmcids in the fulltext_html_urls.txt file, the output produced a message similar to the following one:

warn: Article with pmcid "PMC3381548" had no fulltext PDF url
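
A breakdown like the one above can be reproduced with a small script. This is only a rough sketch and not part of getpapers. It assumes each url in fulltext_html_urls.txt contains a PMCID, and that each article's files sit in a directory named after that PMCID directly under the output directory. The function names classify and audit are made up for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: every url in the list contains an id of the form PMC<digits>.
PMCID_RE = re.compile(r"PMC\d+")


def classify(article_dir):
    """Describe what one per-article directory contains."""
    article_dir = Path(article_dir)
    if not article_dir.is_dir():
        return "missing dir"
    # Map each file name to its size in bytes.
    sizes = {p.name: p.stat().st_size for p in article_dir.iterdir() if p.is_file()}
    if not sizes:
        return "empty dir"
    non_empty = sorted(name for name, size in sizes.items() if size > 0)
    empty = sorted(name for name, size in sizes.items() if size == 0)
    parts = []
    if non_empty:
        parts.append("non-empty: " + ", ".join(non_empty))
    if empty:
        parts.append("empty: " + ", ".join(empty))
    return "; ".join(parts)


def audit(output_dir, url_list_name="fulltext_html_urls.txt"):
    """Count how many listed pmcids fall into each directory state."""
    output_dir = Path(output_dir)
    pmcids = []
    for line in (output_dir / url_list_name).read_text().splitlines():
        match = PMCID_RE.search(line)
        # Keep the first occurrence of each pmcid, in file order.
        if match and match.group(0) not in pmcids:
            pmcids.append(match.group(0))
    # Assumption: the per-article directory is named after the pmcid.
    return Counter(classify(output_dir / pmcid) for pmcid in pmcids)
```

Calling audit("dinosaursOutput2") returns a Counter keyed by directory state, which can be printed with most_common() to get a summary like the two lines above.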

blahah commented 8 years ago

The fulltext HTML file is just a list of the fulltext HTML urls that were available. I'm moving it to an --html option so that users can request that the HTML be downloaded, and there will no longer be a fulltext_html_urls.txt file.

blahah commented 8 years ago

done in 0.4.1

ContentMine / getpapers

What is the purpose of the file fulltext_html_urls.txt #53