markmacgillivray closed 9 years ago
Another example, for a different journal (PNAS): http://dx.doi.org/10.1073/pnas.1421379112
quickscrape --urllist recentpnas/fulltext_html_urls.txt --scraper journal-scrapers/scrapers/pnas.json --output recentpnasfull
recentpnasfull/
├── http_dx.doi.org_10.1073_pnas.1421379112
│ ├── 1421379112.abstract
│ ├── 1421379112.full.pdf
│ └── results.json
You can more or less pass PNAS DOIs output by getpapers to quickscrape, BUT importantly the fulltext PDF isn't renamed to fulltext.pdf. Also, the entire content of results.json is just undefined
@markmacgillivray I ran quickscrape on both of those URLs and got the same files downloaded, fulltext.{html,xml,pdf}, for both. The results.json was empty for the DOI version, but otherwise the files created were identical.
Working on the json issue now.
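To spot the empty or undefined results.json described above across several output directories, a minimal Python sketch along these lines may help. It is not part of quickscrape; the function name classify_results and the four status labels are made up for illustration, and it only assumes that each output directory contains a file called results.json, as in the directory listing in this thread.

```python
import json
from pathlib import Path


def classify_results(path):
    """Return 'missing', 'invalid', 'empty' or 'ok' for a results.json file.

    Hypothetical helper for illustration only, not part of quickscrape.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    text = p.read_text(encoding="utf-8").strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # A bare `undefined` (or a zero-length file) is not valid JSON
        # and ends up here.
        return "invalid"
    # Valid JSON that carries no data: {}, [], null, "" and similar.
    return "ok" if data else "empty"
```

Calling classify_results("recentpnasfull/http_dx.doi.org_10.1073_pnas.1421379112/results.json") on the PNAS output above would be one way to confirm whether the file holds usable JSON after updating.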
@rossmounce the PNAS scraper doesn't specify that the downloaded files should be renamed (see below), so that is unrelated to the DOI URL issue.
"fulltext_pdf": {
  "selector": "//meta[@name='citation_pdf_url']",
  "attribute": "content",
  "download": true
},
"fulltext_html": {
  "selector": "//meta[@name='citation_fulltext_html_url']",
  "attribute": "content",
  "download": true
},
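As a rough way to check which elements of a scraper definition will keep their original filenames, here is a small Python sketch. It rests on two assumptions that are not shown in this thread: that the full scraper file nests these entries under a top-level "elements" object, and that renaming is requested by giving "download" an object with a "rename" key rather than a plain true. The function name downloads_without_rename is hypothetical.

```python
import json


def downloads_without_rename(scraper):
    """List element names that are downloaded but not renamed.

    Assumes (not confirmed in this thread) that elements live under a
    top-level "elements" key and that a rename is expressed as
    "download": {"rename": "..."}.
    """
    kept = []
    for name, spec in scraper.get("elements", {}).items():
        download = spec.get("download")
        if download is True:
            # Plain `true`: file is downloaded under its original name.
            kept.append(name)
        elif isinstance(download, dict) and "rename" not in download:
            kept.append(name)
    return kept


# The PNAS fragment quoted above, wrapped in the assumed "elements" object.
PNAS_FRAGMENT = json.loads("""
{
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "fulltext_html": {
      "selector": "//meta[@name='citation_fulltext_html_url']",
      "attribute": "content",
      "download": true
    }
  }
}
""")

print(downloads_without_rename(PNAS_FRAGMENT))
```

Both elements in the fragment use a plain "download": true, which is consistent with the PDF arriving as 1421379112.full.pdf rather than fulltext.pdf.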
This should be fixed in version 0.4.6. Please update and test.
For example, this PLOS One paper:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130007
works fine in quickscrape and delivers the fulltext PDF, HTML and XML.
But fetching it via the DOI appears to resolve and process, yet only returns the PDF:
http://dx.doi.org/10.1371/journal.pone.0130007
I would have thought the redirect would either fail completely, or succeed and return exactly the same stuff, but it does not...
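One way to narrow this down outside quickscrape is to check whether the DOI URL and the direct article URL end up at the same address once HTTP redirects are followed. The Python sketch below uses only the standard library; the function names final_url and compare_landing_pages are made up for illustration. It follows HTTP redirects only (not JavaScript or meta-refresh redirects), and some sites may reject the default Python client, so treat a mismatch as a hint rather than proof.

```python
from urllib.request import urlopen


def final_url(url, timeout=30):
    """Follow HTTP redirects and return the URL that was finally served."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def compare_landing_pages(url_a, url_b, resolve=final_url):
    """Resolve both URLs and report whether they land on the same address.

    `resolve` can be swapped out, for example with a stub, so the
    comparison can be exercised without network access.
    """
    landed_a = resolve(url_a)
    landed_b = resolve(url_b)
    return landed_a, landed_b, landed_a == landed_b
```

Calling compare_landing_pages with the two PLOS URLs from this comment returns both final addresses plus a boolean. If they match, the difference in downloaded files is more likely to come from how the scraper is applied to the input URL than from the redirect itself.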