ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

DOI resolution gives different results to the direct URL #54

Closed: markmacgillivray closed this issue 9 years ago

markmacgillivray commented 9 years ago

For example, this PLOS One paper:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130007

works fine in quickscrape and delivers fulltext PDF, HTML and XML.

But fetching via the DOI appears to resolve and process correctly, yet only returns the PDF:

http://dx.doi.org/10.1371/journal.pone.0130007

I would have expected the redirect either to fail completely, or to succeed and return exactly the same files, but neither is the case...
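One way to picture why the two forms of the link could behave differently is scraper selection: scrapers declare a URL pattern, and the dx.doi.org form of a link doesn't match a journal's pattern until the redirect has been resolved. This is a hypothetical sketch of that matching step (the `url` field mirrors the scraperJSON convention; the logic is illustrative, not quickscrape's actual code):

```javascript
// Hypothetical scraper definition: the url field holds a regex that
// identifies pages this scraper applies to.
const scraper = { url: 'journals\\.plos\\.org' };

// Test a candidate URL against the scraper's declared pattern.
function matches(scraper, url) {
  return new RegExp(scraper.url).test(url);
}

// The direct article URL matches the pattern...
console.log(matches(scraper,
  'http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130007')); // true

// ...but the unresolved DOI URL does not.
console.log(matches(scraper,
  'http://dx.doi.org/10.1371/journal.pone.0130007')); // false
```

If scraper matching happens before the redirect is followed, the DOI form of the URL could take a different code path from the direct link even though both end up at the same page.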

rossmounce commented 9 years ago

Another example, for a different journal (PNAS) http://dx.doi.org/10.1073/pnas.1421379112

quickscrape --urllist recentpnas/fulltext_html_urls.txt --scraper journal-scrapers/scrapers/pnas.json --output recentpnasfull
recentpnasfull/
├── http_dx.doi.org_10.1073_pnas.1421379112
│   ├── 1421379112.abstract
│   ├── 1421379112.full.pdf
│   └── results.json

You can more or less pass PNAS DOIs output by getpapers to quickscrape, but importantly the fulltext PDF isn't renamed to fulltext.pdf. Also, the entire content of results.json is just undefined.

blahah commented 9 years ago

@markmacgillivray I ran quickscrape on both of those URLs and got the same files downloaded, fulltext.{html,xml,pdf} for both. The results.json was empty for the DOI version, but otherwise the files created were identical.

Working on the json issue now.

blahah commented 9 years ago

@rossmounce the PNAS scraper doesn't specify that the downloaded files should be renamed (see below), so that is unrelated to the DOI URL issue.

    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "fulltext_html": {
      "selector": "//meta[@name='citation_fulltext_html_url']",
      "attribute": "content",
      "download": true
    },
blahah commented 9 years ago

This should be fixed in version 0.4.6 - please update and test