ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Relative vs. Absolute href on journal pages #28

Closed. pbulsink closed this issue 9 years ago

pbulsink commented 10 years ago

I'm trying to write scrapers for ACS and Nature, and they use relative links in their pages. The scraper doesn't appear to follow these relative links.

For example, on http://pubs.acs.org/doi/abstract/10.1021/ja409271s the full-text PDF link is <a title="Download the PDF Full Text" href="/doi/pdf/10.1021/ja409271s">, and it is not followed when the scraper definition contains this JSON:

...
    "fulltext_pdf": {
      "selector": "//a[@title='Download the PDF Full Text']",
      "attribute": "href",
      "download": true
    },
...

blahah commented 10 years ago

Thanks for the report. This will be fixed in the next release.

pbulsink commented 10 years ago

A few publishers put the full link in the head as a meta tag, with the HTML full-text URL as its content:

<meta xmlns="http://www.w3.org/1999/xhtml" name="citation_fulltext_html_url" content="http://onlinelibrary.wiley.com/doi/10.1002/anie.200501671/full" />

Some way of following that might be more exact than piecing together the URL from the relative locations.

blahah commented 10 years ago

Yes, it's advisable to always use the meta tags if what you're looking for is there. This is what all the existing journal-scrapers do.
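
As a rough illustration of the meta-tag approach, here is a minimal Python sketch that pulls citation_fulltext_html_url out of a page head using only the standard library. It is not how quickscrape does it (quickscrape reads scraper definitions like the JSON above); the class MetaTagParser and the function fulltext_html_url are hypothetical names made up for this example, and the sample markup is the Wiley tag quoted earlier.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Hypothetical helper: collects <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also end up here, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content:
            # Keep the first occurrence if a name is repeated.
            self.meta.setdefault(name, content)


def fulltext_html_url(html):
    """Return the citation_fulltext_html_url meta content, or None."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta.get("citation_fulltext_html_url")


sample = (
    '<html><head><meta xmlns="http://www.w3.org/1999/xhtml" '
    'name="citation_fulltext_html_url" '
    'content="http://onlinelibrary.wiley.com/doi/10.1002/anie.200501671/full" />'
    "</head></html>"
)
print(fulltext_html_url(sample))
```

Because the meta content is already an absolute URL, nothing has to be reconstructed from the page location, which is the advantage pbulsink points out.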

blahah commented 9 years ago

Relative link resolution is now included
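
For readers wondering what relative link resolution means for the ACS example above, here is a minimal Python sketch using the standard library's urljoin. It only illustrates the general technique and makes no claim about quickscrape's actual implementation; resolve_href is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_href(page_url, href):
    """Resolve an href against the URL of the page it was found on.

    Absolute hrefs come back unchanged; root-relative hrefs such as
    "/doi/pdf/..." are joined onto the scheme and host of the page.
    """
    return urljoin(page_url, href)


page = "http://pubs.acs.org/doi/abstract/10.1021/ja409271s"
print(resolve_href(page, "/doi/pdf/10.1021/ja409271s"))
# http://pubs.acs.org/doi/pdf/10.1021/ja409271s
```

An href that is already absolute passes through untouched, so the same call can be applied to every extracted link before downloading.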