ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Follow links #14

Closed by noamross 9 years ago

noamross commented 10 years ago

In writing a similar tool, I found that sometimes the information for a single document was provided on multiple pages, and that I needed to follow links within the page to get all the metadata. This might look like this in the scraper definition:

  "url": "\\w+\\.\\w+",
  "follow-links":  {
      "article_info":  {
         "selector":  "//meta[@name='article_info'_url]
         "attribute": "content"
          }
       }
  "elements": {
    "funder": {
      "selector": "//span[@name='funding_source']",
      "page": "article_info"
     }
  }

The scraper then also opens the pages whose URLs are collected in "follow-links", and elements with a "page" attribute are extracted from those pages rather than from the URL originally provided.

The use case I've seen is that some journals have metadata on a different page from the abstract/DOI landing page.

blahah commented 10 years ago

That's a nice idea, thanks. Have you got any example sites?

noamross commented 10 years ago

For Springer journals, the DOI landing page is always the abstract, not the full text (e.g., http://dx.doi.org/10.1007/s10021-014-9786-0). This page has most of the metadata, but if you want images, you'll need to follow the URL in the "content" attribute of //meta[@name='citation_fulltext_html_url'].
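
In the proposed syntax, that might look roughly like this (a sketch only: the "fulltext" key, the "figure" element, and the "url" pattern are illustrative placeholders, not an existing quickscrape feature):

    "url": "link\\.springer\\.com",
    "follow-links": {
      "fulltext": {
        "selector": "//meta[@name='citation_fulltext_html_url']",
        "attribute": "content"
      }
    },
    "elements": {
      "figure": {
        "selector": "//img",
        "attribute": "src",
        "page": "fulltext"
      }
    }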

For JSTOR, if you have full-text access, the DOI will land you on the full-text page (http://www.jstor.org/stable/10.1086/598847), while some metadata is only on the "Summary" page, which is where you'd land if you don't have access. So you have to follow //ul[@class="menu Pagemenu"]/li, get all the 'href' attributes from those nodes, and regex out the one containing 'stable/info'. (I note that this is a relative URL, while the Springer one above is absolute.) Similarly, there's a link to a "Media" page in the same place, which may be a better place to extract images from.

blahah commented 10 years ago

Excellent, thanks.

This works nicely for the JSTOR case, no regex required:

    //ul[contains(@class, 'pageMenu')]/li/a[text()='Summary']/@href
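
Dropped into the follow-links sketch above, that could look something like this (again just a sketch: the "summary" key and the example element are placeholders, and the relative href would still need resolving against the page URL):

    "url": "jstor\\.org",
    "follow-links": {
      "summary": {
        "selector": "//ul[contains(@class, 'pageMenu')]/li/a[text()='Summary']",
        "attribute": "href"
      }
    },
    "elements": {
      "some_summary_only_field": {
        "selector": "//span[@class='...']",
        "page": "summary"
      }
    }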

noamross commented 10 years ago

I didn't know about text(), thanks!

blahah commented 10 years ago

Implemented (with various other features) in thresher - quickscrape will bring in these changes imminently.