ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Generate scraping addresses from URLs or other identifiers #40

Open petermr opened 9 years ago

petermr commented 9 years ago

Some publishers have "hidden" or "deeply nested" URLs that do not occur on the landing page. For example, Hindawi advertises:

<a href="http://downloads.hindawi.com/journals/ija/2015/426387.epub" class="full_text_epub">Full-Text ePUB</a>

for the ePUB download, but there is no explicit XML link. However, the analogous:

<a href="http://downloads.hindawi.com/journals/ija/2015/426387.xml" class="full_text_xml">Full-Text XML</a> 

works. This issue is to create a syntax for generating such addresses from information scraped from the landing page.

blahah commented 9 years ago

This is a sub-issue of #16 which would create a plugin system for post-processing of captured elements. Under this framework we could have something like:

"fulltext_xml": {
  "selector": "//a[contains(concat(' ', normalize-space(@class), ' '), ' full_text_epub ']",
  "attribute": "href",
  "process": [
    {
      "processor": "replace",
      "arguments": {
        "pattern": "epub",
        "replacement": "xml"
      }
    }
  ],
  "download": {
    "rename": "fulltext.xml"
  }
}

The process array could have any number of elements in it, and they would be executed in order, before any downloading or link-following occurred.
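
For example, two replace steps would run in sequence, first swapping the file extension and then rewriting the host (the hostname rewrite here is purely illustrative, since replace is the only processor sketched so far). A rough sketch under the proposed syntax:

"fulltext_xml": {
  "selector": "//a[contains(concat(' ', normalize-space(@class), ' '), ' full_text_epub ')]",
  "attribute": "href",
  "process": [
    {
      "processor": "replace",
      "arguments": {
        "pattern": "epub",
        "replacement": "xml"
      }
    },
    {
      "processor": "replace",
      "arguments": {
        "pattern": "downloads.hindawi.com",
        "replacement": "www.hindawi.com"
      }
    }
  ],
  "download": {
    "rename": "fulltext.xml"
  }
}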

chartgerink commented 8 years ago

Possibly similar/related to this is downloading PDFs that are not directly linked (as in the attached screenshot). The PDF is nested inside that page rather than being the entire page, so if the link from the original page is downloaded, the resulting file is not a working PDF because HTML comes down alongside it.

[screenshot attached, 2015-10-02 10:43:54]

Is there any way to select the initial link, and then select the link to the PDF from the new page (as in the example)?

blahah commented 8 years ago

@chartgerink this case is already handled by scraperJSON using the 'follow' feature. You first capture the link to the PDF page as an element, then define another element that 'follows' it, meaning the link is followed and the second element is extracted from the resulting page.

A simple example that should work for Developmental Science (if you fill in the selector for pdf_container):

{
  "url": "onlinelibrary.wiley.com",
  "followables": {
    "pdf_container": {
      "selector": "//a[text()='link to the pdf container page']",
      "attribute": "href"
    }
  },
  "elements": {
    "fulltext_pdf": {
      "selector": "//embed[@name='plugin']",
      "attribute": "src",
      "follow": "pdf_container",
      "download": {
        "rename": "fulltext.pdf"
      }
    }
  }
}

Note that in this case pdf_container has been put in followables rather than elements, so it is excluded from the results. However, you can also follow any element in elements, in which case both the followed element and the one following it appear in the results, as in the sketch below.
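
A minimal sketch of that alternative, reusing the assumed selectors from the example above but moving pdf_container into elements so its captured href also appears in the results:

{
  "url": "onlinelibrary.wiley.com",
  "elements": {
    "pdf_container": {
      "selector": "//a[text()='link to the pdf container page']",
      "attribute": "href"
    },
    "fulltext_pdf": {
      "selector": "//embed[@name='plugin']",
      "attribute": "src",
      "follow": "pdf_container",
      "download": {
        "rename": "fulltext.pdf"
      }
    }
  }
}

Either definition can be saved to a file and run from the command line with something like the following (the scraper filename and article URL are placeholders):

quickscrape --url <article-url> --scraper wiley.json --output results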

chartgerink commented 8 years ago

*Edit: I got it to work. Somehow the selector did not function for the plugin, but the link I needed was provided elsewhere. Old text retained for docs.

Hm, I cannot get it to work. The page renders, but quickscrape keeps reporting that the selector had no results. I have tried using the direct XPath, and also the direct link to the container page as the scrape target, but to no avail. When using the direct link it does extract the other available properties, so it is not as if the page simply blocked scraping. Do you maybe have additional ideas?

Ps. Thanks for providing such a fast response, Richard!
