ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

download links not present on page #38

Open cnjr2 opened 9 years ago

cnjr2 commented 9 years ago

It would be great to be able to download content from a page for which there are no direct links.

For example, a given webpage (say, www.paper.com) embeds some low-resolution images:

<img src="/foo/carousel/bar/image1.jpg" class="figure"></img>
<img src="/foo/carousel/bar/image2.jpg" class="figure"></img>
<img src="/foo/carousel/bar/image3.jpg" class="figure"></img>

I want to get the high-resolution versions, and I know their locations:

www.paper.com/foo/images/bar/image1.jpg
www.paper.com/foo/images/bar/image2.jpg
www.paper.com/foo/images/bar/image3.jpg

I would like to be able to replace "carousel" with "images" in each image path (with the XPath replace() function, for example) and then just follow the resulting link to download the image:

"figure": {
  "selector": "replace(//img[@class='figure'], 'carousel', 'images')",
  "download": true
}
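
Until something like this is supported in scraper definitions, the same rewrite can be sketched outside the scraper. The following is a minimal illustration in Python using only the standard library, not part of the ContentMine framework; the names FigureCollector and high_res_urls are hypothetical, and the http:// scheme on the base URL is an assumption, since the example URLs above do not give one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FigureCollector(HTMLParser):
    """Collect the src attribute of every <img> tag whose class includes 'figure'."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "img" and "figure" in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


def high_res_urls(html, base_url, old="carousel", new="images"):
    """Swap the path segment `old` for `new` and resolve each path against base_url."""
    collector = FigureCollector()
    collector.feed(html)
    return [
        urljoin(base_url, src.replace(f"/{old}/", f"/{new}/"))
        for src in collector.sources
    ]


if __name__ == "__main__":
    page = (
        '<img src="/foo/carousel/bar/image1.jpg" class="figure"></img>\n'
        '<img src="/foo/carousel/bar/image2.jpg" class="figure"></img>\n'
        '<img src="/foo/carousel/bar/image3.jpg" class="figure"></img>\n'
    )
    # Base URL scheme is assumed for illustration.
    for url in high_res_urls(page, "http://www.paper.com/"):
        print(url)
```

The sketch replaces only a whole path segment ("/carousel/"), so a file name that happens to contain "carousel" is left alone. The resulting URLs could then be passed to any downloader.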

tarrow commented 7 years ago

I think this is a sub-issue of #16.