CoIArt / VVG-Gallery-Scrapy

Vincent Van Gogh's museum gallery scraping code to create a ML model dataset
MIT License

Scrape high resolution drawings #1

Open roisantos opened 2 years ago

roisantos commented 2 years ago

Hi

Thank you for publishing this code. Could you please show me how to scrape high-resolution drawings?

phillipus85 commented 2 years ago

Hello,

Thanks for looking at the code; there is still a lot of work to do.

The functionality for scraping high-resolution drawings is not implemented yet. Nevertheless, after studying the museum's HTML code, I can see two options:

  1. Requesting HR files from the collection index (i.e., the Gallery Index)
  2. Requesting HR files from each element (e.g., Head of a Woman)

The first option applies when you scrape the index data for each image in the gallery. There you can find the tag <picture><source data-srcset="https://..."></picture>, and inside it all the links for the available resolutions, from 200 px to 1500 px.

The following image shows you the details:

image
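
A minimal sketch of the first option, assuming the index page HTML has already been downloaded. It collects every data-srcset value found on a <source> tag inside a <picture>. The names SrcsetCollector and extract_srcsets are made up for illustration and are not part of this repository; only the standard library is used.

```python
from html.parser import HTMLParser


class SrcsetCollector(HTMLParser):
    """Collect data-srcset values from <source> tags nested in <picture>."""

    def __init__(self):
        super().__init__()
        self.picture_depth = 0  # > 0 while we are inside a <picture>
        self.srcsets = []

    def handle_starttag(self, tag, attrs):
        if tag == "picture":
            self.picture_depth += 1
        elif tag == "source" and self.picture_depth > 0:
            # HTMLParser lowercases attribute names for us.
            value = dict(attrs).get("data-srcset")
            if value:
                self.srcsets.append(value)

    def handle_endtag(self, tag):
        if tag == "picture" and self.picture_depth > 0:
            self.picture_depth -= 1


def extract_srcsets(html):
    """Return the raw data-srcset strings found in the given HTML text."""
    parser = SrcsetCollector()
    parser.feed(html)
    return parser.srcsets
```

Inside a Scrapy spider, response.css("picture source::attr(data-srcset)").getall() should give the same list, assuming the attribute is present in the served HTML rather than filled in later by JavaScript.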

The second option applies to each gallery element. Here, as in the first option, each drawing has several *.jpg files from 200 px to 1500 px. The difference is that they are requested by JavaScript code when you use the Zoom option on the page.

Again, the next image shows what I described:

image

Both options need some development and testing, but the idea is that you request (via a REST service) the appropriate data-srcset=... entry for the resolution you want (e.g., 600w, 900w, 1200w or 1500w). Right now I believe the HD file for a drawing is not a single file but many. You should be able to save all the canvas pieces when you make a request.
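
As a rough sketch of the "request and save" step, assuming you already hold one image URL: the helper below streams it to disk with the standard library. The name save_image and its opener parameter are hypothetical (the parameter only exists so the function can be tried without network access). Whether the museum server needs extra headers or accepts such requests at all has not been tested here.

```python
import shutil
import urllib.request
from pathlib import Path


def save_image(url, dest_dir, filename, opener=urllib.request.urlopen):
    """Stream one image URL to dest_dir/filename and return the target path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / filename
    # Copy the response body in chunks instead of loading it all into memory.
    with opener(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

If a drawing really is served as several pieces, the same helper could be called once per piece with a different filename.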

Example data source set from the <picture>...</picture> tag:

data-srcset="https://micrio.vangoghmuseum.nl/iiif/JPkhs/full/600,/0/default.jpg?hash=83pjGNh_eTB_aNKUhtAR0gELWtOS4_9c55oKrNC7CkA 600w,https://micrio.vangoghmuseum.nl/iiif/JPkhs/full/900,/0/default.jpg?hash=e3RvyvzL0uqgL5_5AOG-kwyDamx_Ebmm7Ed_GNgdKbs 900w,https://micrio.vangoghmuseum.nl/iiif/JPkhs/full/1200,/0/default.jpg?hash=2pSGGpStG-0K6hu8286eaaque6UTgv82DwOzhwqj1nE 1200w,https://micrio.vangoghmuseum.nl/iiif/JPkhs/full/1500,/0/default.jpg?hash=eWFEPmlpRQ1Q-fMs1C_yHq0gLi87I1AO_1XbbU8ufM8 1500w,https://micrio.vangoghmuseum.nl/iiif/JPkhs/full/1800,/0/default.jpg?hash=043Lxb0nmhwnHFDHAUqtI8W2ROc0MWZ0xVgxAcF5xPk 1800w"
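
A string like the one above could be split into width/URL pairs roughly as follows. This is only a sketch: parse_srcset and pick_url are hypothetical names, and the pattern assumes absolute http(s) URLs followed by a "w" descriptor. Note that the URLs themselves contain commas (for example "600,"), so splitting the string on "," would cut them in half; a regular expression avoids that.

```python
import re

# One candidate is "<absolute URL> <width>w"; the URL may contain commas.
CANDIDATE = re.compile(r"(https?://\S+)\s+(\d+)w")


def parse_srcset(srcset):
    """Return {width_in_pixels: url} for every candidate in a srcset string."""
    return {int(width): url for url, width in CANDIDATE.findall(srcset)}


def pick_url(candidates, max_width=None):
    """Return the URL of the widest candidate, optionally capped at max_width."""
    widths = sorted(candidates)
    if max_width is not None:
        widths = [w for w in widths if w <= max_width]
    if not widths:
        raise ValueError("no candidate at or below the requested width")
    return candidates[widths[-1]]
```

For example, pick_url(parse_srcset(value), max_width=1200) would select the 1200w entry from the string above. The hash query parameter is kept as part of the URL, on the assumption that the server checks it.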

Sorry, this is not functional code; for now I can only maintain this project in my spare time.