gcerretani / antenati

Tools to download data from Portale Antenati
MIT License

Antenati update overnight changed the URLs #11

Closed jbellanca closed 2 years ago

jbellanca commented 2 years ago

There was an Antenati update last night. The new URLs are in this format: https://www.antenati.san.beniculturali.it/ark:/12657/an_ua36075040?lang=en

It looks like you can still extract the numerical code after the "an_ua" and use it in the old URL format, like: https://www.antenati.san.beniculturali.it/detail-registry/?s_id=36075040 but the IIIF manifest appears to be unavailable.
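The ID extraction described above can be sketched in Python. This is only an illustration based on the two example links in this thread: the regular expression and the helper names `extract_registry_id` and `to_old_url` are assumptions, not part of the antenati tool.

```python
import re
from typing import Optional

# Assumed shape of the new links, taken from the examples in this thread:
# .../ark:/12657/an_ua<digits>[/<suffix>][?lang=en]
ARK_ID_PATTERN = re.compile(r"/ark:/12657/an_ua(\d+)")

# Old URL format quoted in this thread.
OLD_URL_TEMPLATE = (
    "https://www.antenati.san.beniculturali.it/detail-registry/?s_id={s_id}"
)


def extract_registry_id(ark_url: str) -> Optional[str]:
    """Return the digits after 'an_ua', or None if the URL does not match."""
    match = ARK_ID_PATTERN.search(ark_url)
    return match.group(1) if match else None


def to_old_url(ark_url: str) -> Optional[str]:
    """Rebuild the old detail-registry URL from a new ARK-style URL."""
    s_id = extract_registry_id(ark_url)
    return OLD_URL_TEMPLATE.format(s_id=s_id) if s_id else None
```

Calling `to_old_url` on the new-style link above gives the old-style link with `s_id=36075040`; any URL that does not contain the `an_ua` pattern yields `None`.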

The IIIF manifest for the link above is at the below URL, but appears to be secured: https://dam-antenati.san.beniculturali.it/antenati/containers/046by1e/manifest

Using Chrome, you can find the 7-character identifier code from the manifest for each page, and use the existing content link URL, replacing the 7-character code, to download the page: https://iiif-antenati.san.beniculturali.it/iiif/2/xxxxxxx/full/full/0/default.jpg
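As a rough sketch of that manual step, the snippet below builds the page URL from an identifier and pulls identifiers out of a manifest that has already been saved. It assumes the manifest follows the IIIF Presentation API 2.x layout (`sequences` → `canvases` → `images` → `resource` → `service`) and that the identifier is the last path segment of the image service `@id`; neither is confirmed in this thread. `SAMPLE_MANIFEST` is a made-up placeholder, not real data.

```python
from typing import Any, Dict, List

# Content URL pattern quoted in this thread.
IMAGE_URL_TEMPLATE = (
    "https://iiif-antenati.san.beniculturali.it/iiif/2/{identifier}"
    "/full/full/0/default.jpg"
)


def image_url(identifier: str) -> str:
    """Build the full-size JPEG URL for one page identifier."""
    return IMAGE_URL_TEMPLATE.format(identifier=identifier)


def image_ids_from_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Collect page identifiers, assuming an IIIF Presentation 2.x layout."""
    ids: List[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service")
                # Assumption: 'service' is a single object with an '@id'.
                if isinstance(service, dict) and service.get("@id"):
                    # Assumption: the identifier is the last path segment.
                    ids.append(service["@id"].rstrip("/").rsplit("/", 1)[-1])
    return ids


# Hypothetical manifest fragment with placeholder identifiers.
SAMPLE_MANIFEST = {
    "sequences": [{
        "canvases": [
            {"images": [{"resource": {"service": {
                "@id": "https://iiif-antenati.san.beniculturali.it/iiif/2/xxxxxxx"}}}]},
            {"images": [{"resource": {"service": {
                "@id": "https://iiif-antenati.san.beniculturali.it/iiif/2/yyyyyyy"}}}]},
        ]
    }]
}

page_urls = [image_url(i) for i in image_ids_from_manifest(SAMPLE_MANIFEST)]
```

If the real manifest nests the image service differently, only `image_ids_from_manifest` would need to change.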

However, the script will no longer work without an update. Since the manifest may not be available, maybe there's a way to programmatically inspect the page, pull all the codes that would be in the manifest, and download the pages.

I can also give this info: This is one of the older links to an 1860 birth record book: https://www.antenati.san.beniculturali.it/detail-registry/?s_id=1092642&lang=en It translates to this new link: https://www.antenati.san.beniculturali.it/ark:/12657/an_ua1092642/5dgG4e3

EDIT: If I use the Chrome plug-in "Save all resources", it saves all the files loaded, INCLUDING the complete IIIF manifest. For the example page https://www.antenati.san.beniculturali.it/ark:/12657/an_ua36075040?lang=en, where the manifest is at https://dam-antenati.san.beniculturali.it/antenati/containers/046by1e/manifest, the plug-in saves the file manifest.html, which is the complete IIIF manifest. I can't find the correct URL to load the manifest myself, though. Adding ".html" to the manifest link does not work, even though that's the path given by the plug-in.

gcerretani commented 2 years ago

Very curious, this effort to make our job harder. I'll have a look at it; we have to figure out why the manifest URL returns a 403. A few months ago they added geographic and user agent filters. Let's see what's going on now.

jbellanca commented 2 years ago

Yeah, they're being tricky about it with some sort of filter. I went to that example URL above, and using the SaveAllResources plug-in with the option "Include all assets by XHR requests", I was able to save the manifest, but they're definitely doing something to make it harder. For that example, the saved manifest is in the "dam-antenati.san.beniculturali" folder in this zip: https://www.dropbox.com/s/004yjx06omtnf1q/www.antenati.san.beniculturali.it.zip Btw, thanks for all your hard work on this tool, it's awesome and has been a lifesaver for me in my research!

gcerretani commented 2 years ago

I compared the headers sent by the browser during a standard session, with the headers sent by entering the manifest URL directly in the URL bar. The main difference was the presence of these two headers in the standard case:

    referer: https://www.antenati.san.beniculturali.it/
    origin: https://www.antenati.san.beniculturali.it

Seems that "referer" is enough to fix this issue. Thanks a lot, @jbellanca.
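For readers who want to try the same thing outside the tool, here is a minimal sketch of a manifest request that carries the two headers listed above, using only the Python standard library. The function names are hypothetical and this is not necessarily how the antenati script does it; whether the server still accepts such a request is not something this sketch can guarantee.

```python
import json
import urllib.request

PORTAL = "https://www.antenati.san.beniculturali.it"


def build_manifest_request(manifest_url: str) -> urllib.request.Request:
    """Prepare a GET request carrying the Referer and Origin headers."""
    headers = {
        "Referer": PORTAL + "/",  # reported above as sufficient on its own
        "Origin": PORTAL,         # sent by the browser too; kept as a precaution
    }
    return urllib.request.Request(manifest_url, headers=headers)


def fetch_manifest(manifest_url: str, timeout: float = 30.0) -> dict:
    """Download and parse the manifest, assuming the response body is JSON."""
    request = build_manifest_request(manifest_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

`fetch_manifest` raises `urllib.error.HTTPError` if the server still answers with a 403, so a caller can tell a blocked request from a parsing problem.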

gcerretani commented 2 years ago

Interestingly, User-Agent no longer seems to be needed. I've kept it and also added Origin, just in case new filters are added in the future. See fa965f22908994a82b1d039ba9d004a519d0005b

jbellanca commented 2 years ago

Works perfect again - thanks!!! You're the best!

gmalcolms commented 2 years ago

Thanks for finding a work-around for their new security. I had made a similar program many years ago in Excel - one that allows you to just list comuni, record types, and years, and the program would go through the website to find each collection before downloading it, without even having to open a browser - but it stopped working when they added password protection to the manifest. Adding the headers as you found still does not work in VBA either, because XHR uses JavaScript, so I wrote a Python program that downloads the HTML of the manifest given its URL, and then wrapped that in a C++ DLL. My Excel program is working perfectly again. Thank you! Now I'm off to download all the record collections in several provinces before they fix the current security hole.