geogeeks-au / maps-for-lost-towns

Georeferencing historic maps through crowdsourcing

Scraping SRO's digital objects collection #8

Closed keithamoss closed 8 years ago

keithamoss commented 8 years ago

Is there an API? If not, can we get a data dump? If not, we'll scrape.

Will ask SRO.

Next: #11

keithamoss commented 8 years ago

If it turns out there's no API...

Let's scrape the digital objects collection from SRO.

Process

Potential Challenges

Test Data

Page https://archive.sro.wa.gov.au/index.php/a-c-gregory-bejoording-to-sturt-river-through-wongan-hills-021

Dublin Core XML https://archive.sro.wa.gov.au/index.php/a-c-gregory-bejoording-to-sturt-river-through-wongan-hills-021;dc?sf_format=xml

Dublin Core Metadata Schema http://www.openarchives.org/OAI/2.0/oai_dc.xsd

Page https://archive.sro.wa.gov.au/index.php/plan-showing-locations-for-the-salvation-army-north-north-east-of-collie-by-n-j-moore-fieldbooks-20-24-scale-20-chains-to-an-inch-wellington-160

Image https://archive.sro.wa.gov.au/uploads/r/srowa/3/2/32111b70dfa1d0759802feedc15efefc391474bb4509a1d02999ea229e8346be/Cons_3869_Wellington_160.jpg

Dublin Core XML https://archive.sro.wa.gov.au/index.php/plan-showing-locations-for-the-salvation-army-north-north-east-of-collie-by-n-j-moore-fieldbooks-20-24-scale-20-chains-to-an-inch-wellington-160;dc?sf_format=xml
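A rough sketch of how the test data above could be used: each page URL appears to map to its Dublin Core XML by appending `;dc?sf_format=xml`, and the XML should follow the oai_dc schema. The function names and the sample record below are made up for illustration; only the URL pattern and the schema come from the links above, and SRO's real records may contain different fields.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up sample record for illustration only (not real SRO data).
SAMPLE_XML = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example map title</dc:title>
  <dc:type>image</dc:type>
  <dc:type>map</dc:type>
</oai_dc:dc>
"""


def dc_xml_url(page_url):
    """Build the Dublin Core XML URL, following the pattern in the test data."""
    return page_url.rstrip("/") + ";dc?sf_format=xml"


def parse_dublin_core(xml_data):
    """Collect every dc:* element into a dict of lists (elements can repeat)."""
    root = ET.fromstring(xml_data)
    prefix = "{" + DC_NS + "}"
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            text = (element.text or "").strip()
            if text:
                fields.setdefault(name, []).append(text)
    return fields


def fetch_dublin_core(page_url, timeout=30):
    """Download and parse the Dublin Core record for one page (needs network)."""
    with urllib.request.urlopen(dc_xml_url(page_url), timeout=timeout) as response:
        return parse_dublin_core(response.read())
```

`parse_dublin_core(SAMPLE_XML)` works offline for a quick check; `fetch_dublin_core(...)` would be called with one of the page URLs above.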

samwilson commented 8 years ago

I guess it's unlikely that they'll upgrade to the latest AtoM, which has an API?

keithamoss commented 8 years ago

In the short term, certainly unlikely, @samwilson.

They've suggested they could send us a database dump in the next couple of weeks though - so we may not have to scrape at all!

samwilson commented 8 years ago

Oh that's cool! :)

keithamoss commented 8 years ago

Check out https://github.com/geogeeks-au/maps-for-lost-towns/blob/master/scrapers/SRO%20Digital%20Objects%20Scraper.ipynb for a partially completed scraper.

keithamoss commented 8 years ago

The scraper has finished running against SRO! We've now got all 6,745 maps in the database in the new sro_digital_objects_collection table.

Code is here: https://github.com/geogeeks-au/maps-for-lost-towns/blob/master/scrapers/SRO%20Digital%20Objects%20Scraper.ipynb

CSV dump is here: https://github.com/geogeeks-au/maps-for-lost-towns/blob/master/scrapers/sro_digital_objects_collection.csv
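For anyone who wants the CSV dump in a local table, here is a minimal sketch. It assumes SQLite purely for convenience (not necessarily the database the project uses) and takes column names from the CSV header rather than guessing them; `load_csv_into_table` is a hypothetical helper, not something from the notebook.

```python
import csv
import sqlite3


def load_csv_into_table(csv_path, conn, table="sro_digital_objects_collection"):
    """Load a CSV with a header row into a SQLite table and return the row count."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # column names come from the file itself
        # Quote identifiers so unusual column names are still valid SQL.
        columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count
```

Calling it with a downloaded copy of the CSV and `sqlite3.connect("maps.db")` (an arbitrary file name) should return a count that can be compared against the 6,745 figure above.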

[Screenshot, 2016-04-29 16:06:43]