geogeeks-au / maps-for-lost-towns

Georeferencing historic maps through crowdsourcing
4 stars 1 forks source link

Close-enough geocoding map titles #9

Closed keithamoss closed 8 years ago

keithamoss commented 8 years ago

e.g. Working out that

AU WA S236- cons3869 Wellington 160 Plan showing locations for the Salvation Army (North-North-East of Collie) by N.J. Moore, Fieldbooks 20-24 [scale: 20 chains to an inch].

Should be at least roughly associated with the Swan land division, and preferably knowing that it's located around Collie.

So that's going to be fun!

Dependencies: #8 Next: #12 Related: #5

OK, now finally let's try to assign a spatial location to these

They seem to be categorised by the land divisions - which are noted in the titles (e.g. Wellington, Swan) - so that can at least get us down to a rough spatial location.

Maybe from there we can pull town and place names from the titles and do some rough matching against the known location of the land division? Think about how we can do that!

We could be lazy and just compare against the Gazetteer of Australia? Would be nice to use Landgate data though!

(Just because self-interest and maybe a good fluffy open data story)

https://en.wikipedia.org/wiki/Lands_administrative_divisions_of_Western_Australia

Process

keithamoss commented 8 years ago

Mapshaper can hold keyword/tag metadata - could we tag surveyors et cetera

keithamoss commented 8 years ago

OK, so let's have a crack at this land divisions geocoding biso.

SELECT fields->'title' FROM sro_digital_objects_collection

Results from a comparison to data.wa.gov.au's land districts:

Total Matches == 1: 1695
Total Matches > 1: 364
Total Matches == 0: 4686

:(

keithamoss commented 8 years ago

Change of plans - let's use townsite names first. Land Division names are harder to match against the titles, and a few too many of them contain the names of explorers who are also mentioned in titles!

  1. Load WA's list of current and historical townsites into PostgreSQL (Polygons).
  2. Run some naive LIKE queries finding matches in the townsites table.
  3. Add geom_townsites to sro_digital_objects_collection.
  4. Update sro_digital_objects_collection.geom_townsites where there is no geometry yet available in geom_mapindices.

Issues

  1. Decide what to do where matches > 1.
  2. Do a bit of QA to make sure we're not picking up any false positives (e.g. Names of explorers, partial names of other features)

Based on initial estimates this should get us another ~2k maps with a rough location.

keithamoss commented 8 years ago

Next: GEONOMA/Gazetteer?

http://www.ga.gov.au/place-names/index.xhtml

http://www.ga.gov.au/gis/rest/services/topography/Gazetteer_of_Australia/MapServer/8/query?text=test&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&relationParam=&objectIds=&where=1+%3D+1&time=&returnCountOnly=true&returnIdsOnly=false&returnGeometry=true&maxAllowableOffset=&outSR=&outFields=&f=html

keithamoss commented 8 years ago

Altering sro_digital_objects_collection table appears to have broken pg01.geogeeks.org :(

keithamoss commented 8 years ago

OK, make that 2,620 maps that we were able to assign to a townsite on the basis of a townsite name appearing once in the title of the map.

SELECT COUNT(*) FROM
    (SELECT fields->>'title', (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') FROM sro_digital_objects_collection WHERE (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') = 1) AS agg;

ALTER TABLE sro_digital_objects_collection ADD COLUMN geom_townsites Geometry(MultiPolygon, 4283);

UPDATE sro_digital_objects_collection SET geom_townsites = (SELECT geom FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') WHERE geom_mapindices IS NULL AND (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') = 1;

But there are still 303 maps where there are two townsite names matches...

SELECT fields->>'title', (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %'), (SELECT string_agg(name, ', ') FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') FROM sro_digital_objects_collection AS doc WHERE (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') > 1;

There's a mix in here that are duped townsite names in townsites, are using the name of the person who made the map ('By so and so', '- So and so'), and some that do just mention two townsite names

If we exclude the names of people and resolve the dupe townsite names we should be able to get about half of these.

Running tally: 3,851 maps with a rough location assigned!

keithamoss commented 8 years ago

61 more through deduping townsite names.

SELECT fields->>'title', 
    (SELECT string_agg(name, ', ') FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %')) 
FROM sro_digital_objects_collection AS doc 
WHERE 
    geom_townsites IS NULL AND geom_mapindices IS NULL AND 
    (SELECT COUNT(*) FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;

UPDATE sro_digital_objects_collection 
SET geom_townsites = 
    (SELECT geom FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %')) 
WHERE 
    geom_townsites IS NULL AND geom_mapindices IS NULL AND 
    (SELECT COUNT(*) FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;

Another 76 through trimming of the "By so and so" text in the title.

UPDATE sro_digital_objects_collection SET title_processed = UPPER(split_part(fields->>'title', ' by ', 1));

SELECT title_processed, 
    (SELECT string_agg(name, ', ') FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %') 
FROM sro_digital_objects_collection AS doc 
WHERE 
    geom_townsites IS NULL AND geom_mapindices IS NULL AND 
    (SELECT COUNT(*) FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %') = 1;

UPDATE sro_digital_objects_collection 
SET geom_townsites = (SELECT geom FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %') 
WHERE 
    geom_townsites IS NULL AND geom_mapindices IS NULL AND 
    (SELECT COUNT(*) FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %') = 1;

Running tally: 3,988

keithamoss commented 8 years ago

And 76 more by using the old Land Districts dataset.

SELECT fields->>'title',
    (SELECT string_agg(name, ', ') FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %')) 
FROM sro_digital_objects_collection AS doc 
WHERE 
    geom_townsites IS NULL AND geom_mapindices IS NULL AND 
    UPPER(fields->>'title') LIKE '%DISTRICT%' AND
    (SELECT COUNT(*) FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;

UPDATE sro_digital_objects_collection 
SET geom_districts = (SELECT geom FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %')) 
WHERE 
    geom_townsites IS NULL AND geom_mapindices IS NULL AND 
    UPPER(fields->>'title') LIKE '%DISTRICT%' AND
    (SELECT COUNT(*) FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;
keithamoss commented 8 years ago

Final tally: 4,602 projects with a rough location.

That's probably the best we'll get from this approach. The rest of the names are a mix of things that could be roughly assigned a location by a human, but generally contain too many names of places and things to be done programmatically.

So all up we have 68% of projects with a rough spatial location assigned.

Once we know more about how we'll be using this information we'll probably want to process it a bit more e.g. Reducing these MultiPolygon geometries to bounding box envelopes or centroid points.