Closed keithamoss closed 8 years ago
Mapshaper can hold keyword/tag metadata - could we tag surveyors et cetera
OK, so let's have a crack at this land divisions geocoding biso.
SELECT fields->'title' FROM sro_digital_objects_collection
Results from a comparison to data.wa.gov.au's land districts:
Total Matches == 1: 1695
Total Matches > 1: 364
Total Matches == 0: 4686
:(
Change of plans - let's use townsite names first. Land Division names are harder to match against the titles, and a few too many of them contain the names of explorers who are also mentioned in titles!
townsites
table.geom_townsites
to sro_digital_objects_collection
.sro_digital_objects_collection.geom_townsites
where there is no geometry yet available in geom_mapindices.Based on initial estimates this should get us another ~2k maps with a rough location.
Next: GEONOMA/Gazetteer?
Altering sro_digital_objects_collection
table appears to have broken pg01.geogeeks.org :(
OK, make that 2,620 maps that we were able to assign to a townsite on the basis of a townsite name appearing once in the title of the map.
SELECT COUNT(*) FROM
(SELECT fields->>'title', (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') FROM sro_digital_objects_collection WHERE (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') = 1) AS agg;
ALTER TABLE sro_digital_objects_collection ADD COLUMN geom_townsites Geometry(MultiPolygon, 4283);
UPDATE sro_digital_objects_collection SET geom_townsites = (SELECT geom FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') WHERE geom_mapindices IS NULL AND (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') = 1;
But there are still 303 maps where there are two townsite names matches...
SELECT fields->>'title', (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %'), (SELECT string_agg(name, ', ') FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') FROM sro_digital_objects_collection AS doc WHERE (SELECT COUNT(*) FROM townsites AS t WHERE upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %') > 1;
There's a mix in here that are duped townsite names in townsites
, are using the name of the person who made the map ('By so and so', '- So and so'), and some that do just mention two townsite names
If we exclude the names of people and resolve the dupe townsite names we should be able to get about half of these.
Running tally: 3,851 maps with a rough location assigned!
61 more through deduping townsite names.
SELECT fields->>'title',
(SELECT string_agg(name, ', ') FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %'))
FROM sro_digital_objects_collection AS doc
WHERE
geom_townsites IS NULL AND geom_mapindices IS NULL AND
(SELECT COUNT(*) FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;
UPDATE sro_digital_objects_collection
SET geom_townsites =
(SELECT geom FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %'))
WHERE
geom_townsites IS NULL AND geom_mapindices IS NULL AND
(SELECT COUNT(*) FROM townsites AS t WHERE t.usage_desc = 'Town-site Land Act' AND (upper(fields->>'title') LIKE t.name || ' %' OR upper(fields->>'title') LIKE '% ' || t.name OR upper(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;
Another 76 through trimming of the "By so and so" text in the title.
UPDATE sro_digital_objects_collection SET title_processed = UPPER(split_part(fields->>'title', ' by ', 1));
SELECT title_processed,
(SELECT string_agg(name, ', ') FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %')
FROM sro_digital_objects_collection AS doc
WHERE
geom_townsites IS NULL AND geom_mapindices IS NULL AND
(SELECT COUNT(*) FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %') = 1;
UPDATE sro_digital_objects_collection
SET geom_townsites = (SELECT geom FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %')
WHERE
geom_townsites IS NULL AND geom_mapindices IS NULL AND
(SELECT COUNT(*) FROM townsites AS t WHERE title_processed LIKE t.name || ' %' OR title_processed LIKE '% ' || t.name OR title_processed LIKE '% ' || t.name || ' %') = 1;
Running tally: 3,988
And 76 more by using the old Land Districts dataset.
SELECT fields->>'title',
(SELECT string_agg(name, ', ') FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %'))
FROM sro_digital_objects_collection AS doc
WHERE
geom_townsites IS NULL AND geom_mapindices IS NULL AND
UPPER(fields->>'title') LIKE '%DISTRICT%' AND
(SELECT COUNT(*) FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;
UPDATE sro_digital_objects_collection
SET geom_districts = (SELECT geom FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %'))
WHERE
geom_townsites IS NULL AND geom_mapindices IS NULL AND
UPPER(fields->>'title') LIKE '%DISTRICT%' AND
(SELECT COUNT(*) FROM districts AS t WHERE t.name NOT IN ('NORTH', 'EAST') AND (UPPER(fields->>'title') LIKE t.name || ' %' OR UPPER(fields->>'title') LIKE '% ' || t.name OR UPPER(fields->>'title') LIKE '% ' || t.name || ' %')) = 1;
Final tally: 4,602 projects with a rough location.
That's probably the best we'll get from this approach. The rest of the names are a mix of things that could be roughly assigned a location by a human, but generally contain too many names of places and things to be done programmatically.
So all up we have 68% of projects with a rough spatial location assigned.
Once we know more about how we'll be using this information we'll probably want to process it a bit more e.g. Reducing these MultiPolygon geometries to bounding box envelopes or centroid points.
e.g. Working out that
AU WA S236- cons3869 Wellington 160 Plan showing locations for the Salvation Army (North-North-East of Collie) by N.J. Moore, Fieldbooks 20-24 [scale: 20 chains to an inch].
Should be at least roughly associated with the Swan land division, and preferably knowing that it's located around Collie.
So that's going to be fun!
Dependencies: #8 Next: #12 Related: #5
OK, now finally let's try to assign a spatial location to these
They seem to be categorised by the land divisions - which are noted in the titles (e.g. Wellington, Swan) - so that can at least get us down to a rough spatial location.
Maybe from there we can pull town and place names from the titles and do some rough matching against the known location of the land division? Think about how we can do that!
We could be lazy and just compare against the Gazetteer of Australia? Would be nice to use Landgate data though!
(Just because self-interest and maybe a good fluffy open data story)
https://en.wikipedia.org/wiki/Lands_administrative_divisions_of_Western_Australia
Process
Potential Challenges