bcgov / cthub

Apache License 2.0

CTHUB - List of geographic locations #322

Closed katerinkus closed 2 months ago

katerinkus commented 4 months ago

**Describe the task**
The GER dataset, and likely other datasets, contains community names. To find typos, the easiest approach is to first obtain a list of all communities in BC. The goal of this task is to research where to obtain such a list, updated regularly by Stats Canada or the BC Government, that we can access via API.

**Purpose**
To find typos in geographic name spellings.

**Timebox**
1 day

**Acceptance Criteria**

**Additional context**
A full list can be found here: Stats Canada. BC also has the BC Geographical Names portal here. Both are problematic: the former is a CSV download, and the latter does not appear to have an API. But if we do not find anything better, we can use the former.

ArawuSamuel1 commented 3 months ago

Hey team! Please add your planning poker estimate with Zenhub @emi-hi @JulianForeman @tim738745

emi-hi commented 3 months ago

There is an API that the BC Geographical Names portal uses that we can also use, but I can't find anything about sending batches of names; if the user is uploading a file with a few hundred locations, it might take a lot of time to send a request to the API for each record. On the plus side, however, it does include lats and longs. https://openapi.apps.gov.bc.ca/?url=https://raw.githubusercontent.com/bcgov/api-specs/master/bcgnws/bcgnws.json#/

After talking to Tim about what we think is going to be done with this data, we are leaning more towards keeping the data within the system and doing joins to check spelling. We can probably get lats and longs this way too. Using an API would ensure the data is up to date, but it would be a bit more work to keep the integration maintained, and I'm not sure communities/cities/etc. change names as often as other geographic names do.
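The join-style check described here can be as simple as a set lookup against a locally stored list of place names. A minimal sketch, with purely illustrative data and function names:

```python
# Hypothetical sketch of a local spelling check: compare uploaded community
# names against a list of known BC place names stored in our system.

def find_possible_typos(uploaded_names, known_names):
    """Return uploaded names that don't match any known BC place name."""
    known = {name.strip().lower() for name in known_names}
    return [n for n in uploaded_names if n.strip().lower() not in known]

# Example with made-up data: one misspelled name is flagged.
known = ["Victoria", "Kelowna", "Prince George"]
uploaded = ["Victoria", "Kelowna", "Prince Gorge"]
print(find_possible_typos(uploaded, known))  # ['Prince Gorge']
```

In a Django-style system this same comparison could also be done as a LEFT JOIN against the place-names table, flagging rows with no match.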

emi-hi commented 3 months ago

@tim738745 @katerinkus

katerinkus commented 3 months ago

@emi-hi Sounds great! Many thanks for finding it. Would it be possible to update the data in our system every year or so?

emi-hi commented 3 months ago

The gazetteer is a downloadable file that contains all of the place names in BC, including cities, communities, localities, reservations, etc. I filtered out all of the rocks, mountains, points, and other irrelevant data and am left with 3,264 names. It does contain lats/longs, so it could be used to mark points in Metabase.

https://catalogue.data.gov.bc.ca/dataset/d92224ee-03ef-4904-be53-b677d8e01ac4
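If we go the download route, the filtering step could look roughly like this. Note the column names below ("Feature Type", "Geographical Name", etc.) and the list of kept types are assumptions; check them against the actual CSV header and the ~18 types Emily identified.

```python
import csv

# Feature types to keep -- an illustrative subset, not the full list.
KEEP_TYPES = {"City", "Community", "Locality", "Village", "Town"}

def load_place_names(path):
    """Read the gazetteer CSV and keep only settlement-type features.

    Column names here are assumptions about the download's header row.
    """
    places = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["Feature Type"] in KEEP_TYPES:
                places.append((
                    row["Geographical Name"],
                    float(row["Latitude"]),
                    float(row["Longitude"]),
                ))
    return places
```

The (name, lat, long) tuples could then be loaded into a table for the join-based check, or used to plot points in Metabase.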

tim738745 commented 3 months ago

@katerinkus @emi-hi Just to add to what Emily wrote: we thought that to check a community name, we would have to make a request to the API with the name and with the "exactSpelling" flag set to true, which means we would have to query the API once for each record in an uploaded spreadsheet. This is less scalable than using a downloaded data set, which is why we were leaning towards using a download.

But I realized another way of doing the checks would be to: (1) send the API a string of community names (with exactSpelling set to false); it returns a list of features that match the names according to its own internal matching logic; then (2) iterate through our community names and see which ones are in the list (meaning no typo) and which ones are not (meaning a likely typo).

Then we would only have to do one query per upload, and we wouldn't have to maintain our own dataset of geographic locations.
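The two-step check described above could be sketched as follows. The endpoint and parameter names are taken from the bcgnws OpenAPI spec linked earlier, but treat the exact URL and the response shape as assumptions to verify against that spec.

```python
import json
import urllib.parse
import urllib.request

# Assumed search endpoint from the bcgnws OpenAPI spec -- verify before use.
SEARCH_URL = "https://apps.gov.bc.ca/pub/bcgnws/names/search"

def fetch_matching_names(community_names):
    """Step 1: one request per upload, fuzzy-matching all names at once."""
    query = urllib.parse.urlencode({
        "name": " ".join(community_names),
        "exactSpelling": 0,
        "outputFormat": "json",
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}") as resp:
        payload = json.load(resp)
    # Assumes a GeoJSON-style response: features with a "name" property.
    return {f["properties"]["name"] for f in payload.get("features", [])}

def split_by_match(community_names, matched_names):
    """Step 2: names found in the API result vs. likely typos."""
    ok = [n for n in community_names if n in matched_names]
    typos = [n for n in community_names if n not in matched_names]
    return ok, typos
```

With this shape, an upload of a few hundred rows costs one API round trip instead of hundreds, which was the scalability concern with the per-record exactSpelling approach.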

Sorry for changing my mind, but now I'm leaning more towards using the API!

emi-hi commented 3 months ago

Okay, I'll try to figure out how to restrict the query to just communities/cities etc. so we only get those in the list; so far I haven't had any luck!

tim738745 commented 3 months ago

@emi-hi I think you can use the featureType parameter of the API (featureType=1 means city, featureType=2 means community; see the /featureTypes endpoint). So we can do two search queries, one for cities and one for communities, each with the same search string.

Alternatively, we can use featureType=* (the default) and process the query result on our end to exclude any features that are not of the types we want.
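The client-side filtering suggested here is a one-liner over the query result. The response shape (GeoJSON-style features carrying a "featureType" property) is an assumption to check against the spec, and the set of wanted types below is illustrative:

```python
# Illustrative subset; the real list would grow to the ~18 types we need.
WANTED_TYPES = {"City", "Community", "Locality"}

def filter_by_feature_type(features, wanted=WANTED_TYPES):
    """Keep only features whose type is one we care about.

    Assumes each feature is a dict with a "properties" mapping that
    includes a "featureType" key -- verify against the bcgnws spec.
    """
    return [
        f for f in features
        if f.get("properties", {}).get("featureType") in wanted
    ]
```

This keeps the API usage to a single featureType=* query while letting us adjust the kept types without changing the request.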

emi-hi commented 3 months ago

I think we are going to need 15-20 feature types, so maybe processing on our end will be quickest.

emi-hi commented 3 months ago

These are the options; I've narrowed it down to 18 but might need to make some adjustments.

(screenshot: feature type options)
shayjeff commented 2 months ago

Reviewed