Open slifty opened 3 years ago
Did some digging, here are the results so far.
The prototype we created so far uses the Nominatim API, which is ultimately powered by OpenStreetMap. This is fine, but they have a usage policy which makes it fairly clear that they don't really want full-blown ETL pipelines built against it.
That said, our data volume is on the lower end of things (e.g. thousands of addresses rather than millions). This means it might be possible for us to use the OSM Nominatim API while staying in the spirit of their terms, but it would require a bit of engineering (in particular: caching and the ability to handle unexpected application of rate limits for larger batches).
The Nominatim API also might dramatically slow the ETL pipeline -- their usage policy says they want no more than 1 request per second (though later it does say small batch jobs are OK as well). At that rate, every 3,600 uncached addresses add about an hour to a run.
Using the existing API, as the prototype does, can be done. There is a bit of overhead associated with being a good FOSS citizen (and of course we risk them shutting down our requests if we don't follow their rules). Specifically, we would want to:
Figure out a way to cache geocoded data -- possibly treating geocoding as a preprocessing step that is only run once on a given column and then written back to the original data file and ultimately committed back to the SVN repo.
Make sure that ETL does not run multiple addresses through the API at once, meaning we only submit one request at a time.
Ensure that we never have infrastructure where multiple batches of addresses run at the same time (I don't think this will be a problem, but it's still a policy with long-term technical implications).
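To make the caching and one-request-at-a-time points above a bit more concrete, here is a rough sketch, not a proposal for the final design. It wraps any geocoding function with an on-disk JSON cache and a minimum delay between real requests. The class name `CachedGeocoder`, the cache file layout, and the address normalization are all assumptions made up for illustration; the wrapped function is assumed to return a `(lat, lon)` pair or `None`.

```python
import json
import time
from pathlib import Path


class CachedGeocoder:
    """Wrap a geocode callable with a JSON file cache and a minimum delay.

    Hypothetical helper: `geocode` is any function that takes an address
    string and returns a (lat, lon) pair or None.
    """

    def __init__(self, geocode, cache_path, min_delay_seconds=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._geocode = geocode
        self._cache_path = Path(cache_path)
        self._min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = None
        # Load results from earlier runs so they are never requested again.
        if self._cache_path.exists():
            self._cache = json.loads(self._cache_path.read_text())
        else:
            self._cache = {}

    def lookup(self, address):
        # Assumed normalization: collapse whitespace and lowercase.
        key = " ".join(address.split()).lower()
        if key in self._cache:
            return self._cache[key]

        # Requests are strictly sequential; wait out the remaining delay.
        if self._last_call is not None:
            wait = self._min_delay - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)

        result = self._geocode(address)
        self._last_call = self._clock()

        # Store as a list so fresh and reloaded results look the same.
        self._cache[key] = list(result) if result is not None else None
        self._cache_path.write_text(
            json.dumps(self._cache, indent=2, sort_keys=True)
        )
        return self._cache[key]
```

With GeoPy, the wrapped function could be a small adapter around a geocoder's `geocode` method that returns `(location.latitude, location.longitude)`. GeoPy also ships its own `RateLimiter` in `geopy.extra.rate_limiter`, which may be a better fit than the hand-rolled delay here. The cache file could be the thing that gets committed back to the repo, if we go the preprocessing route.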
This can be done! It is a big undertaking and we won't want to do it, but just in case somehow that changes, the instructions are here.
There are a few third-party APIs, some of which use Nominatim under the hood plus their own mix of spices / other FOSS tools.
Ideally we could use something that GeoPy supports out of the box; that way it's easy to swap in something truly FOSS (e.g. Nominatim) at any point in time.
Good news everyone -- It looks like OpenCage is supported by GeoPy.
I don't think the service itself is FOSS, though they do publish a lot of their code.
Importantly: switching geocode providers should be pretty darn simple thanks to GeoPy. Just as importantly, the data is open.
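As a rough illustration of why the provider swap should be cheap, here is a minimal sketch that picks the GeoPy geocoder class by name, so moving between OpenCage and Nominatim becomes a config change. `get_geocoder_for_service` is GeoPy's own lookup function; everything else (the `PROVIDERS` table, the `OPENCAGE_API_KEY` and `GEOCODER_USER_AGENT` environment variable names, the default user agent string) is made up for this example.

```python
import os

# Hypothetical per-provider constructor settings. The environment variable
# names and the default user agent are assumptions, not project settings.
PROVIDERS = {
    "nominatim": lambda env: {
        "user_agent": env.get("GEOCODER_USER_AGENT", "etl-geocoding-sketch"),
    },
    "opencage": lambda env: {
        "api_key": env["OPENCAGE_API_KEY"],
    },
}


def geocoder_settings(service, env=os.environ):
    """Return the constructor kwargs for the named provider."""
    try:
        return PROVIDERS[service](env)
    except KeyError as exc:
        raise ValueError(
            f"unknown provider or missing setting: {exc}"
        ) from exc


def build_geocoder(service, env=os.environ):
    """Instantiate the GeoPy geocoder registered under `service`."""
    # Imported lazily so the settings helper works without GeoPy installed.
    from geopy.geocoders import get_geocoder_for_service

    geocoder_class = get_geocoder_for_service(service)
    return geocoder_class(**geocoder_settings(service, env))
```

The ETL would then call something like `build_geocoder("opencage").geocode(address)`, and switching to `"nominatim"` would not touch the calling code, since both classes expose the same `geocode` method.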
As part of our maps exploration we spent some time geocoding data; there's a desire to actually do that geocoding as part of the ETL pipeline.
This would give us a few things:
As part of this issue we should try to leverage the R&D done in that analysis repository, though it may turn out that the tools used there aren't a perfect fit for our ETL.