OpenTechStrategies / torque-sites

Open source code specific to OTS-managed Torque sites (usually client sites).

Geocode during ETL #139

Open · slifty opened this issue 2 years ago

slifty commented 2 years ago

As part of our maps exploration we spent some time geocoding data; there's a desire to actually do that geocoding as part of the ETL pipeline.

This would give us a few things:

  1. Any application would have access to geocoded data without needing to re-transform. This also opens possibilities for search and other functions.
  2. If data can't be geocoded, we would be able to know that right away.

As part of this issue we should try to leverage the R&D done in that analysis repository, though it may turn out that the tools used there aren't a perfect fit for our ETL.

slifty commented 2 years ago

Did some digging, here are the results so far.

The Situation

The prototype we created so far uses the Nominatim API, which is ultimately powered by OpenStreetMap. This is fine, but they have a usage policy which makes it fairly clear that they don't really want full-blown ETL pipelines built against it.

That said, our data volume is on the lower end of things (thousands of addresses rather than millions). This means it might be possible for us to use the OSM Nominatim API while staying within the spirit of their terms, but it would require a bit of engineering (in particular: caching, and the ability to handle unexpected rate limiting on larger batches).

The Nominatim API also might dramatically slow the ETL pipeline -- their usage policy asks for no more than 1 request per second (though later it does say small batch jobs are OK as well).
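For reference, GeoPy ships a RateLimiter wrapper that can enforce that kind of delay. A minimal sketch, assuming a Python ETL step and a hypothetical list of address strings (the user_agent value and the sample address are placeholders, not anything from our pipeline):

```python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Identify ourselves per the Nominatim usage policy (placeholder value).
geolocator = Nominatim(user_agent="torque-sites-etl (placeholder)")

# Wrap geocode() so successive calls wait at least 1 second, per the policy.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Hypothetical input; in the real pipeline this would come from a spreadsheet column.
addresses = ["1600 Pennsylvania Ave NW, Washington, DC"]

for address in addresses:
    location = geocode(address)
    if location is None:
        print(f"Could not geocode: {address}")
    else:
        print(address, location.latitude, location.longitude)
```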

Some Options

1: Use the OSM Nominatim API

Using the existing API, as the prototype does, can be done. There is a bit of overhead associated with being a good FOSS citizen (and of course we risk them shutting down our requests if we don't follow their rules). Specifically, we would want to:

  1. Figure out a way to cache geocoded data -- possibly treating geocoding as a preprocessing step that is run only once on a given column, with the results written back to the original data file and ultimately committed back to the SVN repo (see the sketch after this list).

  2. Make sure the ETL does not run multiple addresses through the API at once, meaning we only submit one request at a time.

  3. Ensure that we never have infrastructure where we're running multiple batched address jobs at one time (I don't think this will be a problem, but it's still a policy with long-term technical implications).
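A rough sketch of what the caching/preprocessing step in item 1 could look like, assuming a simple JSON cache file keyed by the raw address string (all names here are hypothetical, not existing torque-sites code):

```python
import json
from pathlib import Path

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

CACHE_PATH = Path("geocode-cache.json")  # hypothetical location

def load_cache():
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def save_cache(cache):
    CACHE_PATH.write_text(json.dumps(cache, indent=2))

def geocode_column(addresses):
    """Geocode only addresses we haven't seen before; everything else is a cache hit."""
    cache = load_cache()
    geolocator = Nominatim(user_agent="torque-sites-etl (placeholder)")
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    for address in addresses:
        if address in cache:
            continue  # already geocoded on a previous run
        location = geocode(address)
        # Store None for failures too, so un-geocodable rows get flagged right away.
        cache[address] = (
            {"lat": location.latitude, "lng": location.longitude}
            if location else None
        )

    save_cache(cache)
    return cache
```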

2: Self-host the Nominatim API

This can be done! It is a big undertaking and we won't want to do it, but just in case that somehow changes, the instructions are here.

3: Use a paid API

There are a few third-party APIs, some of which use Nominatim under the hood plus their own mix of spices / other FOSS tools.

Ideally we could use something that GeoPy supports out of the box; that way it's easy to swap in something truly FOSS (e.g. Nominatim) at any point in time.
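One way to keep that swap cheap would be to write the ETL step against GeoPy's common geocoder interface rather than any particular provider. A hypothetical sketch (the function name and sample address are made up; rate limiting and caching from option 1 would wrap around this):

```python
from geopy.geocoders import Nominatim

def geocode_addresses(addresses, geocoder):
    """Geocode with whatever GeoPy geocoder the caller passes in.

    The ETL step never needs to know which provider is behind `geocoder`,
    so swapping providers is a one-line change at the call site.
    """
    results = {}
    for address in addresses:
        location = geocoder.geocode(address)
        results[address] = (
            (location.latitude, location.longitude) if location else None
        )
    return results

# e.g. the truly-FOSS option:
results = geocode_addresses(
    ["1600 Pennsylvania Ave NW, Washington, DC"],
    Nominatim(user_agent="torque-sites-etl (placeholder)"),
)
```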

slifty commented 2 years ago

Good news everyone -- It looks like OpenCage is supported by GeoPy.

I don't think the service itself is FOSS, though they do publish a lot of their code.

Importantly: switching geocode providers should be pretty darn simple thanks to GeoPy. Also importantly, the underlying data is open.
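For illustration, the swap in GeoPy would look roughly like this (the API key is a placeholder, and `geocode_addresses` refers to the hypothetical helper sketched above):

```python
from geopy.geocoders import OpenCage

# Same call site as before; only the geocoder object changes.
results = geocode_addresses(
    ["1600 Pennsylvania Ave NW, Washington, DC"],
    OpenCage(api_key="YOUR-OPENCAGE-API-KEY"),  # placeholder key
)
```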