WikiWatershed / mmw-tiler

Tiler for Model My Watershed
Apache License 2.0

AWS 2-6: Evaluate Adding Hybrid Tiling Endpoint #10

Open rajadain opened 3 months ago

rajadain commented 3 months ago

As of #9, we'll have static tiles for lower zoom levels, but we'll use regular TiTiler operation for higher zoom levels. This should be transparent to the front-end, which should hit only one endpoint. This card is to create that proxy endpoint which handles these decisions.

Because TiTiler is cloud-native, with each tile request spinning up its own Lambda, it makes sense to add the proxy endpoint to TiTiler itself (or more precisely, to the TiTiler-MosaicJSON fork).
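As a sketch of the decision such a proxy endpoint would make (the zoom cutoff, bucket, and URL shapes here are placeholders, not decided values):

```typescript
// Illustrative only: the cutoff and URLs below are assumptions.
const STATIC_MAX_ZOOM = 12; // highest zoom with pregenerated static tiles (assumed)

// Pick the tile source for a request: pregenerated static tiles for lower
// zooms, dynamic TiTiler rendering for higher zooms.
function tileUrl(year: number, z: number, x: number, y: number): string {
  return z <= STATIC_MAX_ZOOM
    ? `https://example-static-tiles.s3.amazonaws.com/${year}/${z}/${x}/${y}.png`
    : `https://titiler.example.com/tiles/${z}/${x}/${y}.png`;
}
```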

This card is to:

KlaasH commented 2 months ago

After some conversations and some experimentation in the staging AWS console, I have a picture of how I think this should work. It's not exactly what's described here and on issue #11, but I think it would achieve the goal in a way that takes advantage of existing resources and would be reasonably straightforward to implement and maintain.

The big difference is that rather than modifying TiTiler-MosaicJSON to return cached tiles and write generated tiles to the cache bucket, this approach leaves the TiTiler-MosaicJSON API unchanged and puts a CloudFront distribution in front of it that sends some tile requests on to the API but serves others from an S3 bucket we manually populate with pregenerated tiles.

Infrastructure

So the infrastructure would be:

This would have some advantages and some disadvantages compared with adding the cache behavior into TiTiler-MosaicJSON itself. Advantages:

Deployment

The question of how to deploy the new infrastructure while keeping the existing TiTiler-MosaicJSON endpoint is a bit tricky. The current deployment uses https://github.com/Element84/filmdrop-aws-tf-modules, so this repo only includes some config files and a GitHub Actions workflow that checks out and applies the FilmDrop deployment code. That's pretty slick and provides a lot of functionality without much new code or config, but it doesn't lend itself especially well to adding additional infrastructure or modifying what's included in the FilmDrop deployment beyond toggling which major components are desired.

Specifically, the challenge here is that we want an S3 bucket that the FilmDrop deployment doesn't provide, and we want the CloudFront distribution that sits in front of the TiTiler service to have an additional origin and some custom behaviors.

One way to make that happen would be to raid filmdrop-aws-tf-modules for the parts of the code/config that we're actually using and copy them over to this repo, i.e. make a copy/paste/modify fork of only the parts we want, then add the new stuff to that.

The other possibility seems a little messier but would preserve the benefit of relying on filmdrop-aws-tf-modules for the TiTiler deployment (and would keep the FilmDrop UI console, which is active on staging from the current deployment setup but which we presumably wouldn't bring over if we were copying parts by hand): create new Terraform files in this repo that get applied along with the FilmDrop deployment. So, e.g., we add a file like hybrid_tiler.tf in the same directory where the current build copies the filmdrop-aws-tf-modules files (I would advocate doing this in a terraform/ subdirectory rather than at the project root), and when the CI job runs plan and apply, the new resources are included.
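For example, the checkout the CI produces could end up looking something like this (names are illustrative):

```
terraform/
├── ...files checked out from filmdrop-aws-tf-modules by the CI workflow...
└── hybrid_tiler.tf   # new resources: tile cache bucket, CloudFront additions
```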

The first big hurdle for the latter approach is the CloudFront distribution: there already is one, so we would need to either switch that part of the deployment off and create our own (reducing the benefit of using the existing deployment framework) or find a way to modify it.

One other note re deployment: most of the MMW resources are in us-east-1, but the TiTiler/FilmDrop deployment is in us-west-2, I assume because that's the default in the templates. We would probably want to switch to using us-east-1 for everything, to avoid confusion.

Proof-of-concept

I manually created a proof-of-concept implementation in the staging AWS account, so if you change the "IO Global LULC 2023" layer config in your MMW instance to set 'url': 'https://d9hcypg7gthru.cloudfront.net/2023/{z}/{x}/{y}.png' and 'maxNativeZoom': 18, you can see the hybrid tiler in action. The resources are this CloudFront distribution and this S3 bucket.

KlaasH commented 2 months ago

@rajadain since this isn't a PR but the next step is review, I added you as an additional assignee.

rajadain commented 2 months ago

👏 👏 Fantastic work! Really detailed and well-written plan.

Here are some thoughts:

  1. I agree that doing this outside of FilmDrop UI / TiTiler-MosaicJSON seems like the right move at this point. Be it CloudFront or something else, the more we can do without learning and changing a complex codebase, the better.
  2. It would be easier for MMW if there were a single endpoint to hit, rather than multiple. I like that the above proposal has a single CloudFront distribution, but that could be implemented elsewhere too.
  3. I prefer the check-S3-first-then-fetch-from-source pattern over the fetch-from-S3-if-this-zoom-or-from-source-if-that-zoom pattern, for two reasons:
    1. The former is not tied to the zoom level, and is thus more flexible. If in the future we decide we need to pre-cache additional zoom levels, we can just upload those tiles to S3 without having to touch the codebase.
    2. The latter does not cache to S3. It either caches to CloudFront or not at all, which is different enough from the S3 behavior to potentially cause issues at runtime.

Given the above, we should timebox checking whether the check-S3-first-then-fetch-from-source pattern can be implemented in a CloudFront Function or Lambda@Edge. The main issue I foresee there is writing to S3, which we would have to do for every tile fetched from the source on an S3 miss. I don't know if that can be done from CloudFront Functions or Lambda@Edge.
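For reference, CloudFront Functions can't make network calls at all, so the write-back half of the pattern would have to live in Lambda@Edge (which is a regular Lambda and can call S3, given the right IAM permissions). One wrinkle: an origin-response trigger can't read the origin's response body, so on a miss the function would have to fetch the tile itself. Below is a minimal sketch of the pattern as a Lambda@Edge origin-request handler; the bucket, endpoint, and region are placeholders, not decided values:

```typescript
// Minimal sketch only. Bucket, endpoint, and region are placeholders.
// Lambda@Edge functions can't use environment variables, hence the constants.
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import type { CloudFrontRequestHandler } from "aws-lambda";

const BUCKET = "example-pregenerated-tiles";  // placeholder
const SOURCE = "https://titiler.example.com"; // placeholder TiTiler endpoint
const s3 = new S3Client({ region: "us-west-2" });

const pngResponse = (body: Buffer) => ({
  status: "200",
  headers: { "content-type": [{ key: "Content-Type", value: "image/png" }] },
  body: body.toString("base64"),
  bodyEncoding: "base64" as const,
});

export const handler: CloudFrontRequestHandler = async (event) => {
  const request = event.Records[0].cf.request;
  const key = request.uri.replace(/^\//, ""); // e.g. "2023/10/301/385.png"

  // 1. Check S3 first.
  try {
    const cached = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
    return pngResponse(Buffer.from(await cached.Body!.transformToByteArray()));
  } catch {
    // Not in the cache bucket; fetch from the source.
  }

  // 2. On a miss, fetch the tile from TiTiler ourselves (an origin-response
  //    trigger can't read the origin's body, so the fetch happens here).
  const upstream = await fetch(`${SOURCE}/${key}`);
  if (!upstream.ok) {
    return request; // fall through to the configured origin unchanged
  }
  const tile = Buffer.from(await upstream.arrayBuffer());

  // 3. Write it back to S3 so the next request is a cache hit.
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET, Key: key, Body: tile, ContentType: "image/png",
  }));

  // Generated responses at origin-facing events are capped at ~1 MB,
  // which is plenty for a 256x256 PNG tile.
  return pngResponse(tile);
};
```

One deployment note: Lambda@Edge functions must be created in us-east-1 (CloudFront replicates them to the edge), which is worth factoring into the region discussion below.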

If that's not easily and obviously possible, we should pivot to adding this new endpoint to Windshaft in the current implementation of MMW: https://github.com/WikiWatershed/model-my-watershed/tree/develop/src/tiler. For a given layer code, the Windshaft server would query the TiTiler CloudFront (instead of Postgres, as it currently does for all other layer codes), and use the S3 cache just as it does for the other layers.

The main disadvantage of using Windshaft for the above is losing the massive parallelization we get with TiTiler, which, being Lambda-driven, can scale horizontally to a great extent. But Windshaft has worked well enough for us so far, so it's quite possible it will continue to scale for a while before reaching its limit.

rajadain commented 2 months ago

If we go with Windshaft, since that is running in us-east-1, we should move this entire deployment to us-east-1. If we can use CloudFront Functions or Lambda@Edge, then we could stay in us-west-2.

The reason for putting this in us-west-2 was that the underlying data (Impact Observatory Annual LULC) is stored there. Since it is in the AWS Open Data registry, we don't have to pay egress fees on it, but accessing it from a different region would add latency. If we're exposing the data to the internet, doing it sooner is better.

Let me know if the above makes sense.

KlaasH commented 2 months ago

Summarizing the huddle @rajadain and I had this afternoon:

We decided to go with the Windshaft approach. That will require implementing a new route and method in our Windshaft deployment that gets tiles from an external endpoint rather than from the database.

The rollout will be a little bit complex, because the mosaic UUIDs will need to be generated for the production environment and provided to Windshaft. So we'll need to do the TiTiler-MosaicJSON deployment in advance, then do the manual mosaic creation and tile pregeneration, then provide the endpoint and year->UUID mapping to the production Windshaft deployment via Ansible variables.
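To make that concrete, here's a rough sketch of the shape the new Windshaft-side route could take. It's an illustrative Express-style handler rather than actual Windshaft wiring, and the endpoint, route, tile URL shape, and UUID are all assumptions; the S3 cache read/write (done the same way as for the other layers) is omitted:

```typescript
// Illustrative Express-style sketch; the real tiler wires routes through
// Windshaft, and every name, URL shape, and UUID below is an assumption.
import express from "express";

// In the plan these come in via Ansible variables; hardcoded for the sketch.
const TITILER_ENDPOINT = "https://dxxxxxxxxxxxx.cloudfront.net"; // placeholder
const YEAR_TO_MOSAIC: Record<string, string> = {
  "2023": "00000000-0000-0000-0000-000000000000", // placeholder UUID
};

const app = express();

// New route: proxy tiles from the external TiTiler endpoint instead of
// rendering from Postgres.
app.get("/io-lulc/:year/:z/:x/:y.png", async (req, res) => {
  const { year, z, x, y } = req.params;
  const mosaic = YEAR_TO_MOSAIC[year];
  if (!mosaic) {
    res.status(404).send(`No mosaic configured for year ${year}`);
    return;
  }

  // Tile URL shape for TiTiler-MosaicJSON is an assumption here.
  const upstream = await fetch(
    `${TITILER_ENDPOINT}/mosaicjson/${mosaic}/tiles/${z}/${x}/${y}.png`
  );
  if (!upstream.ok) {
    res.status(upstream.status).end();
    return;
  }
  res.type("png").send(Buffer.from(await upstream.arrayBuffer()));
});

app.listen(4000);
```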

Re what region to deploy to: as noted above, the data is in us-west-2 but the Windshaft servers are in us-east-1. Since access to the data is free but egress fees would apply when we call the TiTiler endpoint from Windshaft, deploying to us-east-1 would be preferable. However, there's a chance that would slow the TiTiler endpoint down because it would have to get the underlying data it's using to make the tiles from farther away. Probably the way to handle this question is to deploy the production TiTiler service to us-east-1 and see if the latency is substantially worse than it is on the staging endpoint, then decide whether to keep it there or pull it down and redeploy it in us-west-2.