WikiWatershed / mmw-tiler

Tiler for Model My Watershed
Apache License 2.0

AWS 2-6: Evaluate Adding Hybrid Tiling Endpoint #10

Open rajadain opened 3 months ago

rajadain commented 3 months ago

As of #9, we'll have static tiles for lower zoom levels, but we'll use regular TiTiler operation for higher zoom levels. This should be transparent to the front-end, which should hit only one endpoint. This card is to create that proxy endpoint which handles these decisions.

Because TiTiler is cloud-native, with each tile request spinning up its own Lambda, it makes sense to add the proxy endpoint to TiTiler itself (or more precisely, to the TiTiler-MosaicJSON fork).
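As a sketch of the decision such a proxy endpoint would make (the zoom cutoff, bucket, and URL shapes here are placeholders, not decided values):

```typescript
// Illustrative only: the cutoff and URLs below are assumptions.
const STATIC_MAX_ZOOM = 12; // highest zoom with pregenerated static tiles (assumed)

// Pick the tile source for a request: pregenerated static tiles for lower
// zooms, dynamic TiTiler rendering for higher zooms.
function tileUrl(year: number, z: number, x: number, y: number): string {
  return z <= STATIC_MAX_ZOOM
    ? `https://example-static-tiles.s3.amazonaws.com/${year}/${z}/${x}/${y}.png`
    : `https://titiler.example.com/tiles/${z}/${x}/${y}.png`;
}
```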

This card is to:

KlaasH commented 2 months ago

After some conversations and some experimentation in the staging AWS console, I have a picture of how I think this should work. It's not exactly what's described here and on issue #11, but I think it would achieve the goal in a way that takes advantage of existing resources and would be reasonably straightforward to implement and maintain.

The big difference is that rather than modifying TiTiler-MosaicJSON to return cached tiles and write generated tiles to the cache bucket, this approach leaves the TiTiler-MosaicJSON API unchanged and puts a CloudFront distribution in front of it that sends some tile requests on to the API but serves others from an S3 bucket we manually populate with pregenerated tiles.

Infrastructure

So the infrastructure would be:

This would have some advantages and some disadvantages compared with adding the cache behavior into TiTiler-MosaicJSON itself. Advantages:

Deployment

The question of how to deploy the new infrastructure while keeping the existing TiTiler-MosaicJSON endpoint is a bit tricky. The current deployment uses https://github.com/Element84/filmdrop-aws-tf-modules, so this repo only includes some config files and a GitHub Actions workflow that checks out and applies the FilmDrop deployment code. That's pretty slick and provides a lot of functionality without much new code or config, but it doesn't lend itself especially well to adding additional infrastructure or modifying what's included in the FilmDrop deployment beyond toggling which major components are desired.

Specifically, the challenge here is that we want an S3 bucket that the FilmDrop deployment doesn't provide, and we want the CloudFront distribution that sits in front of the TiTiler service to have an additional origin and some custom behaviors.

One way to make that happen would be to raid filmdrop-aws-tf-modules for the parts of the code/config that we're actually using and copy them over to this repo, i.e. make a copy/paste/modify fork of only the parts we want, then add the new stuff to that.

The other possibility seems a little messier but would preserve the benefit of relying on filmdrop-aws-tf-modules for the TiTiler deployment (and would keep the FilmDrop UI console, which is active on staging from the current deployment setup but which we presumably wouldn't bring over if we were copying parts by hand): create new Terraform files in this repo that get applied along with the FilmDrop deployment. So, e.g., we add a file like hybrid_tiler.tf in the same directory where the current build copies the filmdrop-aws-tf-modules files (I would advocate doing this in a terraform/ subdirectory rather than at the project root), and when the CI job runs plan and apply, the new resources are included.
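For example, the checkout the CI produces could end up looking something like this (names are illustrative):

```
terraform/
├── ...files checked out from filmdrop-aws-tf-modules by the CI workflow...
└── hybrid_tiler.tf   # new resources: tile cache bucket, CloudFront additions
```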

The first big hurdle for the latter approach is the CloudFront distribution: there already is one, so we would need to either switch that part of the deployment off and create our own (reducing the benefit of using the existing deployment framework) or find a way to modify it.

One other note re deployment: most of the MMW resources are in us-east-1, but the TiTiler/FilmDrop deployment is in us-west-2, I assume because that's the default in the templates. We would probably want to switch to using us-east-1 for everything, to avoid confusion.

Proof-of-concept

I manually created a proof-of-concept implementation in the staging AWS account, so if you change the "IO Global LULC 2023" layer config in your MMW instance to set 'url': 'https://d9hcypg7gthru.cloudfront.net/2023/{z}/{x}/{y}.png' and 'maxNativeZoom': 18, you can see the hybrid tiler in action. The resources are this CloudFront distribution and this S3 bucket.

KlaasH commented 2 months ago

@rajadain since this isn't a PR but the next step is review, I added you as an additional assignee.

rajadain commented 2 months ago

👏 👏 Fantastic work! Really detailed and well-written plan.

Here are some thoughts:

  1. I agree that doing this outside of FilmDrop UI / TiTiler-MosaicJSON seems like the right move at this point. Be it CloudFront or something else, the more we can do without learning and changing a complex codebase, the better.
  2. It would be easier for MMW if there were a single endpoint to hit, rather than multiple. I like that the above proposal has a single CloudFront distribution, but that could be implemented elsewhere too.
  3. I prefer the check-S3-first-then-fetch-from-source pattern over the fetch-from-S3-if-this-zoom-or-from-source-if-that-zoom pattern, for two reasons:
    1. The former is not tied to the zoom level, and is thus more flexible. If in the future we decide we need to pre-cache additional zoom levels, we can just upload those tiles to S3 without having to touch the codebase.
    2. The latter does not cache to S3. It either caches to CloudFront or not at all, which is different enough from the S3 behavior to potentially cause issues at runtime.

Given the above, we should timebox checking whether the check-S3-first-then-fetch-from-source pattern can be implemented in a CloudFront Function or Lambda@Edge. The main issue I foresee there is writing to S3, which we would have to do for every tile fetched from the source on an S3 miss. I don't know if that can be done from CloudFront Functions or Lambda@Edge.
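For reference, CloudFront Functions can't make network calls at all, so the write-back half of the pattern would have to live in Lambda@Edge (which is a regular Lambda and can call S3, given the right IAM permissions). One wrinkle: an origin-response trigger can't read the origin's response body, so on a miss the function would have to fetch the tile itself. Below is a minimal sketch of the pattern as a Lambda@Edge origin-request handler; the bucket, endpoint, and region are placeholders, not decided values:

```typescript
// Minimal sketch only. Bucket, endpoint, and region are placeholders.
// Lambda@Edge functions can't use environment variables, hence the constants.
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import type { CloudFrontRequestHandler } from "aws-lambda";

const BUCKET = "example-pregenerated-tiles";  // placeholder
const SOURCE = "https://titiler.example.com"; // placeholder TiTiler endpoint
const s3 = new S3Client({ region: "us-west-2" });

const pngResponse = (body: Buffer) => ({
  status: "200",
  headers: { "content-type": [{ key: "Content-Type", value: "image/png" }] },
  body: body.toString("base64"),
  bodyEncoding: "base64" as const,
});

export const handler: CloudFrontRequestHandler = async (event) => {
  const request = event.Records[0].cf.request;
  const key = request.uri.replace(/^\//, ""); // e.g. "2023/10/301/385.png"

  // 1. Check S3 first.
  try {
    const cached = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
    return pngResponse(Buffer.from(await cached.Body!.transformToByteArray()));
  } catch {
    // Not in the cache bucket; fetch from the source.
  }

  // 2. On a miss, fetch the tile from TiTiler ourselves (an origin-response
  //    trigger can't read the origin's body, so the fetch happens here).
  const upstream = await fetch(`${SOURCE}/${key}`);
  if (!upstream.ok) {
    return request; // fall through to the configured origin unchanged
  }
  const tile = Buffer.from(await upstream.arrayBuffer());

  // 3. Write it back to S3 so the next request is a cache hit.
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET, Key: key, Body: tile, ContentType: "image/png",
  }));

  // Generated responses at origin-facing events are capped at ~1 MB,
  // which is plenty for a 256x256 PNG tile.
  return pngResponse(tile);
};
```

One deployment note: Lambda@Edge functions must be created in us-east-1 (CloudFront replicates them to the edge), which is worth factoring into the region discussion below.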

If that's not easily and obviously possible, we should pivot to adding this new endpoint to Windshaft in the current implementation of MMW: https://github.com/WikiWatershed/model-my-watershed/tree/develop/src/tiler. For a given layer code, the Windshaft server would query the TiTiler CloudFront (instead of Postgres, as it currently does for all other layer codes), and use the S3 cache just as it does for the other layers.

The main disadvantage of using Windshaft for the above is losing the massive parallelization we get with TiTiler, which, being Lambda-driven, can scale horizontally to a great extent. But Windshaft has worked well enough for us so far, so it's quite possible it will continue to scale for a while before reaching its limit.

rajadain commented 2 months ago

If we go with Windshaft, since that is running in us-east-1, we should move this entire deployment to us-east-1. If we can use CloudFront Functions or Lambda@Edge, then we could stay in us-west-2.

The reason for putting this in us-west-2 was that the underlying data (Impact Observatory Annual LULC) is stored there. Since it is in the AWS Open Data registry, we don't have to pay egress fees on it, but accessing it from a different region would add latency. If we're exposing the data to the internet, doing it sooner is better.

Let me know if the above makes sense.

KlaasH commented 2 months ago

Summarizing the huddle @rajadain and I had this afternoon:

We decided to go with the Windshaft approach. That will require implementing a new route and method in our Windshaft deployment that gets tiles from an external endpoint rather than from the database.

The rollout will be a little bit complex, because the mosaic UUIDs will need to be generated for the production environment and provided to Windshaft. So we'll need to do the TiTiler-MosaicJSON deployment in advance, then do the manual mosaic creation and tile pregeneration, then provide the endpoint and year->UUID mapping to the production Windshaft deployment via Ansible variables.
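To make that concrete, here's a rough sketch of the shape the new Windshaft-side route could take. It's an illustrative Express-style handler rather than actual Windshaft wiring, and the endpoint, route, tile URL shape, and UUID are all assumptions; the S3 cache read/write (done the same way as for the other layers) is omitted:

```typescript
// Illustrative Express-style sketch; the real tiler wires routes through
// Windshaft, and every name, URL shape, and UUID below is an assumption.
import express from "express";

// In the plan these come in via Ansible variables; hardcoded for the sketch.
const TITILER_ENDPOINT = "https://dxxxxxxxxxxxx.cloudfront.net"; // placeholder
const YEAR_TO_MOSAIC: Record<string, string> = {
  "2023": "00000000-0000-0000-0000-000000000000", // placeholder UUID
};

const app = express();

// New route: proxy tiles from the external TiTiler endpoint instead of
// rendering from Postgres.
app.get("/io-lulc/:year/:z/:x/:y.png", async (req, res) => {
  const { year, z, x, y } = req.params;
  const mosaic = YEAR_TO_MOSAIC[year];
  if (!mosaic) {
    res.status(404).send(`No mosaic configured for year ${year}`);
    return;
  }

  // Tile URL shape for TiTiler-MosaicJSON is an assumption here.
  const upstream = await fetch(
    `${TITILER_ENDPOINT}/mosaicjson/${mosaic}/tiles/${z}/${x}/${y}.png`
  );
  if (!upstream.ok) {
    res.status(upstream.status).end();
    return;
  }
  res.type("png").send(Buffer.from(await upstream.arrayBuffer()));
});

app.listen(4000);
```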

Re what region to deploy to: as noted above, the data is in us-west-2 but the Windshaft servers are in us-east-1. Since access to the data is free but egress fees would apply when we call the TiTiler endpoint from Windshaft, deploying to us-east-1 would be preferable. However, there's a chance that would slow the TiTiler endpoint down because it would have to get the underlying data it's using to make the tiles from farther away. Probably the way to handle this question is to deploy the production TiTiler service to us-east-1 and see if the latency is substantially worse than it is on the staging endpoint, then decide whether to keep it there or pull it down and redeploy it in us-west-2.