NASA-IMPACT / veda-backend

Backend services for VEDA

[DISCUSSION] Backend decisions/architecture guidelines #2

Open leothomas opened 2 years ago

leothomas commented 2 years ago

Background

Since this is a bit of a green-field project I wanted to jot down some notes about how to structure the project.

Some guiding principles:

The three constituent parts are the tiling API, the STAC API, and the ingestion pipeline.

Everything after this line is up for discussion and debate.

Tiling API:

Option 1: TiTiler as an external dependency

TiTiler is available as a pip-installable dependency. To add the TiTiler routes to an existing FastAPI application:

from fastapi import FastAPI
from titiler.core.factory import TilerFactory

# Create a FastAPI application
app = FastAPI(
    description="A lightweight Cloud Optimized GeoTIFF tile server",
)

# Create a set of COG endpoints
cog = TilerFactory()

# Register the COG endpoints to the application
app.include_router(cog.router, tags=["Cloud Optimized GeoTIFF"])

This will include the following routes: https://developmentseed.org/titiler/endpoints/cog/

This makes the tiling functionality very easy and straightforward to add as an external dependency to the project, which reduces the amount of code we have to write and maintain and ensures that we can easily stay up to date with developments to the tiler.

Option 2: Fork TiTiler

I'm not sure why we would want to do this, but I'm including it as an option worthy of discussion.

STAC API:

Some unknowns about stac-fastapi:

pg_stac:

stac-fastapi is built on top of pg_stac. pg_stac provides the postgres schemas and functions necessary to implement the STAC API.

The pg_stac schemas/functions can be applied to the database using 2 options:

I believe the preferred method for applying the schemas/functions is alembic, which can be run from within a python script using the sqlalchemy/psycopg2 database connector.

The csdap-orders STAC API CDK deployment code has an example of defining a custom resource within the CDK stack. The custom resource triggers the execution of a Lambda function, which in this case runs the database setup (or "bootstrapping") using alembic. I'm a little bit fuzzy on the details, but I believe this is the overall idea.
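As a rough sketch of that idea (not the actual csdap-orders code; the handler shape, environment variable names, and `build_dsn` helper are all hypothetical), a bootstrap Lambda invoked by a CloudFormation custom resource could look something like this:

```python
import os


def build_dsn(host: str, dbname: str, user: str, password: str, port: int = 5432) -> str:
    """Assemble a SQLAlchemy/psycopg2 connection string for alembic."""
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"


def handler(event, context):
    """Entry point for a CloudFormation custom-resource Lambda.

    Runs migrations only when the custom resource is created or updated.
    """
    if event.get("RequestType") not in ("Create", "Update"):
        return {"Status": "SUCCESS"}

    # alembic is imported lazily so the module can be loaded without it installed
    from alembic import command
    from alembic.config import Config

    cfg = Config("alembic.ini")
    cfg.set_main_option(
        "sqlalchemy.url",
        build_dsn(
            host=os.environ["PGHOST"],
            dbname=os.environ["PGDATABASE"],
            user=os.environ["PGUSER"],
            password=os.environ["PGPASSWORD"],
        ),
    )
    command.upgrade(cfg, "head")  # apply all migrations up to the latest revision
    return {"Status": "SUCCESS"}
```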

Unknowns:

TiTiler-pgstac:

titiler-pgstac is an extension for TiTiler that connects directly to a pg_stac database to generate mosaics dynamically from STAC queries. This may be an interesting option to explore.

Unknowns:

Due to STAC's widespread adoption and strong specification, we want the tiler to work with any STAC API/database. This means we would like the tiler to generate mosaics from a STAC API endpoint, rather than by accessing the database directly. If the TiTiler-pgstac extension provides a STAC API endpoint, it would be easy to configure the tiler to access STAC records through any API endpoint, enabling this backend both for scientists who want to build and manage their own STAC API and for scientists who already have a STAC API up and running.

If TiTiler-pgstac does not provide a STAC API endpoint, then we should stay away from this implementation, in order to keep the tiler compatible with any STAC API.
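To make concrete what "generate mosaics from a STAC API endpoint" implies, a tiler could issue a standard item-search (`POST /search`) against any conforming STAC API. A minimal sketch, with illustrative function names and parameters:

```python
import json
from urllib import request


def build_stac_search(collections, bbox, datetime_range, limit=100):
    """Build a standard STAC API item-search body (POST /search)."""
    return {
        "collections": collections,
        "bbox": bbox,                # [west, south, east, north]
        "datetime": datetime_range,  # RFC 3339 interval, e.g. "2021-01-01/2021-12-31"
        "limit": limit,
    }


def search_items(api_url, body):
    """POST the search body to a STAC API endpoint and return the matching items."""
    req = request.Request(
        f"{api_url}/search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["features"]
```

Because the search body and endpoint are part of the STAC API spec rather than a database schema, this approach works against any STAC API, which is the compatibility goal above.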

API Gateway + Lambda vs ELB + ECS:

Paraphrasing Drew: we used to prefer Lambda for its near-infinite scalability, and for the lack of cost when there's no traffic to the application. Some drawbacks of Lambda include limited execution time and memory, and the inability to make use of multi-threading or to re-use database connections to speed up operations (since each execution may or may not take place in a new container instance). An example of this: GDAL natively implements some caching mechanisms (1, 2) in order to serve adjacent tiles simultaneously (or at least faster). This caching functionality is lost when running the tiling logic in Lambda, since each request might be served by a different container. Each tile read must then wait for the file lock to be released before locking the file itself, turning the simultaneous reads that GDAL makes possible into a single-threaded operation.

One way to get around these issues is to implement intelligent caching in the Lambda function. (Note: it's important not to directly cache the result of an API call keyed on the full set of URL parameters, since some of the parameters only affect the display options of the tile. Instead, cache the actual data that gets pulled from S3 - this way a single cached tile can serve multiple requests with different visualization parameters.)
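A minimal sketch of that separation (the function names are illustrative, and the fetch is a stand-in for an S3/COG read via rasterio or rio-tiler): the cache key contains only the data parameters, while visualization parameters stay outside it.

```python
from functools import lru_cache

# Counts "S3 reads" so the cache behaviour is visible.
FETCH_CALLS = {"count": 0}


@lru_cache(maxsize=512)
def fetch_tile_data(cog_url: str, z: int, x: int, y: int) -> bytes:
    """Cache keyed ONLY on the data parameters (which COG, which tile)."""
    FETCH_CALLS["count"] += 1
    return f"raw-pixels:{cog_url}:{z}/{x}/{y}".encode()  # stand-in for an S3 read


def render_tile(cog_url: str, z: int, x: int, y: int, colormap: str = "viridis") -> bytes:
    """Visualization parameters stay OUTSIDE the cache key."""
    data = fetch_tile_data(cog_url, z, x, y)
    return data + f"|rendered:{colormap}".encode()


# Two requests with different colormaps hit the (fake) S3 read only once:
render_tile("s3://bucket/file.tif", 8, 41, 101, colormap="viridis")
render_tile("s3://bucket/file.tif", 8, 41, 101, colormap="magma")
assert FETCH_CALLS["count"] == 1
```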

If partners accept the (relatively small) cost of an ECS service at rest (ECS has to have at least one EC2 instance running at any given time), and if the scaling rules are well defined, ECS can be favorable for the reasons listed above: better use of multi-threading, re-use of database connections, and the ability to handle larger data tiles and longer-running operations without timing out.

For all the reasons above, ECS seems to be the better option for the tiling API. If the STAC API requests are small, self-contained, and not memory-intensive, it may make sense to implement the STAC API as a Lambda + API Gateway stack. In that case we would have 2 separate FastAPI apps (the tiler deployed as ECS + ELB and the STAC API deployed as API Gateway + Lambda), although this has the obvious disadvantage of requiring 2x as much infrastructure to manage.

A distinct advantage of API Gateway is the out-of-the-box integration with Cognito user pools for user authentication/authorization (which would ensure that unauthorized users cannot write or edit records in the STAC database). I believe it might also be possible to integrate API Gateway with an ELB.

Ingestion pipeline:

The other large part of the backend is an ingestion pipeline that allows scientists to upload their data. At a minimum, this pipeline should do the following for any uploaded datafile:

With successive iterations, this pipeline can add additional processing options:

Architecture:

The proposed architecture for the ingestion pipeline would be:

Some advantages of this architecture:

Potential improvements to the architecture:

A note on repository structure:

Q: Should all three backend services (tiling API, STAC API and ingest pipeline) be their own separate repositories? A: The fact that this issue is in a repo called delta-backend seems to indicate that we've answered no to this question, but I'm not convinced. The answer is also affected by how we implement the tiling API and STAC API (ie: are both separate FastAPI apps? Is TiTiler added as an extension to stac-fastapi? Does titiler-pgstac provide a STAC API endpoint?).

As we answer these questions, we should have a clearer idea as to how to structure the backend (1 vs 3 github repos). Some things to keep in mind (or to debate):

A note on CDK structure:

Following some of these articles:

CDK stacks are composable (ie: a single stack can be made up of other stacks). I would like the backend to be deployed as a single stack composed of sub-stacks, each of which can also be deployed individually. Arbitrarily assuming that all three components stay in the same repo for the sake of this example, the file structure would look like:

tiler_api/
  |_ setup.py # runtime libs + CDK libs needed to deploy
  |_ stack.py # CDK Stack for the TiTiler app
  |_ runtime/
      |_ api/
        |_ main.py # fastapi app definition that imports TiTiler and adds it to the routes    
        |_... # other fastapi related files
stac_api/
  |_ setup.py # runtime libs + CDK libs needed to deploy
  |_ stack.py # CDK Stack for the STAC API
  |_ db/
    |_ construct.py # CDK file that generates a Construct for the RDS instance
  |_ bootstrapper/
    |_ infrastructure/ 
      |_construct.py # CDK file that generates a Construct for the Lambda function that bootstraps the postgres DB
      |_ Dockerfile 
    |_ runtime/
      |_ handler.py
ingest_pipeline/
  |_ setup.py # runtime libs + CDK libs needed to deploy
  |_ stack.py # CDK Stack for the ingest pipeline (SQS Queue + DeadLetter Queue + S3 Bucket + Trigger)
  |_ processing_lambda/
    |_ runtime/
      |_ handler.py
      |_ ... # other supporting files for processing lambda
  |_ infrastructure/
    |_ Dockerfile
    |_ construct.py
app_stack.py
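A hedged sketch of how `app_stack.py` could compose the three per-service stacks (CDK v1-era Python; the class names, construct IDs, and comments are illustrative, not the actual veda-backend code):

```python
# Illustrative only: shows CDK stack composition, not the real service definitions.
from aws_cdk import core


class TilerApiStack(core.NestedStack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        # ... TiTiler FastAPI app on ECS behind an ELB


class StacApiStack(core.NestedStack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        # ... RDS construct, bootstrapper Lambda, STAC API


class IngestPipelineStack(core.NestedStack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        # ... SQS queue + dead-letter queue + S3 bucket + trigger


class BackendStack(core.Stack):
    """Single deployable stack composed of the three sub-stacks."""

    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        TilerApiStack(self, "tiler-api")
        StacApiStack(self, "stac-api")
        IngestPipelineStack(self, "ingest-pipeline")


app = core.App()
BackendStack(app, "delta-backend")
app.synth()
```

Each `NestedStack` could also be instantiated as a top-level `Stack` in its own `cdk deploy`, which is what allows the sub-stacks to be deployed individually.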

The main aspects are:

cc/ @anayeaye @abarciauskas-bgse @olafveerman @drewbo

vincentsarago commented 2 years ago

👋 @leothomas First great ticket 👏

Tiling API:

Option 1 is def the best, and it's why we built titiler as a python module.

Stac-API

Is there an out-of-the-box integration with TiTiler somehow?

There was, but it has been removed because both services shouldn't run in the same application.

TiTiler-pgstac:

Does this extension provide a STAC API endpoint on the tiler? Or just the ability to generate mosaic's from STAC queries?

No, again stac-api and tiler application should always be separated.

eoAPI

Most of what you are describing (api + deployment) is what I've been experimenting with in https://github.com/developmentseed/eoAPI

It's a repo with both Tiler and STAC-API python application + CDK code. (note: it's not up to date)

Note:

in eoAPI we have added a hack to link the stac-api and titiler.

we have added TiTilerExtension, which adds a /collections/{collectionId}/items/{itemId}/tilejson.json endpoint https://github.com/developmentseed/eoAPI/blob/master/src/eoapi/stac/eoapi/stac/extension.py#L22 that itself redirects to titiler endpoints (the stac-api deployment needs to know the titiler endpoint url)

BUT the real hack is that we pass the STAC item along in the url (this avoids the tiler having to fetch the STAC item on each tile request) https://github.com/developmentseed/eoAPI/blob/master/src/eoapi/stac/eoapi/stac/extension.py#L100
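Roughly, that trick of carrying the item in the redirect could look like this (the function name, path, and `url` query parameter here are hypothetical, not the actual eoAPI code):

```python
import base64
import json
from urllib.parse import urlencode


def tilejson_redirect_url(titiler_endpoint: str, item: dict) -> str:
    """Build a titiler tilejson URL that carries the STAC item in the query string,
    so the tiler does not have to re-fetch the item from the STAC API per request."""
    encoded = base64.urlsafe_b64encode(json.dumps(item).encode()).decode()
    query = urlencode({"url": encoded})
    return f"{titiler_endpoint}/stac/tilejson.json?{query}"
```

The trade-off is the one Vincent notes: the stac-api deployment becomes coupled to a known titiler endpoint, and the item payload rides in every redirect URL.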

Happy to talk more about eoAPI if you need :-)

anayeaye commented 2 years ago

Sharing this sketch to discuss what I think are our current high-level architecture choices (and questions, and probably some of my own misunderstandings). With the examples in eoAPI, I think we can start building out piecewise from the RDS construct.

Base app:

Initial constructs:

  1. PgSTAC [RDS]
  2. TiTiler API (pip install until/if customization needed) [ECS]
  3. STAC-Fast API (pip install until customization needed) [ECS]

Scratch STAC-API-AWS TiTiler Add On (source)

leothomas commented 2 years ago

@anayeaye This diagram is super helpful, thanks!

I've got a couple questions/comments that I'd appreciate your input on:

anayeaye commented 2 years ago

@leothomas I hastily updated the sketch after our tag-up with @vincentsarago to make it a little less wrong. The diagrammed scenario is for a stack that runs two services (a stac-api and a tiling-api) against the same RDS PgSTAC instance. I don't think we settled on this design, but I wanted to try to capture the scenario anyway.

CHANGED:

  1. Separation of tiler and stac-api (the connection in the original drawing was wrong)
    • STAC-FastAPI is only for metadata requests and possibly transactions to load ingested metadata. Or maybe we prefer to use pypgstac to load metadata into the database and not enable transactions in STAC-FastAPI, because we don't want to expose edit operations to all API users.
    • TiTiler-PgSTAC API accepts a STAC search query, but it connects directly to the database (it does not use STAC-FastAPI)
  2. Use Lambda infrastructure for more responsive scaling.
  3. Placeholders for CloudFront(s) and Auth that I think we will want but that needs thought.

Scratch STAC-API-AWS Decoupled Services EDIT: (source)

abarciauskas-bgse commented 2 years ago

I second @vincentsarago comment that this is a great ticket @leothomas

Exciting to hear about eoAPI, I think we will be using that a lot in this and other projects 😃

And thanks for the great diagrams @anayeaye