developmentseed / eoapi-k8s

eoAPI IaC and k8s deployments for AWS, GCP, and Azure
https://eoapi.dev/
MIT License

LARGE: First Ingestion Workflow Examples #30

Closed: ranchodeluxe closed this issue 2 months ago

ranchodeluxe commented 1 year ago

Background

Async containerized job platforms offer decent UI/UX for job management, log access, and IdP auth. The goal of this ticket is to use Argo Workflows to develop a couple of simple ingestion workflows along with documentation.
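For illustration, a minimal Argo Workflow for a single ingestion step might look roughly like the sketch below. This is a shape sketch only; the image, script, and parameter names are placeholders, not the actual workflows this ticket will produce.

```yaml
# Hypothetical minimal ingestion Workflow; image, script, and parameter
# names are placeholders for illustration only.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: eoapi-ingest-            # Argo appends a random suffix per run
spec:
  entrypoint: ingest
  arguments:
    parameters:
      - name: collection
        value: example-collection        # dataset-specific, assumed
  templates:
    - name: ingest
      inputs:
        parameters:
          - name: collection
      container:
        image: ghcr.io/example/eoapi-ingest:latest   # hypothetical image
        command: [python, /app/ingest.py]            # hypothetical entrypoint
        args: ["--collection", "{{inputs.parameters.collection}}"]
```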

AC:

ranchodeluxe commented 1 year ago

@sunu: nice work on this. I saw that you had another repository up, but I think we can just add this as another helm chart in the /helm-chart folder, similar to how 2i2c packages add-ons folks can install alongside jupyterhub: https://github.com/2i2c-org/infrastructure/tree/master/helm-charts/support

If that seems gross let me know and I can do it in the future 😉
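For reference, pulling Argo Workflows into this repo as a subchart dependency could look roughly like the sketch below. The chart and repository names come from the upstream argo-helm project; the version pin and chart folder name are assumptions.

```yaml
# helm-chart/eoapi-ingest-argo/Chart.yaml (sketch; the version pin is an assumption)
apiVersion: v2
name: eoapi-ingest-argo
version: 0.1.0
dependencies:
  - name: argo-workflows
    version: "0.20.x"                    # check upstream argo-helm for a real pin
    repository: https://argoproj.github.io/argo-helm
```

Running `helm dependency update` would then vendor the Argo Workflows chart locally, similar to how the 2i2c support chart wraps its dependencies.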

sunu commented 1 year ago

@ranchodeluxe I put the ingestion pipeline stuff in a separate repository because I don't have a clear idea about how I want it to be packaged. Also, I'm not sure if it's good enough to be public and in the "official" repo yet.

There are 3 parts to the ingestion pipeline repo:

  1. Argo running in a k8s cluster - we can definitely include that here as a helm chart
  2. A python cli to generate and submit argo workflows from a minimal workflow definition (https://github.com/developmentseed/eoapi-ingestion-argo/tree/main/ingest) through a command like: `eoapi-ingest workflow submit workflows/maxar_opendata/workflow.yaml`.
    • I am not sure whether this python tool belongs in this repo. If we include it here, it should probably live in a separate folder; not in helm-chart/. Ideally, the python cli tool should be available for pip install too.
  3. Dataset-specific workflow definitions - eg: https://github.com/developmentseed/eoapi-ingestion-argo/tree/main/workflows/maxar_opendata
    • this has 2 components:
      • the workflow definition: https://github.com/developmentseed/eoapi-ingestion-argo/blob/main/workflows/maxar_opendata/workflow.yaml
      • and optionally, custom dataset-specific processing code to be injected into the pipeline: https://github.com/developmentseed/eoapi-ingestion-argo/tree/main/workflows/maxar_opendata/src
    • We can probably add some of these examples wherever we put the source code for the python tool? (a hypothetical sketch of such a definition follows this list)
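As a purely hypothetical illustration of what a "minimal workflow definition" for a dataset could contain (the real schema lives in the eoapi-ingestion-argo repo; every key below is invented):

```yaml
# workflows/maxar_opendata/workflow.yaml (hypothetical shape; all keys invented)
name: maxar-opendata-ingest
image: ghcr.io/example/maxar-opendata:latest   # dataset-specific image, assumed
collection: maxar-opendata
source:
  type: s3
  uri: s3://maxar-opendata/                    # assumed source location
steps:
  - discover     # find new items under the source prefix
  - transform    # optional custom code from workflows/<dataset>/src
  - load         # write STAC items into pgstac
```

The python cli from item 2 would then expand a definition like this into a full Argo Workflow and submit it.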

As a side note, @sharkinsspatial shared a recording of a previous eoapi pipeline discussion with me today. And watching that I learned quite a bit more about the specific pain points we are trying to solve and the current state of things. So I would love to discuss some of that and plan our next moves when we get the chance to catch up next time.

ranchodeluxe commented 1 year ago

> @ranchodeluxe I put the ingestion pipeline stuff in a separate repository because I don't have a clear idea about how I want it to be packaged. Also, I'm not sure if it's good enough to be public and in the "official" repo yet.
>
> There are 3 parts to the ingestion pipeline repo:
>
> 1. Argo running in a k8s cluster - we can definitely include that here as a helm chart
>
> 2. A python cli to generate and submit argo workflows from a minimal workflow definition (https://github.com/developmentseed/eoapi-ingestion-argo/tree/main/ingest) through a command like: `eoapi-ingest workflow submit workflows/maxar_opendata/workflow.yaml`.
>
>    * I am not sure whether this python tool belongs in this repo. If we include it here, it should probably live in a separate folder; not in `helm-chart/`. Ideally, the python cli tool should be available for `pip install` too.
>
> 3. Dataset specific workflow definitions - eg: https://github.com/developmentseed/eoapi-ingestion-argo/tree/main/workflows/maxar_opendata
>
>    * this has 2 components:
>      * the workflow definition: https://github.com/developmentseed/eoapi-ingestion-argo/blob/main/workflows/maxar_opendata/workflow.yaml
>      * and optionally, custom dataset specific processing code to be injected into the pipeline: https://github.com/developmentseed/eoapi-ingestion-argo/tree/main/workflows/maxar_opendata/src
>    * We can probably add some of these examples wherever we put the source code for the python tool?
>
> As a side note, @sharkinsspatial shared a recording of a previous eoapi pipeline discussion with me today. And watching that I learned quite a bit more about the specific pain points we are trying to solve and the current state of things. So I would love to discuss some of that and plan our next moves when we get the chance to catch up next time.

Sounds good, I'll let you decide how you want to structure things.

But I guess I don't see any technical limitation to why everything you mention above couldn't just live in a new chart (something like /helm-chart/eoapi-ingest-argo/ in this repo) next to a Chart.yaml file to support those requirements. The cli scripts could just be read from source and mounted as ConfigMaps if needed, or just live there to be executed. The 2i2c example does quite a bit of acrobatics with its dependencies if you take a look at it.
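A minimal sketch of that ConfigMap approach, assuming the cli scripts live in a scripts/ directory inside the chart (the path and names are illustrative):

```yaml
# templates/ingest-scripts-configmap.yaml (sketch; the scripts/ path is an assumption)
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-ingest-scripts
data:
  # Pack every file under the chart's scripts/ directory into the ConfigMap
{{ (.Files.Glob "scripts/*").AsConfig | indent 2 }}
```

Workflow pods could then mount the ConfigMap as a volume and run the scripts directly.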

sunu commented 1 year ago

@ranchodeluxe ah, I think we each have a different deployment flow in mind for these ingestion jobs. Do you imagine each dataset having its own helm chart, with the ingestion jobs deployed through helm?

The deployment model I have in mind is somewhat different: each dataset has a custom docker image with all the scripts needed, and deployment is done through the argo cli (or the python wrapper around it).

sunu commented 1 year ago

@ranchodeluxe I made a draft PR to test submitting ingestion jobs through helm: https://github.com/developmentseed/eoapi-k8s/pull/48. It works, but to me, submitting jobs through argo-cli feels a bit more natural than managing them via helm.

ranchodeluxe commented 1 year ago

> @ranchodeluxe I made a draft PR to test submitting ingestion jobs through helm: #48. It works, but to me, submitting jobs through argo-cli feels a bit more natural than managing them via helm.

Sorry for the confusion @sunu. I wasn't talking about "submitting jobs" per se, but more about the deployment side that your latest update references. Do whatever feels best to you for submission.

ranchodeluxe commented 1 year ago

> @ranchodeluxe ah, I think we both have a different deployment flow in mind for these ingestion jobs. Do you imagine each dataset will have its own helm chart and the ingestion jobs will be deployed through helm?
>
> The deployment model I have in mind is somewhat different where each dataset has a custom docker image with all the scripts needed and the deployment is done through argo cli (or the python wrapper around it).

I'm fine following your proposed approach @sunu. I'm just trying to keep a thousand repos from blooming 👍