NASA-IMPACT / veda-data-airflow

Airflow implementation of ingest pipeline for VEDA STAC data

Concurrency and taskflow refactors #197

Closed ividito closed 3 weeks ago

ividito commented 3 months ago

Summary:

PR is deployed and tested on SIT.

Addresses #192 (and cleans out some tech debt).

Batches of discovered files are traceable through the full ingestion process, and failures can be isolated to individual batches rather than full ingestions.
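The batch-level isolation described above can be sketched in plain Python. This is not the code from this PR and it does not use Airflow; `make_batches`, `ingest_in_batches`, `BatchResult`, and the `run_id`-based batch ID format are hypothetical names chosen for illustration. The idea is only that each batch carries its own identifier through ingestion and that a failure is recorded against that batch instead of aborting the whole run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class BatchResult:
    """Outcome of one batch (hypothetical structure, for illustration only)."""
    batch_id: str                 # identifier carried through every step
    n_files: int                  # number of files in the batch
    error: Optional[str] = None   # set only when the batch failed

    @property
    def ok(self) -> bool:
        return self.error is None


def make_batches(files: Sequence[str], size: int) -> List[List[str]]:
    """Split the discovered files into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(files[i:i + size]) for i in range(0, len(files), size)]


def ingest_in_batches(
    files: Sequence[str],
    size: int,
    ingest_batch: Callable[[str, List[str]], None],
    run_id: str = "run",
) -> List[BatchResult]:
    """Ingest each batch independently and record per-batch success or failure."""
    results: List[BatchResult] = []
    for index, batch in enumerate(make_batches(files, size)):
        batch_id = f"{run_id}-batch-{index}"  # assumed ID format
        try:
            ingest_batch(batch_id, batch)
            results.append(BatchResult(batch_id, len(batch)))
        except Exception as exc:
            # A failing batch is recorded and the remaining batches still run.
            results.append(BatchResult(batch_id, len(batch), error=str(exc)))
    return results
```

In an Airflow DAG the same shape would typically be expressed with one task instance per batch, so a failed batch can be retried on its own; the sketch above only shows the bookkeeping.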

Initial diagram showing the general outline of the changes made: [image: Untitled-2024-02-05-1314(17)]

New discover DAG visualization: [image]

Changes

ividito commented 3 months ago

This PR stacks additional changes on top of the vector ingests, which we're trying to get working in #198. This PR is ready for review, but let's hold off on merging until that PR is merged to main.

smohiudd commented 2 months ago

Testing plan: re-deploy to SIT and test ingestion

slesaad commented 2 months ago

This PR adds a new generic vector dag. Wondering if new updates are needed here to also support the same pattern for that dag?

smohiudd commented 2 months ago

I added the generic vector dag to the new pattern and deployed yesterday. It hasn't been tested yet though for either generic or EIS vector.

slesaad commented 2 months ago

> I added the generic vector dag to the new pattern and deployed yesterday. It hasn't been tested yet though for either generic or EIS vector.

Oh that's great! Thanks!

smohiudd commented 2 months ago

@ranchodeluxe do we have a dev or test features db that we can test an EIS ingest against for these new airflow changes?

ranchodeluxe commented 2 months ago

> @ranchodeluxe do we have a dev or test features db that we can test an EIS ingest against for these new airflow changes?

I literally just deleted it like an hour ago 😆 But if you need me to spin one back up I can do that

smohiudd commented 2 months ago

If it's not too much trouble could you spin it up? We need to test both the generic and EIS ingest with these changes

ranchodeluxe commented 2 months ago

Yeah, will do it after this big demo meeting 👍

ividito commented 1 month ago

We put some more work into testing the vector ingest, to make sure this won't break firelines once it gets to staging. To summarize:

Some changes and notes to make this work in the modified VPC environment:

smohiudd commented 1 month ago

> Vector subnets and SG need to be the same as the RDS hosting features, and the security group needs an inbound rule accepting traffic from itself (this is a bit different on staging, which has a ton of manual SG changes)

This is not necessary. The vector subnets need to reference the private subnets in the vector VPC (the shared base VPC in our case), but the SG should not be the RDS SG: the MWAA variable should use the Terraform-created ECS SG. The SIT deployment wasn't working because the RDS SG didn't have an inbound rule for the ECS SG. That inbound rule is defined in IaC but may have been modified manually.
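A quick way to check for the missing rule is to look at the RDS security group's ingress permissions for a reference to the ECS security group. The sketch below is an assumption-laden illustration, not part of this PR: `allows_ingress_from` is a hypothetical helper, the group IDs are placeholders, and port 5432 is assumed only because it is the PostgreSQL default. It operates on a dict shaped like one entry of `SecurityGroups` in the EC2 `describe_security_groups` response, so it runs without AWS access.

```python
from typing import Any, Dict


def _covers_port(permission: Dict[str, Any], port: int) -> bool:
    """Return True if an ingress permission applies to the given port."""
    if permission.get("IpProtocol") == "-1":
        # "-1" means all protocols and all ports.
        return True
    low, high = permission.get("FromPort"), permission.get("ToPort")
    return low is not None and high is not None and low <= port <= high


def allows_ingress_from(
    security_group: Dict[str, Any], source_group_id: str, port: int
) -> bool:
    """Check whether `security_group` accepts traffic on `port` from
    members of `source_group_id` (hypothetical helper)."""
    for permission in security_group.get("IpPermissions", []):
        if not _covers_port(permission, port):
            continue
        for pair in permission.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == source_group_id:
                return True
    return False


# Placeholder data in the shape of one describe_security_groups entry.
rds_sg = {
    "GroupId": "sg-rds-placeholder",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,  # assumed PostgreSQL default port
            "ToPort": 5432,
            "UserIdGroupPairs": [{"GroupId": "sg-ecs-placeholder"}],
        }
    ],
}

has_rule = allows_ingress_from(rds_sg, "sg-ecs-placeholder", 5432)
```

If `has_rule` is false for the deployed RDS SG, the inbound rule has likely drifted from what IaC defines.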

anayeaye commented 4 weeks ago

I've run a few ingests in the SIT MWAA via the SIT dataset/publish endpoint and the results look good. After we get the automated vector ingest I think we are ready to talk about merging this into dev. It also might be time to pull in the upstream changes again :(.