GSA / data.gov

Main repository for the data.gov service
https://data.gov

Refactor Airflow ETL Pipeline DAG to incorporate feedback changes #4576

Open btylerburton opened 6 months ago

btylerburton commented 6 months ago

User Story

To incorporate updates to the datagov-harvesting-logic API and feedback from the most recent design sessions, the Airflow ETL pipeline DAG needs to be changed so that a DCAT-US record can be fully tested end-to-end.

Related:

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

Reference

diagram

btylerburton commented 5 months ago

Airflow has no problem connecting to CKAN or posting datasets.

Airflow logs Image

CKAN Dev UI Image

btylerburton commented 4 months ago

Running load test from local Docker and seeing similar results to GH Action.


btylerburton commented 4 months ago

The next step is to replicate the CG infrastructure in Staging. This lets us validate a clean setup for a new space and removes our dependency on the local environment.

btylerburton commented 4 months ago

Results of load test.

Date: 02.13.24
Time: 01h 15m 55s

Logs:

[2024-02-13, 18:37:49 UTC] {harvest.py:336} INFO - expected operations to be done
[2024-02-13, 18:37:49 UTC] {harvest.py:337} INFO - {'delete': 0, 'create': 982, 'update': 0}
[2024-02-13, 18:37:49 UTC] {harvest.py:355} INFO - actual operations completed
[2024-02-13, 18:37:49 UTC] {harvest.py:356} INFO - {'deleted': 0, 'updated': 0, 'created': 937, 'nothing': 45}
[2024-02-13, 18:37:49 UTC] {harvest.py:359} INFO - validity of the records
[2024-02-13, 18:37:49 UTC] {harvest.py:360} INFO - {'valid': 982, 'invalid': 0}
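For anyone reading the numbers above, here is a minimal sketch of how the expected and actual operation counts can be reconciled. The two dicts are copied from the log lines. The `KEY_MAP` mapping and the `reconcile` helper are hypothetical names introduced for illustration and are not part of `harvest.py`. The sketch only checks the arithmetic and does not assume what the `nothing` bucket means internally.

```python
import ast

# Dicts copied verbatim from the harvest.py log lines in this comment.
expected = ast.literal_eval("{'delete': 0, 'create': 982, 'update': 0}")
actual = ast.literal_eval(
    "{'deleted': 0, 'updated': 0, 'created': 937, 'nothing': 45}"
)

# Hypothetical mapping between the key spellings used in the two log lines.
KEY_MAP = {"create": "created", "update": "updated", "delete": "deleted"}


def reconcile(expected, actual):
    """Return, per operation, how many expected operations were not completed."""
    return {
        op: expected[op] - actual.get(done, 0)
        for op, done in KEY_MAP.items()
    }


shortfall = reconcile(expected, actual)
print(shortfall)

# True when every record is accounted for in some bucket of the actual counts.
print(sum(actual.values()) == sum(expected.values()))
```

In this run the shortfall in creates is the same size as the `nothing` count, so all 982 records are accounted for even though only 937 were created.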


btylerburton commented 4 months ago

The load test was performed and Airflow handled the job as expected. As our conversation about how we use the tool has evolved, the team has decided to pivot away from Airflow, at least in the interim. The cost of supporting it, both in infrastructure and in time spent learning the platform, outweighs the advantages it was expected to bring. In short, our use case (high throughput, minimal analysis) does not overlap with Airflow's strengths as well as we'd expected.