Open btylerburton opened 6 months ago
Airflow has no problem connecting to CKAN or posting datasets.
Airflow logs
CKAN Dev UI
Running load test from local Docker and seeing similar results to GH Action.
Next steps will be to replicate the CG infrastructure in Staging. This lets us validate a clean setup for a new space and remove the dependency on the local environment.
Results of load test.
Date: 02.13.24 Time: 01h 15m 55s
Logs:
[2024-02-13, 18:37:49 UTC] {harvest.py:336} INFO - expected operations to be done
[2024-02-13, 18:37:49 UTC] {harvest.py:337} INFO - {'delete': 0, 'create': 982, 'update': 0}
[2024-02-13, 18:37:49 UTC] {harvest.py:355} INFO - actual operations completed
[2024-02-13, 18:37:49 UTC] {harvest.py:356} INFO - {'deleted': 0, 'updated': 0, 'created': 937, 'nothing': 45}
[2024-02-13, 18:37:49 UTC] {harvest.py:359} INFO - validity of the records
[2024-02-13, 18:37:49 UTC] {harvest.py:360} INFO - {'valid': 982, 'invalid': 0}
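The counts in the log above can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic, not code from `harvest.py`: the `reconcile` function and the `key_map` pairing of expected and actual keys are hypothetical, and the log does not say what the `nothing` bucket means, so the sketch makes no claim about it beyond adding it to the total.

```python
# Counts copied from the load test log; everything else here is illustrative.
expected = {"delete": 0, "create": 982, "update": 0}
actual = {"deleted": 0, "updated": 0, "created": 937, "nothing": 45}


def reconcile(expected, actual):
    """Compare planned operation counts with completed ones."""
    planned = sum(expected.values())
    completed = sum(actual.values())
    # Assumed pairing of the expected keys with the past-tense actual keys.
    key_map = {"delete": "deleted", "create": "created", "update": "updated"}
    shortfall = {
        op: expected[op] - actual.get(done, 0) for op, done in key_map.items()
    }
    return {
        "planned": planned,
        "completed": completed,
        "shortfall": shortfall,
        # Records not accounted for by any bucket in the actual counts.
        "unaccounted": planned - completed,
    }


report = reconcile(expected, actual)
print(report)
```

With these numbers the totals match (982 planned, 982 accounted for), and the 45 records missing from `created` are exactly the 45 reported under `nothing`.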
The load test was performed, and Airflow handled the job as expected. As our conversation about how we use the tool has evolved, the team has decided to pivot away from Airflow, at least in the interim. The cost of supporting it, both in infrastructure and in time spent learning the platform, outweighs the advantages it was expected to bring. In short, our use case (high throughput, minimal analysis) does not overlap with Airflow's strengths as well as we'd expected.
User Story
To incorporate updates to the datagov-harvesting-logic API and feedback from the most recent design sessions, the Airflow ETL pipeline DAG needs to be changed so that a DCAT-US record can be fully tested end-to-end.
Related:
4577
4578
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
Reference