edanalytics / edu_edfi_airflow

Manages extract-load of Ed-Fi data in Airflow

Feature/pull all deletes #76

Open sleblanc23 opened 1 week ago

sleblanc23 commented 1 week ago

Description & motivation

This PR addresses a potential source of error in our Ed-Fi data pulls. When a record is deleted and a new record is then created with the same natural key, the delete is not returned by an API call that pulls only recent change versions. As a result, we can miss delete records and end up with orphaned records. These are typically handled by the deduplication step in the edu_edfi_source staging models, but we could still surface these records if the newest version is also deleted. To avoid this scenario, we need to pull all of an endpoint's deletes whenever a new delete is recorded.
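As a rough illustration of the rule described above (not the code in this branch), the decision could be sketched as follows. The function name `delete_pull_window`, its parameters, and the use of `0` as the "start of history" change version are all assumptions made for this example.

```python
from typing import Optional, Tuple


def delete_pull_window(
    last_pulled_version: int,
    newest_version: int,
    new_delete_count: int,
) -> Optional[Tuple[int, int]]:
    """Pick the change-version window for an endpoint's deletes pull.

    Hypothetical helper: returns (min_version, max_version), or None when
    there is nothing new to pull.
    """
    # No new change versions since the last run: nothing to do.
    if newest_version <= last_pulled_version:
        return None

    # No deletes recorded in the new window: skip the deletes pull.
    if new_delete_count == 0:
        return None

    # At least one new delete was recorded, so re-pull the endpoint's
    # full delete history instead of only the incremental window.
    # Assumption: 0 stands for "from the beginning".
    return (0, newest_version)
```

For example, `delete_pull_window(100, 150, 3)` widens the window to `(0, 150)`, while `delete_pull_window(100, 150, 0)` skips the deletes pull entirely.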

Internal reference doc

PR Merge Priority:

Changes to existing files:

Tests and QC done

Successfully ran in GSN dev with all three run types. Will run it on a schedule for the next week or so to compare performance to prod.

Questions / discussion points

sleblanc23 commented 3 days ago

@jayckaiser I checked this branch out in GSN for the week and it appears to have no impact on performance there (literally - runs averaged 9:12 for the last 4 days, 9:12 for the 10 days prior to that). GSN, of course, has very small districts that are not all pushing data regularly, so we should probably test in SC for at least a night or two.