Open sleblanc23 opened 1 week ago
@jayckaiser I checked this branch out in GSN for the week and it appears to have no impact on performance there (literally - runs averaged 9:12 for the last 4 days, 9:12 for the 10 days prior to that). GSN of course has very small districts who are not all pushing data regularly, so we should probably test in SC for at least a night or two.
Description & motivation
This PR addresses a potential source of error in our Ed-Fi data pulls. When a record is deleted but then a new record is created with the same natural key, the delete won't be returned by an API call pulling recent change versions. This can cause us to miss delete records and end up with orphan records. These are typically handled by the deduplication step in the edu_edfi_source staging models, but we could still surface these records if the newest version is also deleted. To avoid this scenario, we need to pull all of an endpoint's deletes whenever a new delete is recorded.
Internal reference doc
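For reviewers, the rule described above (pull every delete for an endpoint as soon as any new delete shows up, and replace what was stored before) can be sketched roughly as follows. This is a minimal illustration only, not the code in edfi_resource_dag.py: the names DeletePullPlan, plan_delete_pull, last_change_version, and new_delete_count are hypothetical, and using 0 as the "start from the beginning" change version is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeletePullPlan:
    """Hypothetical summary of what to do with one endpoint's deletes."""
    pull: bool                          # whether to call the deletes endpoint at all
    min_change_version: Optional[int]   # lower bound of the change-version window
    replace_existing: bool              # clear previously stored deletes before loading


def plan_delete_pull(
    last_change_version: int,
    new_delete_count: int,
    pull_all_deletes: bool = True,
) -> DeletePullPlan:
    """Decide how to pull deletes for a single endpoint (illustrative only)."""
    if new_delete_count == 0:
        # No deletes recorded since the last successful run: nothing to pull.
        return DeletePullPlan(pull=False, min_change_version=None, replace_existing=False)

    if pull_all_deletes:
        # At least one new delete: re-pull the full delete history (assumed to
        # start at change version 0) and replace the previously stored records.
        return DeletePullPlan(pull=True, min_change_version=0, replace_existing=True)

    # Incremental behavior: only pull deletes newer than the last run.
    return DeletePullPlan(pull=True, min_change_version=last_change_version, replace_existing=False)


if __name__ == "__main__":
    print(plan_delete_pull(last_change_version=100, new_delete_count=3))
    print(plan_delete_pull(last_change_version=100, new_delete_count=0))
```

The point of the sketch is that the full pull is only triggered when there is something new to pull, so endpoints without recent deletes should cost the same as before.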
PR Merge Priority:
Changes to existing files:
edfi_resource_dag.py: Adds class-level arg pull_all_deletes. When True, all deletes will be pulled for any endpoint with new deletes since the last successful run. Previous delete records are deleted from Snowflake before the new records are copied in.
Tests and QC done
Successfully ran in GSN dev with all three run types. Will run it on a schedule for the next week or so to compare performance to prod.
Questions / discussion points