CodeForPhilly / paws-data-pipeline

PAWS Data Pipeline Project
MIT License

588: Delete invalid data from pdp_contacts #619

Closed carrollsa closed 5 months ago

carrollsa commented 6 months ago

Closes https://github.com/CodeForPhilly/paws-data-pipeline/issues/588

Changes

Considerations

carrollsa commented 6 months ago

Local testing on smaller datasets worked as intended. The full dataset yields the following results. I'm fairly confident we aren't deleting anything unintended, but I'm running another test to be sure.

[images: full-dataset results]

carrollsa commented 6 months ago

I made a printout showing the contacts that would be deleted when running on prod data: image

carrollsa commented 6 months ago

I just updated it to also include "Jane Doe" records, as prod data has 3 of them. There are also instances of first_name = "NONAME", last_name = "(RED FLAG)", and one "julia NONAME".
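
For illustration only, here is a minimal sketch of the kind of name check described above. It is not the code in this PR: the function name `is_placeholder_contact`, the constant names, the case-insensitive matching, and the sample rows are all assumptions made for the example. Only the placeholder values ("Jane Doe", "NONAME", "(RED FLAG)") come from the comment.

```python
# Hypothetical sketch; names and matching rules are assumptions, not the PR's code.

# Full-name pairs treated as placeholders (lowercased for comparison).
PLACEHOLDER_FULL_NAMES = {("jane", "doe")}

# Single name values that mark a record as a placeholder wherever they appear.
PLACEHOLDER_TOKENS = {"noname", "(red flag)"}


def _norm(value):
    """Lowercase and trim a name field, treating None as an empty string."""
    return (value or "").strip().lower()


def is_placeholder_contact(first_name, last_name):
    """Return True if the name pair looks like a placeholder rather than a person."""
    first, last = _norm(first_name), _norm(last_name)
    if (first, last) in PLACEHOLDER_FULL_NAMES:
        return True
    return first in PLACEHOLDER_TOKENS or last in PLACEHOLDER_TOKENS


# Made-up sample rows, not prod data.
contacts = [
    {"first_name": "Jane", "last_name": "Doe"},
    {"first_name": "NONAME", "last_name": "(RED FLAG)"},
    {"first_name": "julia", "last_name": "NONAME"},
    {"first_name": "Maria", "last_name": "Lopez"},
]

# Dry run: collect and print what would be deleted instead of deleting it.
to_delete = [
    c for c in contacts
    if is_placeholder_contact(c["first_name"], c["last_name"])
]
for c in to_delete:
    print(c["first_name"], c["last_name"])
```

Printing the candidates before deleting anything mirrors the printout approach used earlier in this thread, so the list can be reviewed against prod data first.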

carrollsa commented 6 months ago

Latest changes will remove the following:

image