act-now-coalition / can-scrapers

MIT License

Remove duplicates in CDC Testing scraper and fix inheritance in USAFacts scrapers #391

Closed smcclure17 closed 2 years ago

smcclure17 commented 2 years ago

The CDC testing dataset now includes many duplicate rows, and about 65,000 variable entries have more than one value. Some context is here: https://trello.com/c/9xHrKqo4/137-duplicates-in-cdc-testing-dataset-causing-scraper-to-fail

I've removed the duplicate rows, but I wasn't sure how best to handle the variables with multiple entries. Nothing clearly indicated which rows were the "correct" ones, so I've just kept the most recent ones.
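For reviewers, the "keep the most recent" rule can be sketched roughly as below. This is a library-free illustration, not the scraper's actual code: the key fields (`fips`, `dt`, `variable`), the `last_updated` field used to decide recency, and the helper name `keep_most_recent` are all hypothetical, and the real dataset's schema may differ.

```python
from datetime import date

# Hypothetical key fields; the real CDC dataset's columns may differ.
KEY_FIELDS = ("fips", "dt", "variable")


def keep_most_recent(rows):
    """Collapse exact duplicates and keep the newest row for each key."""
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        current = latest.get(key)
        # Replace the stored row only when this one is strictly newer,
        # so exact duplicates collapse to a single entry.
        if current is None or row["last_updated"] > current["last_updated"]:
            latest[key] = row
    return list(latest.values())


def make_row(fips, value, updated):
    """Build a sample row (illustrative data only)."""
    return {
        "fips": fips,
        "dt": date(2022, 1, 1),
        "variable": "pcr_tests_total",
        "value": value,
        "last_updated": updated,
    }


rows = [
    make_row(1001, 10, date(2022, 1, 2)),
    make_row(1001, 10, date(2022, 1, 2)),  # exact duplicate
    make_row(1001, 12, date(2022, 1, 5)),  # same key, newer value
    make_row(1003, 7, date(2022, 1, 2)),
]
deduped = keep_most_recent(rows)
```

The trade-off is that an older row is silently discarded even if it happened to be the correct one, which is the uncertainty noted above.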

Previously, USAFactsDeaths inherited from USAFactsCases, so calling super().__init__() ran the __init__ of USAFactsCases, which in turn called EtagCacheMixin.initialize_cache() inside the cases scraper object. As a result, the deaths scraper ended up with the same cache_file as the cases scraper. This PR restructures things so that the cases and deaths scrapers both inherit from an abstract base class, which removes the issue.
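A minimal sketch of the restructured hierarchy, under stated assumptions: `EtagCacheMixin` here is a simplified stand-in (the real `initialize_cache()` signature may differ), and the base class name `USAFactsBase`, the `cache_file_name` attribute, and the file names are hypothetical.

```python
from abc import ABC, abstractmethod


class EtagCacheMixin:
    """Simplified stand-in for the real mixin; it only records the cache file."""

    def initialize_cache(self, cache_file):
        self.cache_file = cache_file


class USAFactsBase(EtagCacheMixin, ABC):
    """Hypothetical shared base: common setup lives here, cache naming does not."""

    def __init__(self):
        # Each concrete scraper supplies its own name, so caches never collide.
        self.initialize_cache(cache_file=self.cache_file_name)

    @property
    @abstractmethod
    def cache_file_name(self):
        """Name of the cache file; must be overridden by each scraper."""


class USAFactsCases(USAFactsBase):
    cache_file_name = "usafacts_cases.txt"  # hypothetical file name


class USAFactsDeaths(USAFactsBase):
    cache_file_name = "usafacts_deaths.txt"  # hypothetical file name


cases = USAFactsCases()
deaths = USAFactsDeaths()
```

Because neither scraper is a subclass of the other, `super().__init__()` can no longer pick up a sibling's cache settings, and instantiating the base class directly raises `TypeError`.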