Previously, USAFactsDeaths inherited from USAFactsCases, so when super().__init__() was called it ran the USAFactsCases initializer, which in turn called EtagCacheMixin.initialize_cache() inside the cases scraper class. As a result, the deaths scraper ended up with the same cache_file as the cases scraper. This PR restructures things so that the cases and deaths scrapers both inherit from an abstract base class, which removes the issue.
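To illustrate the problem and the fix, here is a minimal sketch rather than the actual scraper code. The mixin below is a simplified stand-in for EtagCacheMixin, and the names OldCases, OldDeaths, USAFactsBase, cache_name, and the file names are all hypothetical, assumed only for this example.

```python
from abc import ABC, abstractmethod


class EtagCacheMixin:
    """Simplified stand-in for the real mixin; only cache-file naming is modeled."""

    def initialize_cache(self, cache_file: str) -> None:
        self.cache_file = cache_file


# --- Old structure (hypothetical reconstruction of the bug) ---
class OldCases(EtagCacheMixin):
    def __init__(self) -> None:
        # The cache file name is fixed inside the cases initializer.
        self.initialize_cache(cache_file="usafacts_cases.txt")


class OldDeaths(OldCases):
    def __init__(self) -> None:
        # Runs OldCases.__init__, so the deaths scraper reuses the cases cache file.
        super().__init__()


# --- New structure: both scrapers share an abstract base ---
class USAFactsBase(EtagCacheMixin, ABC):
    def __init__(self) -> None:
        # Each concrete subclass supplies its own cache file name.
        self.initialize_cache(cache_file=self.cache_name)

    @property
    @abstractmethod
    def cache_name(self) -> str:
        """File name used for this scraper's cache."""


class USAFactsCases(USAFactsBase):
    cache_name = "usafacts_cases.txt"


class USAFactsDeaths(USAFactsBase):
    cache_name = "usafacts_deaths.txt"


old_shared = OldDeaths().cache_file == OldCases().cache_file
new_shared = USAFactsDeaths().cache_file == USAFactsCases().cache_file
```

With the old layout the two scrapers end up pointing at one cache file; with the shared abstract base each subclass has to declare its own, and the base class itself cannot be instantiated by accident.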
The CDC testing dataset now includes a lot of duplicate rows and has multiple values for some (about 65,000) variable entries. Some context is here: https://trello.com/c/9xHrKqo4/137-duplicates-in-cdc-testing-dataset-causing-scraper-to-fail
I've removed the duplicate rows, but I wasn't sure how to best handle the variables with multiple entries. There didn't seem to be a clear indication of which were the "correct" rows, so I've just kept the most recent ones.
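The two-step cleanup described above could look roughly like the sketch below. It is not the scraper's actual code: the field names (location, dt, variable, value, last_updated) are hypothetical, and it assumes "most recent" is decided by a timestamp-like field holding ISO date strings.

```python
def dedupe_keep_latest(rows, key_fields, recency_field):
    """Drop exact duplicates, then keep the most recent row per key."""
    # Step 1: drop rows that are identical in every field.
    seen = set()
    unique_rows = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(row)

    # Step 2: where a key still has several rows, keep the newest one.
    latest = {}
    for row in unique_rows:
        key = tuple(row[field] for field in key_fields)
        if key not in latest or row[recency_field] > latest[key][recency_field]:
            latest[key] = row
    return list(latest.values())


# Made-up sample data, for illustration only.
rows = [
    {"location": "01", "dt": "2021-01-01", "variable": "tests_total",
     "value": 100, "last_updated": "2021-01-02"},
    {"location": "01", "dt": "2021-01-01", "variable": "tests_total",
     "value": 100, "last_updated": "2021-01-02"},  # exact duplicate
    {"location": "01", "dt": "2021-01-01", "variable": "tests_total",
     "value": 120, "last_updated": "2021-01-05"},  # newer, conflicting value
    {"location": "02", "dt": "2021-01-01", "variable": "tests_total",
     "value": 50, "last_updated": "2021-01-02"},
]

cleaned = dedupe_keep_latest(
    rows, key_fields=("location", "dt", "variable"), recency_field="last_updated"
)
```

If the data lives in a pandas DataFrame, sorting by the recency column and calling drop_duplicates with subset set to the key columns and keep="last" would be one way to get the same effect.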