ccodwg / Covid19CanadaArchive

Canadian COVID-19 Data Archive
https://opencovid.ca

Remove duplicate files #183

Closed by jeanpaulrsoucy 7 months ago

jeanpaulrsoucy commented 3 years ago

This issue was created to centralize various issues around deleting duplicate files.

Due to an error, the archival tool ran mid-day on 2021-02-02 in addition to the usual 10 pm ET nightly update. As a result, there are duplicate files for every dataset that was in datasets.json at that time. Careful use of the AWS CLI should be enough to purge or move these unnecessary files. In some cases, there may genuinely be two versions of a dataset for a particular date (e.g., due to mid-day updates or corrections). Since February 2, 2021 was a Tuesday, these files are unlikely to have misidentified "corrected" dates (as they might for datasets that don't normally update on the weekend). Any dataset with two unique versions for 2021-02-02 should be handled carefully.
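To make the distinction between byte-identical copies and genuinely different versions concrete, here is a minimal Python sketch of how a cleanup plan could be drafted before touching anything with the AWS CLI. The record fields (`dataset`, `date`, `key`, `md5`) and the function name `plan_dedup` are hypothetical and not part of the archival tool, and keeping the first key in sorted order is an arbitrary choice made for illustration.

```python
from collections import defaultdict


def plan_dedup(records):
    """Draft a cleanup plan from file records.

    Each record is a dict with hypothetical fields:
    'dataset', 'date' (YYYY-MM-DD), 'key' (file path), 'md5' (content hash).
    Returns (to_delete, needs_review).
    """
    # Group file keys by (dataset, date), then by content hash.
    by_day = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_day[(rec["dataset"], rec["date"])][rec["md5"]].append(rec["key"])

    to_delete = []     # byte-identical extra copies
    needs_review = []  # (dataset, date) pairs with 2+ distinct versions
    for day, versions in by_day.items():
        if len(versions) > 1:
            # Two unique versions on one date: leave for manual handling.
            needs_review.append(day)
        for keys in versions.values():
            # Keep the first key in sorted order; the rest are redundant.
            to_delete.extend(sorted(keys)[1:])
    return sorted(to_delete), sorted(needs_review)
```

Nothing is deleted by this sketch. It only separates files that look safe to remove from dates that need a closer look, so the output can be checked by hand first.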

As above. Note that one file (can/vaccination-coverage-keypops/vaccination-coverage-keypops.csv) failed in this earlier update (4 failures total) but did not fail in the update that was run intentionally (3 failures total; 2021-07-23-xx).

Prod was run twice on this day, since an unreliable Internet connection caused unpredictable failures in the night's update. Thus, most of the files have two versions. The duplicated versions have timestamps of 22-xx or 23-xx, whereas the second run has timestamps of 22-xx or 23-xx.

As above. Original files have timestamps of 22-xx and the second run files have timestamps of 23-xx.
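When the two runs can be told apart by the hour in the timestamp, the split can be sketched as below. The filename pattern `<name>_YYYY-MM-DD_HH-MM.<ext>` is an assumption made for illustration and may not match the archive's actual naming; `split_by_hour` is a hypothetical helper.

```python
import re

# Assumed (hypothetical) filename pattern: <name>_YYYY-MM-DD_HH-MM.<ext>
STAMP = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{2})-(\d{2})\.[^.]+$")


def split_by_hour(filenames, original_hour=22, rerun_hour=23):
    """Partition filenames into (original, rerun, other) by timestamp hour."""
    original, rerun, other = [], [], []
    for name in filenames:
        match = STAMP.search(name)
        hour = int(match.group(2)) if match else None
        if hour == original_hour:
            original.append(name)
        elif hour == rerun_hour:
            rerun.append(name)
        else:
            # No recognizable timestamp, or an unexpected hour.
            other.append(name)
    return original, rerun, other
```

Anything that lands in `other` does not fit the expected pattern and is worth inspecting by hand instead of assigning it to either run.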

As above.

As above.

As above.

As above.

As above.

As above.

These duplicates result from re-running some HTML files with explicit JS after commit cbd38a25b9f73c811cf98d944476c07b373813e3.

A partial list of these files is included below.

jeanpaulrsoucy commented 3 years ago

Will removing duplicates (not counting duplicates past the final version of the file) affect the calculation of corrected dates?

mschoettle commented 3 years ago

QC's synthese-7jours.csv, cas-region.csv and cas-region-7jours.csv are not shown on the data page anymore and therefore will lead to duplicate files as well.

Since July 12, 2021, the summary table has only been published in the balance sheet press release on Mondays and upon return from public holidays, specifically to present the data for the previous days.

jeanpaulrsoucy commented 2 years ago

One specific odd case: on/toronto-cases

All the files are identical (md5: 2563dbcdbccf86f595034b455d9e0e9b), but they were added in different ways (manual upload, copy, usual script), which produced different ETags.
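This case suggests comparing files by a hash of their contents and not by ETag, since an S3 ETag is not guaranteed to be the MD5 of the object (for example, for multipart uploads). A minimal sketch using only the Python standard library, assuming the files have been downloaded locally; the helper names `content_md5` and `group_identical` are hypothetical:

```python
import hashlib
from collections import defaultdict


def content_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Map each content hash to the paths sharing it (groups of 2+ only)."""
    groups = defaultdict(list)
    for path in paths:
        groups[content_md5(path)].append(path)
    return {h: sorted(ps) for h, ps in groups.items() if len(ps) > 1}
```

Files that end up in the same group are byte-identical regardless of how they were uploaded.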

jeanpaulrsoucy commented 7 months ago

Should be solved by #267.