ccodwg / Covid19CanadaArchive

Canadian COVID-19 Data Archive
https://opencovid.ca

Retire S3 bucket #301

Closed jeanpaulrsoucy closed 5 months ago

jeanpaulrsoucy commented 6 months ago

Here is the final reported bucket size for data.opencovid.ca according to the Metrics tab:

Number of objects: 193,833
Size: 243.9 GB
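
As a cross-check, the same totals are available from the CLI itself: adding --summarize to a recursive aws s3 ls appends Total Objects and Total Size lines to the end of the listing (output omitted here):

> aws s3 ls s3://data.opencovid.ca/archive --recursive --summarize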

Here is the number of objects reported by the S3 CLI:

> aws s3 ls s3://data.opencovid.ca/archive --recursive | wc -l 
193833

> aws s3api list-objects-v2 --bucket data.opencovid.ca --prefix archive/ --query 'Contents[].{Key: Key}' --output text | wc -l
193833

In the local mirror of the S3 bucket, the following results are returned:

> find . -type f | wc -l && find . -type f -exec stat --format=%s {} + | awk '{s+=$1} END {printf "%.2f GB\n", s/1024/1024/1024}'
193798
243.98 GB

Are there missing files in the local mirror? Let's list the objects unique to S3.

> comm -23 <(aws s3 ls s3://data.opencovid.ca/archive/ --recursive | awk '{print $4}' | sed 's|^archive/||' | sort) <(find . -type f | sed 's|^\./||' | sort)
ab/ab-provincial-summary-webpage/
ab/case-time-series-by-lga/
ab/vaccine-coverage-by-lga/
bc/7-day-and-cumulative-cases-by-hsda/
bc/7-day-and-cumulative-cases-by-hsda-2/
bc/bc-covid-data-webpage/
bc/case-testing-vaccine-summary-by-CHSA-and-LHA/
bc/testing-timeseries-by-rha-2/
bc/voc-time-series-by-rha/
can/aefi-weekly-summary-by-serious-event-type-old/
can/aefi-weekly-summary-old/
can/phac-weekly-epidemiological-report-english/
can/phac-weekly-epidemiological-report-french/
nb/vaccine-coverage-by-age/
nb/vaccine-time-series/
nl/cumulative-vaccination-2/
nl/vaccine-doses-received-and-expected/
ns/weekly-data/
nt/nwt-dashboard-cases-webpage/
nu/nunavut-vaccination-table/
on/deaths-involving-covid-by-fatality-type/
on/deaths-involving-covid-by-fatality-type/supplementary/
on/deaths-involving-covid-by-vaccination-status/
on/deaths-involving-covid-by-vaccination-status/supplementary/
on/ices-vaccine-coverage-by-age-group-and-fsa/
on/on-phu-york-individual-level-case-data/supplementary/
on/on-phu-york-individual-level-case-data/supplementary/Technical
on/ottawa-community-outbreaks-json/
on/ottawa-wards-cases-cumulative/
on/rapid-testing-participating-locations-csv/
on/toronto-active-outbreaks/supplementary/Workplace
on/toronto-covid-summary/supplementary/External
on/toronto-daily-status/supplementary/External
on/toronto-ethno-racial-income/supplementary/Technical
on/toronto-monitoring-dashboard/supplementary/Toronto
on/toronto-neighbourhood-data/supplementary/External
on/toronto-neighbourhood-test-data/supplementary/External
other/can/healthy-debate-vaccine-rollout-dataset/
other/can/healthy-debate-vaccine-rollout-summary/
pe/vaccine-data-cumulative/
qc/first-vaccine-dose-appointments-by-age-group/
qc/first-vaccine-dose-appointments-by-rss/
qc/inspq-data-webpage/
qc/montreal-vaccine-administration-time-series-v2/
qc/vaccination-by-age-group/
qc/vaccine-doses-admin-by-age-group-time-series/
qc/vaccine-doses-admin-by-rss-time-series/
qc/variant-screening-time-series-by-rss-cumulative/
qc/variant-screening-time-series-by-rss-weekly/
sk/covid-weekly-epi-report-news-release-webpage/
sk/sk-13-zones-map/
sk/sk-32-subzones-map/
sk/vaccination-by-region-highlights-charts-tables/
sk/vaccination-by-region-highlights-charts-tables-legacy/
yt/yukon-vaccine-tracker-webpage/

So a handful of zero-byte "folder" placeholder objects (keys ending in "/") are being counted as objects. The few listed entries that do not end in "/" appear to be keys containing spaces, truncated at the first space by awk '{print $4}'; the comparison script further below parses the listing more carefully.
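
A spot check on one of these keys should confirm a zero-byte placeholder; if so, the response will report "ContentLength": 0 (output omitted here):

> aws s3api head-object --bucket data.opencovid.ca --key archive/ab/ab-provincial-summary-webpage/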

Let's modify the original command to list the objects on S3 excluding keys ending in "/", i.e. the folder objects.

> aws s3 ls s3://data.opencovid.ca/archive --recursive | grep -v '/$' | wc -l
193785

It turns out the local mirror actually has 13 extra files (193,798 versus the 193,785 objects on S3). These are hidden temporary files that were never properly removed; they can be cleaned up by syncing with the --delete flag. After this, the number of files matches:
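
A sync along these lines removes the stray local files (paths assumed from the commands above; a --dryrun pass first is prudent, since --delete removes anything in the destination that is absent from the source):

> aws s3 sync s3://data.opencovid.ca/archive . --delete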

> find . -type f | wc -l && find . -type f -exec stat --format=%s {} + | awk '{s+=$1} END {printf "%.2f GB\n", s/1024/1024/1024}'
193785
243.88 GB

The file counts match. We can also verify that the file lists themselves match:

S3_BUCKET_PATH="s3://data.opencovid.ca/archive"
LOCAL_DIRECTORY_PATH="."
S3_FILE_LIST=$(mktemp)
LOCAL_FILE_LIST=$(mktemp)

# List S3 keys: blank out the date, time, and size columns (so keys containing
# spaces survive, unlike awk '{print $4}'), drop folder objects ending in "/",
# and strip the archive/ prefix so keys are comparable to local paths
aws s3 ls "${S3_BUCKET_PATH}/" --recursive | awk '{$1=$2=$3=""; sub(/^ +/, ""); print}' | grep -v '/$' | sed 's|^archive/||' | sort > "${S3_FILE_LIST}"

# List local files relative to the mirror root
find "${LOCAL_DIRECTORY_PATH}" -type f | sed 's|^\./||' | sort > "${LOCAL_FILE_LIST}"

# comm requires sorted input: -12 keeps lines common to both,
# -23 keeps lines only in the first file, -13 only in the second
comm -12 "${S3_FILE_LIST}" "${LOCAL_FILE_LIST}" > ~/Desktop/common_files.txt
comm -23 "${S3_FILE_LIST}" "${LOCAL_FILE_LIST}" > ~/Desktop/only_in_s3.txt
comm -13 "${S3_FILE_LIST}" "${LOCAL_FILE_LIST}" > ~/Desktop/only_in_local.txt

echo "Files in both S3 and local: $(wc -l < ~/Desktop/common_files.txt)"
echo "Files only in S3 (missing locally): $(wc -l < ~/Desktop/only_in_s3.txt)"
echo "Files only in local (not in S3): $(wc -l < ~/Desktop/only_in_local.txt)"

The result:

Files in both S3 and local: 193785
Files only in S3 (missing locally): 0
Files only in local (not in S3): 0

The same results are obtained on the backup of ccodwg-archive.
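
For completeness, one way to confirm that two local copies agree byte-for-byte is a checksum-based rsync dry run (paths here are hypothetical): with -c, rsync compares file contents by checksum rather than size and modification time, and -n makes it a dry run, so any differences are only reported, never applied.

> rsync -rcn --delete --itemize-changes /path/to/mirror/ /path/to/backup/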

Finally, after uploading all of the Archive's files to archive.org, I ran a variety of other data checks (Python code: ia-checks.txt). Each check is a boolean column in the results DataFrame df, so a check passes when its column sums to the number of rows, i.e. every row is True:

>>> print('Integrity:', df['integrity'].sum() == len(df))
Integrity: True
>>> print('Checksums:', df['checksum'].sum() == len(df))
Checksums: True
>>> print('Contents checksums:', df['contents_checksums'].sum() == len(df))
Contents checksums: True
>>> print('Contents file sizes:', df['contents_file_sizes'].sum() == len(df))
Contents file sizes: True
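
The linked ia-checks.txt is not reproduced above, but as a rough illustration of the kind of comparison involved: archive.org's metadata API exposes a per-file MD5 for each item, which can be checked against local copies. A minimal sketch, assuming the internetarchive CLI (ia) and jq are installed and run from the directory holding a given item's files, with <identifier> as a placeholder:

> ia metadata <identifier> | jq -r '.files[] | select(.source == "original") | "\(.md5)  \(.name)"' | md5sum -c -

The select(.source == "original") filter skips archive.org's derivative files, which have no local counterpart.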