It turns out the local mirror actually has extra files: hidden temporary files that were not properly removed. They can be removed by syncing with the --delete flag. After this, the number of files matches:
> find . -type f | wc -l && find . -type f -exec stat --format=%s {} + | awk '{s+=$1} END {printf "%.2f GB\n", s/1024/1024/1024}'
193785
243.88 GB
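As a small illustration of why the local count was higher before the cleanup: `find -type f` also counts dot-files, so leftover hidden files inflate the total. The sketch below runs on a throwaway directory; the file names (`a.csv`, `b.csv`, `.a.csv.tmp`) are hypothetical and not taken from the archive.

```shell
# Build a throwaway directory with two data files and one hidden leftover.
demo_dir=$(mktemp -d)
printf 'x\n' > "${demo_dir}/a.csv"
printf 'y\n' > "${demo_dir}/b.csv"
printf 'z\n' > "${demo_dir}/.a.csv.tmp"   # hypothetical hidden temporary file

# Count every regular file, hidden ones included (same approach as above).
total_count=$(find "${demo_dir}" -type f | wc -l | tr -d ' ')

# Count only hidden regular files (base name starts with a dot).
hidden_count=$(find "${demo_dir}" -type f -name '.*' | wc -l | tr -d ' ')

echo "all files: ${total_count}, hidden: ${hidden_count}"

# To preview what a cleanup sync would delete without changing anything,
# the AWS CLI offers --dryrun, e.g. (not run here, needs credentials):
#   aws s3 sync s3://data.opencovid.ca/archive . --delete --dryrun

rm -rf "${demo_dir}"
```

Running a dry run first is a cautious choice, since `--delete` removes every local file that has no counterpart in the bucket.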
The file counts match. We can also ensure that the file lists themselves match.
S3_BUCKET_PATH="s3://data.opencovid.ca/archive"
LOCAL_DIRECTORY_PATH="."
S3_FILE_LIST=$(mktemp)
LOCAL_FILE_LIST=$(mktemp)
aws s3 ls "${S3_BUCKET_PATH}/" --recursive | awk '{$1=$2=$3=""; sub(/^ +/, ""); print}' | grep -v '/$' | sed 's|^archive/||' | sort > "${S3_FILE_LIST}"
find "${LOCAL_DIRECTORY_PATH}" -type f | sed 's|^\./||' | sort > "${LOCAL_FILE_LIST}"
comm -12 "${S3_FILE_LIST}" "${LOCAL_FILE_LIST}" > ~/Desktop/common_files.txt
comm -23 "${S3_FILE_LIST}" "${LOCAL_FILE_LIST}" > ~/Desktop/only_in_s3.txt
comm -13 "${S3_FILE_LIST}" "${LOCAL_FILE_LIST}" > ~/Desktop/only_in_local.txt
echo "Files in both S3 and local: $(wc -l < ~/Desktop/common_files.txt)"
echo "Files only in S3 (missing locally): $(wc -l < ~/Desktop/only_in_s3.txt)"
echo "Files only in local (not in S3): $(wc -l < ~/Desktop/only_in_local.txt)"
The result:
Files in both S3 and local: 193785
Files only in S3 (missing locally): 0
Files only in local (not in S3): 0
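To make the three `comm` modes used above concrete, here is a minimal sketch on two tiny made-up listings (the file names are hypothetical). `comm` expects both inputs to be sorted in the same collation order, so the sketch pins `LC_ALL=C` for both `sort` and `comm`; that setting is an assumption of this example, not something the original commands do.

```shell
# Two small, made-up file listings standing in for the S3 and local lists.
s3_list=$(mktemp)
local_list=$(mktemp)
printf '%s\n' 'b.csv' 'a.csv' 'c.csv' | LC_ALL=C sort > "${s3_list}"
printf '%s\n' 'a.csv' 'd.csv' 'c.csv' | LC_ALL=C sort > "${local_list}"

# -12 suppresses columns 1 and 2, leaving lines common to both files.
in_both=$(LC_ALL=C comm -12 "${s3_list}" "${local_list}")
# -23 leaves lines unique to the first file (only in S3).
only_s3=$(LC_ALL=C comm -23 "${s3_list}" "${local_list}")
# -13 leaves lines unique to the second file (only local).
only_local=$(LC_ALL=C comm -13 "${s3_list}" "${local_list}")

echo "both: $(echo "${in_both}" | tr '\n' ' ')"
echo "only in S3: ${only_s3}"
echo "only local: ${only_local}"

rm -f "${s3_list}" "${local_list}"
```

An empty "only in S3" and "only local" result, as in the real run, means the two sorted lists are identical line for line.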
The same results are obtained on the backup of ccodwg-archive.
Finally, I ran a variety of other data checks after uploading all of the Archive's files to archive.org. Python code: ia-checks.txt
Here is the final reported bucket size for data.opencovid.ca according to the Metrics tab: Number of objects: 193,833; Size: 243.9 GB
Here is the number of objects reported by the S3 CLI:
In the local mirror of the S3 bucket, the following results are returned:
Are there missing files in the local mirror? Let's list the objects unique to S3.
Seemingly a handful of random folders are counted as objects.
Let's modify the original commands to list objects on S3 excluding those ending in "/", which are folder objects.
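A minimal sketch of that filtering step, run on a made-up three-line listing rather than real bucket output. The sample rows only assume the usual `aws s3 ls --recursive` layout (date, time, size, key); the keys themselves are hypothetical.

```shell
# Made-up sample in the layout printed by `aws s3 ls --recursive`:
# date, time, size in bytes, key. The first row is a zero-byte folder object.
sample_listing=$(cat <<'EOF'
2023-01-01 00:00:00          0 archive/ab/
2023-01-01 00:00:01       1234 archive/ab/file1.csv
2023-01-01 00:00:02       5678 archive/cd/file2.csv
EOF
)

# Keep only the key column, then drop keys ending in "/" (folder objects).
file_keys=$(printf '%s\n' "${sample_listing}" \
  | awk '{$1=$2=$3=""; sub(/^ +/, ""); print}' \
  | grep -v '/$')

# Compare the raw object count with the count of real files.
object_count=$(printf '%s\n' "${sample_listing}" | wc -l | tr -d ' ')
file_count=$(printf '%s\n' "${file_keys}" | wc -l | tr -d ' ')

echo "objects: ${object_count}, files: ${file_count}"
```

One caveat of this approach: blanking fields in `awk` rebuilds the line with single spaces, so a key containing runs of consecutive spaces would come out altered.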