Closed aueam closed 2 years ago
Reviewing this, but the diff does show a discrepancy of 4423 files when excluding the trailing slash: s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020/
Did you mean cvpr_challenge_2021? Because cvpr_paper_2020 always shows me 9 objects, with and without the trailing /:
[example@example]$ aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020/ --recursive --human-readable --summarize --no-sign-request
2021-03-08 22:42:10 0 Bytes agriculture-vision/cvpr_paper_2020/
2021-03-12 12:22:25 45.4 KiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision Dataset Terms of Use.pdf
2021-03-12 12:22:24 5.5 MiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision- A Large Aerial Image Database for Agricultural Pattern Analysis.pdf
2021-03-12 12:22:25 1.8 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_miniscale.tar.gz
2021-03-12 12:22:25 13.9 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_splits.json
2021-03-12 12:22:25 12.7 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_miniscale.tar.gz
2021-03-12 12:22:25 29.4 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_splits.json
2021-03-12 12:22:25 5.1 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_miniscale.tar.gz
2021-03-12 12:22:25 27.2 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_splits.json
Total Objects: 9
Total Size: 19.6 GiB
[example@example]$ aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020 --recursive --human-readable --summarize --no-sign-request
2021-03-08 22:42:10 0 Bytes agriculture-vision/cvpr_paper_2020/
2021-03-12 12:22:25 45.4 KiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision Dataset Terms of Use.pdf
2021-03-12 12:22:24 5.5 MiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision- A Large Aerial Image Database for Agricultural Pattern Analysis.pdf
2021-03-12 12:22:25 1.8 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_miniscale.tar.gz
2021-03-12 12:22:25 13.9 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_splits.json
2021-03-12 12:22:25 12.7 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_miniscale.tar.gz
2021-03-12 12:22:25 29.4 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_splits.json
2021-03-12 12:22:25 5.1 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_miniscale.tar.gz
2021-03-12 12:22:25 27.2 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_splits.json
Total Objects: 9
Total Size: 19.6 GiB
Could you please be more specific about those discrepancies?
@aueam, sorry, I meant 2021, yes.
Just checking the file names against each other, regardless of the exact path, we can diff the listing for the bare (no trailing slash) path against the concatenated listings for the trailing / path and the _full path.
# Compare the bare-prefix listing with the concatenated "/" and "_full" listings,
# then print the names of files that appear only in the bare-prefix listing.
diff -c \
  <( aws s3 ls "s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021" --recursive --human-readable --summarize --no-sign-request | grep -Eo "^\s?[0-9]{4}.+" ) \
  <( cat \
      <( aws s3 ls "s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021/" --recursive --human-readable --summarize --no-sign-request ) \
      <( aws s3 ls "s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full" --recursive --human-readable --summarize --no-sign-request ) | grep -Eo "^\s?[0-9]{4}.+" ) \
  | grep -E "^-" \
  | awk '{ print substr($0, index($0,$6)) }' \
  | grep -Eoi "[^/]+.[a-z0-9]{2,4}$" \
  | sort | uniq | grep -v "$(printf '\t')"
Since this yields no results, there are 0 net new files in the 914.1 GiB total vs. the 615.9 GiB total. So you're correct, 615.9 GiB is the right size.
@dkkapur so if we're scoping by directory within a bucket, we want to make sure that we always include a trailing / in the path.
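For anyone following along, here is a minimal sketch of why the trailing / matters. It assumes that an S3 prefix behaves as a plain string match on the object key, and it uses a hypothetical list of keys (the file names below are made up for illustration, not taken from the bucket) so it runs without network access.

```shell
#!/usr/bin/env bash
# Hypothetical sample keys, loosely modelled on the layout discussed in this thread.
keys='agriculture-vision/cvpr_challenge_2021/train.tar.gz
agriculture-vision/cvpr_challenge_2021/val.tar.gz
agriculture-vision/cvpr_challenge_2021_full/all.tar.gz
agriculture-vision/cvpr_paper_2020/Dataset/data2017_splits.json'

# count_prefix PREFIX: count the keys that start with PREFIX.
# This is a plain string comparison; no directory semantics are involved.
count_prefix() {
  printf '%s\n' "$keys" | awk -v p="$1" 'index($0, p) == 1 { n++ } END { print n + 0 }'
}

count_prefix 'agriculture-vision/cvpr_challenge_2021'    # prints 3 (also matches _full)
count_prefix 'agriculture-vision/cvpr_challenge_2021/'   # prints 2
```

Without the slash, the prefix is also a prefix of every cvpr_challenge_2021_full key, which is how those objects end up counted twice.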
@aueam, the dataset is updated with the new size! (You might have to refresh the web app to see the latest changes.)
Thanks for reporting this issue.
Dataset information
Dataset name: AgricultureVision
Dataset slug: agriculturevision
Dataset state:
Your handle on Filecoin Slack: Maros Telka
Change requested
Correct dataset size:
[ ] 914.1 GiB
[x] 615.9 GiB
Why is this change being requested?:
Because 914.1 GiB is simply wrong.
How I think the bad number came about (awscli):
aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021 --recursive --human-readable --summarize --no-sign-request
output: 596.3 GiB (here it silently counted cvpr_challenge_2021_full as well)
aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full --recursive --human-readable --summarize --no-sign-request
output: 298.2 GiB
aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020 --recursive --human-readable --summarize --no-sign-request
output: 19.6 GiB
596.3 + 298.2 + 19.6 = 914.1 GiB
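As a quick sanity check on the figures in this issue, the sketch below redoes the arithmetic. The variable and function names are only illustrative; the sizes are the ones reported here.

```shell
#!/usr/bin/env bash
# Sizes in GiB as reported in this issue.
bare=596.3    # cvpr_challenge_2021 listed without a trailing slash
full=298.2    # cvpr_challenge_2021_full
paper=19.6    # cvpr_paper_2020

# add A B C: print A + B + C rounded to one decimal place.
add() { awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.1f\n", a + b + c }'; }

challenge=$(add "$bare" "-$full" 0)               # bare listing minus the _full objects
wrong_total=$(add "$bare" "$full" "$paper")       # _full is counted twice here
right_total=$(add "$challenge" "$full" "$paper")  # each prefix counted once

echo "challenge only:       $challenge GiB"
echo "double-counted total: $wrong_total GiB"
echo "corrected total:      $right_total GiB"
```

The difference between the two totals is exactly the 298.2 GiB of cvpr_challenge_2021_full, which is consistent with that prefix having been counted twice.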
What I think is the right way (awscli):
aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021/ --recursive --human-readable --summarize --no-sign-request
output: 298.1 GiB
aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full/ --recursive --human-readable --summarize --no-sign-request
output: 298.2 GiB
aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020/ --recursive --human-readable --summarize --no-sign-request
output: 19.6 GiB
298.1 + 298.2 + 19.6 = 615.9 GiB
It would be good to append a trailing / to all the bucket paths (all in slingshot v3) and then recalculate all dataset sizes.
Thank you for the fast correction.
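One possible way to apply that suggestion is sketched below. The helper name with_trailing_slash is hypothetical, and the loop only prints the aws commands rather than running them, so nothing here touches the network.

```shell
#!/usr/bin/env bash
# with_trailing_slash URI: print URI, appending "/" only if it is missing.
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

for prefix in \
  s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021 \
  s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full \
  s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020/
do
  # Print the command instead of executing it; drop "echo" to actually run it.
  echo aws s3 ls "$(with_trailing_slash "$prefix")" --recursive --human-readable --summarize --no-sign-request
done
```

Normalizing the prefix before every listing means a sibling prefix such as cvpr_challenge_2021_full can no longer be picked up by accident.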