data-preservation-programs / slingshot

Official public repository for feedback and data collection in Filecoin Slingshot
https://slingshot.filecoin.io

[v3] some datasets are smaller than is written ("/" problem) #542

Closed aueam closed 1 year ago

aueam commented 1 year ago

Dataset information

How I think the bad number came about (awscli):

aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021 --recursive --human-readable --summarize --no-sign-request
output: 596.3 GiB (without the trailing /, this silently also counted cvpr_challenge_2021_full)

aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full --recursive --human-readable --summarize --no-sign-request
output: 298.2 GiB

aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020 --recursive --human-readable --summarize --no-sign-request
output: 19.6 GiB

596.3 + 298.2 + 19.6 = 914.1 GiB

What I think is the right way (awscli):

aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021/ --recursive --human-readable --summarize --no-sign-request
output: 298.1 GiB

aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full/ --recursive --human-readable --summarize --no-sign-request
output: 298.2 GiB

aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020/ --recursive --human-readable --summarize --no-sign-request
output: 19.6 GiB

298.1 + 298.2 + 19.6 = 615.9 GiB

It would be good to append a trailing / to every bucket path (for all datasets in Slingshot v3) and then recalculate all dataset sizes.
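
To illustrate why the slash matters: the path given to aws s3 ls --recursive appears to act as a plain string prefix, so cvpr_challenge_2021 also matches keys under cvpr_challenge_2021_full/. That would explain why the no-slash total (596.3 GiB) equals 298.1 + 298.2 GiB. The following is a minimal offline sketch of that matching, not a call to the real bucket; the sample keys and the count_matches helper are made up for illustration.

```shell
# Hypothetical sample keys standing in for a real bucket listing.
keys='agriculture-vision/cvpr_challenge_2021/a.tar.gz
agriculture-vision/cvpr_challenge_2021_full/b.tar.gz
agriculture-vision/cvpr_paper_2020/c.tar.gz'

# count_matches PREFIX: count the sample keys that start with PREFIX.
# The quoted "$prefix" is matched literally; the trailing * matches the rest.
count_matches() {
  prefix=$1
  n=0
  while IFS= read -r key; do
    case $key in
      "$prefix"*) n=$((n + 1)) ;;
    esac
  done <<EOF
$keys
EOF
  echo "$n"
}

count_matches 'agriculture-vision/cvpr_challenge_2021'    # prints 2 (also matches _full)
count_matches 'agriculture-vision/cvpr_challenge_2021/'   # prints 1
```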

Thank you for the fast correction

orvn commented 1 year ago

Reviewing this, but the diff does show a discrepancy of 4423 files when excluding the trailing slash:

s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020 /

aueam commented 1 year ago

Did you mean cvpr_challenge_2021? Because cvpr_paper_2020 always shows me 9 objects with and without /

[example@example]$ aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020/ --recursive --human-readable --summarize --no-sign-request
2021-03-08 22:42:10    0 Bytes agriculture-vision/cvpr_paper_2020/
2021-03-12 12:22:25   45.4 KiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision Dataset Terms of Use.pdf
2021-03-12 12:22:24    5.5 MiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision- A Large Aerial Image Database for Agricultural Pattern Analysis.pdf
2021-03-12 12:22:25    1.8 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_miniscale.tar.gz
2021-03-12 12:22:25   13.9 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_splits.json
2021-03-12 12:22:25   12.7 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_miniscale.tar.gz
2021-03-12 12:22:25   29.4 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_splits.json
2021-03-12 12:22:25    5.1 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_miniscale.tar.gz
2021-03-12 12:22:25   27.2 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_splits.json

Total Objects: 9
   Total Size: 19.6 GiB
[example@example]$ aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_paper_2020 --recursive --human-readable --summarize --no-sign-request
2021-03-08 22:42:10    0 Bytes agriculture-vision/cvpr_paper_2020/
2021-03-12 12:22:25   45.4 KiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision Dataset Terms of Use.pdf
2021-03-12 12:22:24    5.5 MiB agriculture-vision/cvpr_paper_2020/Agriculture-Vision- A Large Aerial Image Database for Agricultural Pattern Analysis.pdf
2021-03-12 12:22:25    1.8 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_miniscale.tar.gz
2021-03-12 12:22:25   13.9 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2017_splits.json
2021-03-12 12:22:25   12.7 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_miniscale.tar.gz
2021-03-12 12:22:25   29.4 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2018_splits.json
2021-03-12 12:22:25    5.1 GiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_miniscale.tar.gz
2021-03-12 12:22:25   27.2 KiB agriculture-vision/cvpr_paper_2020/Dataset/data2019_splits.json

Total Objects: 9
   Total Size: 19.6 GiB

Could you please be more specific about those discrepancies?

orvn commented 1 year ago

@aueam, sorry, I meant 2021, yes.

Checking just the file names against each other, regardless of the exact path, we can diff all the files listed under the naked (no slash) path against the concatenation of the trailing-slash listing and the _full listing.

diff -c \
<( aws s3 ls "s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021"  --recursive --human-readable --summarize --no-sign-request | grep -Eo "^\s?[0-9]{4}.+" ) \
<( cat \
  <(aws s3 ls "s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021/" --recursive --human-readable --summarize --no-sign-request)\
  <(aws s3 ls s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021_full --recursive --human-readable --summarize --no-sign-request) | grep -Eo "^\s?[0-9]{4}.+" ) \
| grep -E "^-" \
| awk '{ print substr($0, index($0,$6)) }' \
| grep -Eoi "[^/]+.[a-z0-9]{2,4}$" \
| sort | uniq | grep -Ev "\t"

Since this yields no results, there are 0 net new files in the 914.1 GiB vs the 615.9 GiB. So you're correct, 615.9 GiB is the right size.
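
The same idea can be sketched as a set difference with sort and comm. This is only an offline illustration under stated assumptions: the three key lists are made-up stand-ins for the aws s3 ls outputs, and it compares full keys rather than bare file names as the command above does.

```shell
# Hypothetical key lists standing in for the three listings.
no_slash='cvpr_challenge_2021/a.tar.gz
cvpr_challenge_2021_full/b.tar.gz'
with_slash='cvpr_challenge_2021/a.tar.gz'
full='cvpr_challenge_2021_full/b.tar.gz'

tmp=$(mktemp -d)
printf '%s\n' "$no_slash" | sort > "$tmp/left"
printf '%s\n%s\n' "$with_slash" "$full" | sort > "$tmp/right"

# comm -23 prints only the lines that are unique to the first file.
only_in_no_slash=$(comm -23 "$tmp/left" "$tmp/right")
rm -rf "$tmp"

if [ -z "$only_in_no_slash" ]; then
  echo "no extra keys"
else
  printf '%s\n' "$only_in_no_slash"
fi
```

An empty difference means the no-slash listing contains nothing beyond the two slash-terminated listings combined.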

@dkkapur so if we're scoping by directory within a bucket, we want to make sure that we always include a trailing / in the path.
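
One way to enforce this in a sizing script is to normalize every path before listing it. A minimal sketch follows; with_trailing_slash is a hypothetical helper name, not part of awscli, and the aws call is left as a comment so the snippet runs offline.

```shell
# with_trailing_slash PATH: print PATH, appending "/" if it does not already end with one.
with_trailing_slash() {
  case $1 in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

path='s3://intelinair-data-releases/agriculture-vision/cvpr_challenge_2021'
with_trailing_slash "$path"

# Intended use (not executed here):
# aws s3 ls "$(with_trailing_slash "$path")" --recursive --human-readable --summarize --no-sign-request
```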

orvn commented 1 year ago

@aueam, the dataset is updated with the new size! (You might have to refresh the web app to see the latest changes.)

Thanks for reporting this issue.