ASFHyP3 / OpenData

Supporting our datasets available via AWS OpenData

Verify contents of `its-live-open` and `its-live-project` buckets #15

Closed: jtherrmann closed this issue 9 months ago

jtherrmann commented 10 months ago

@asjohnston-asf provided text files at s3://asj-dev/opendata/ containing sorted project keys from the inventory reports for the its-live-data, its-live-open, and its-live-project buckets.

jtherrmann commented 10 months ago

I noticed that its-live-open contained extra prefixes left over from when we were attempting to sync the entirety of its-live-data to its-live-open, before we had narrowed down the list of prefixes that should be transferred. I removed the extra prefixes:

aws --profile opendata-its-live s3 rm --recursive --only-show-errors s3://its-live-open/L7_PV_fix/
aws --profile opendata-its-live s3 rm --recursive --only-show-errors s3://its-live-open/NSIDC/
aws --profile opendata-its-live s3 rm --recursive --only-show-errors s3://its-live-open/Test/
aws --profile opendata-its-live s3 rm --recursive --only-show-errors s3://its-live-open/catalog_geojson_latest/
aws --profile opendata-its-live s3 rm --recursive --only-show-errors s3://its-live-open/catalog_geojson_original/

I then confirmed that its-live-open contains only the expected prefixes:

$ aws s3 ls s3://its-live-open/                    
                           PRE autorift_parameters/
                           PRE catalog_geojson/
                           PRE composites/
                           PRE datacubes/
                           PRE mosaics/
                           PRE rgb_mosaics/
                           PRE vel_web_tiles/
                           PRE velocity_image_pair/
$ aws s3 ls s3://its-live-open/velocity_image_pair/
                           PRE landsatOLI/
                           PRE sentinel1/
                           PRE sentinel2/
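
The same check can be scripted rather than done by eye. Below is a minimal sketch, assuming the `aws s3 ls` output has been captured as text; the function names `listed_prefixes` and `compare_prefixes` are hypothetical and not part of this repository. It splits each line on whitespace, so it would mis-handle prefixes that contain spaces.

```python
def listed_prefixes(ls_output: str) -> set[str]:
    """Extract prefix names from `aws s3 ls` lines of the form 'PRE name/'."""
    prefixes = set()
    for line in ls_output.splitlines():
        parts = line.split()
        # Prefix lines have exactly two fields: the literal 'PRE' and the name.
        if len(parts) == 2 and parts[0] == "PRE":
            prefixes.add(parts[1])
    return prefixes


def compare_prefixes(ls_output: str, expected: set[str]) -> tuple[set[str], set[str]]:
    """Return (unexpected, missing) prefixes relative to the expected set."""
    actual = listed_prefixes(ls_output)
    return actual - expected, expected - actual
```

Both returned sets being empty means the listing contains exactly the expected prefixes.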
jtherrmann commented 10 months ago

I created a list of the objects we should expect to see in its-live-open based on the list of prefixes under "(2) user data" at https://github.com/ASFHyP3/OpenData/issues/10#issuecomment-1850890480:

grep -e '^autorift_parameters/' -e '^catalog_geojson/' -e '^composites/' -e '^datacubes/' -e '^mosaics/' -e '^rgb_mosaics/' -e '^vel_web_tiles/' -e '^velocity_image_pair/landsatOLI/' -e '^velocity_image_pair/sentinel1/' -e '^velocity_image_pair/sentinel2/' data_keys_sorted_20231218T0100.txt > expected_open_keys_from_data_keys.txt

I also created a list of the actual contents of its-live-open, filtering out the extra prefixes that were deleted (see above):

grep -v -e '^L7_PV_fix/' -e '^NSIDC/' -e '^Test/' -e '^catalog_geojson_latest/' -e '^catalog_geojson_original/' open_keys_sorted_20231217T0100.txt > open_keys_filtered.txt
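
For reference, the two `grep` filters above could also be expressed in Python. This is only a sketch of the same anchored-prefix filtering, not something that was run for this issue; the helper names `keep_keys` and `drop_keys` are hypothetical. It relies on `str.startswith` accepting a tuple of prefixes, which matches the `^prefix` patterns because none of these prefixes contain regex metacharacters.

```python
from typing import Iterable, Iterator

# Prefixes copied from the grep commands above.
KEEP = (
    "autorift_parameters/", "catalog_geojson/", "composites/", "datacubes/",
    "mosaics/", "rgb_mosaics/", "vel_web_tiles/",
    "velocity_image_pair/landsatOLI/", "velocity_image_pair/sentinel1/",
    "velocity_image_pair/sentinel2/",
)
DROP = (
    "L7_PV_fix/", "NSIDC/", "Test/",
    "catalog_geojson_latest/", "catalog_geojson_original/",
)


def keep_keys(keys: Iterable[str], prefixes: tuple[str, ...]) -> Iterator[str]:
    """Like `grep -e '^p1' -e '^p2' ...`: yield keys starting with any prefix."""
    return (key for key in keys if key.startswith(prefixes))


def drop_keys(keys: Iterable[str], prefixes: tuple[str, ...]) -> Iterator[str]:
    """Like `grep -v -e '^p1' ...`: yield keys starting with none of the prefixes."""
    return (key for key in keys if not key.startswith(prefixes))
```

Because both helpers are lazy generators, they can be applied to an open file object line by line without loading a large key list into memory.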

I confirmed that the expected contents match the actual contents:

$ du expected_open_keys_from_data_keys.txt open_keys_filtered.txt 
52164724        expected_open_keys_from_data_keys.txt
52164724        open_keys_filtered.txt
$ sha256sum expected_open_keys_from_data_keys.txt open_keys_filtered.txt 
1fcc1c297301eaface0c2c93db9e1879ae828ee99d0165ca57374591c2d5ce08  expected_open_keys_from_data_keys.txt
1fcc1c297301eaface0c2c93db9e1879ae828ee99d0165ca57374591c2d5ce08  open_keys_filtered.txt
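
Note that `du` reports disk usage, so matching `du` values only suggest the files are the same size; the matching SHA-256 digests are the stronger evidence that the two key lists are identical. A minimal Python sketch of the same digest comparison follows, reading in chunks because these files are large. The function names `sha256_of` and `same_contents` are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks by default."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a: str, path_b: str) -> bool:
    """True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_contents("expected_open_keys_from_data_keys.txt", "open_keys_filtered.txt")` would correspond to the `sha256sum` comparison above.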
jtherrmann commented 9 months ago

I ran https://github.com/ASFHyP3/OpenData/blob/batch-transfer/batch-transfer/check_sizes.py to calculate the total size of the s3://its-live-open contents, as well as the expected total size (from the s3://its-live-data contents that were transferred into its-live-open). The two values exactly match:

Final output line from python check_sizes.py expected-open:

Total size: 97618572102677

Final output line from python check_sizes.py actual-open:

Total size: 97618572102677
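
To illustrate the idea behind this size check (this is not the actual check_sizes.py, which is linked above), here is a rough sketch that sums object sizes for keys under a set of prefixes. It assumes the inventory data has already been reduced to simple `key,size` rows; real S3 inventory reports have a configurable column layout, so the two-column format and the name `total_size` are assumptions made for illustration only.

```python
import csv
import io
from typing import Iterable, Sequence


def total_size(rows: Iterable[Sequence[str]], prefixes: tuple[str, ...]) -> int:
    """Sum the size column for rows whose key starts with any of the prefixes.

    Each row is assumed to be (key, size); this layout is hypothetical.
    """
    total = 0
    for key, size in rows:
        if key.startswith(prefixes):
            total += int(size)
    return total


if __name__ == "__main__":
    # Tiny made-up sample, only to show the call shape.
    sample = io.StringIO("datacubes/a.zarr,100\nNSIDC/b.nc,7\nmosaics/c.nc,23\n")
    print("Total size:", total_size(csv.reader(sample), ("datacubes/", "mosaics/")))
```

Running the same function over the expected keys and the actual keys and comparing the two totals mirrors the `expected-open` and `actual-open` runs above.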
jtherrmann commented 9 months ago

We moved some prefixes around today, per https://github.com/ASFHyP3/OpenData/issues/10#issuecomment-1850890480. Since the checklist of prefixes in that comment is getting a bit complex, I copied it into a text editor and regrouped the prefixes into project prefixes, user prefixes, and prefixes to be deleted.

I then listed the its-live-project and its-live-open buckets and confirmed that the prefixes that are currently present in each bucket (as of today, 2023-12-21) exactly match the prefix lists that I generated based on the checklist from https://github.com/ASFHyP3/OpenData/issues/10#issuecomment-1850890480.

This gives me greater confidence that we've transferred everything correctly. We should get someone from ITS_LIVE to approve these lists of prefixes, and then do a final verification of keys and total bucket size (using S3 inventory reports), similar to what we did above.

Here are the lists:

Contents of its-live-project bucket:

L7_PV_fix/
Test/
elevation/
isce_autoRIFT/
month-data-logs/
s3-inventory/
test/
test_datacubes/
    test_datacubes/forAlex/
    test_datacubes/mosaics/
    test_datacubes/s1_correction/
    test_datacubes/validate_v2_granule_crop/
velocity_image_pair/
    velocity_image_pair/landsatOLI-latest/
    velocity_image_pair/sentinel1-backup/
    velocity_image_pair/sentinel1-corrected-8granules/
    velocity_image_pair/sentinel1-corrected/
    velocity_image_pair/sentinel1-latest/
    velocity_image_pair/sentinel2-latest/

Contents of its-live-open bucket:

autorift_parameters/
catalog_geojson/
composites/
datacubes/
documentation/
height_change/
ice_masks/
mosaics/
qgis_project/
rgb_mosaics/
vel_web_tiles/
velocity_image_pair/
    velocity_image_pair/landsatOLI/
    velocity_image_pair/sentinel1/
    velocity_image_pair/sentinel2/
velocity_mosaic/

Prefixes that will be deleted, along with anything else in the its-live-data bucket that was not transferred to one of the other two buckets:

NSIDC/
ice_shelf/
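
The three lists above could be encoded so that any key can be checked against them mechanically. The following is a sketch only: the prefixes are copied from the lists above, the `classify` helper is hypothetical, and it assumes a key carries the same prefix name as in these lists (which may not hold for prefixes that were renamed during transfer). Because `velocity_image_pair/` appears in both buckets, it is represented by its sub-prefixes.

```python
GROUPS = {
    "open": (
        "autorift_parameters/", "catalog_geojson/", "composites/", "datacubes/",
        "documentation/", "height_change/", "ice_masks/", "mosaics/",
        "qgis_project/", "rgb_mosaics/", "vel_web_tiles/",
        "velocity_image_pair/landsatOLI/", "velocity_image_pair/sentinel1/",
        "velocity_image_pair/sentinel2/", "velocity_mosaic/",
    ),
    "project": (
        "L7_PV_fix/", "Test/", "elevation/", "isce_autoRIFT/", "month-data-logs/",
        "s3-inventory/", "test/", "test_datacubes/",
        "velocity_image_pair/landsatOLI-latest/",
        "velocity_image_pair/sentinel1-backup/",
        "velocity_image_pair/sentinel1-corrected-8granules/",
        "velocity_image_pair/sentinel1-corrected/",
        "velocity_image_pair/sentinel1-latest/",
        "velocity_image_pair/sentinel2-latest/",
    ),
    "delete": ("NSIDC/", "ice_shelf/"),
}


def classify(key: str) -> str:
    """Return the group whose prefixes match the key, or 'unassigned' if none do."""
    matches = [name for name, prefixes in GROUPS.items() if key.startswith(prefixes)]
    if len(matches) > 1:
        # A key matching two groups would mean the prefix lists overlap.
        raise ValueError(f"{key!r} matches multiple groups: {matches}")
    return matches[0] if matches else "unassigned"
```

Keys that come back as `unassigned` fall under "anything else" in the description above and would be worth reviewing before deletion.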

TODO:

jtherrmann commented 9 months ago

I updated https://github.com/ASFHyP3/OpenData/blob/batch-transfer/batch-transfer/check_sizes.py and re-ran it to compare the total size of the its-live-open bucket against the transferred prefixes from the its-live-data bucket, using the latest inventory reports. The output shows an exact match: a total size of 98131541328013 bytes for the relevant prefixes in both buckets.