broadinstitute / imaging-backup-scripts

Scripts to backup data for the Imaging Platform
MIT License

restore_intelligent missing some files? #23

Open ErinWeisbart opened 2 years ago

ErinWeisbart commented 2 years ago

I'm transferring data from one bucket to another. After un-archiving with restore_intelligent.py and transferring with aws s3 sync, the S3 console's Calculate total size reports 17289 objects in the source bucket and 17287 objects in the destination bucket. I confirmed the 17289 objects in the source bucket with aws s3 ls --human-readable --summarize.

If I run restore_intelligent on the source bucket it returns 17287 total files found pre-filtering (I didn't wait for it to run through the actual restoration again). So why isn't it finding all 17289 files that are actually there?

If I try aws s3 sync again it returns An error occurred (InvalidObjectState) when calling the UploadPartCopy operation: Operation is not valid for the source object's access tier for two files.

If I directly call those files one at a time with restore_intelligent it returns 1 total files found pre-filtering and REQUESTED 1 for each. If I try again shortly thereafter it shows IN_PROGRESS 1. These do seem to be the missing files: they don't exist in the destination bucket, and I can't download them from the source bucket (An error occurred (InvalidObjectState) when calling the GetObject operation: The operation is not valid for the object's access tier), even though I can download adjacent images from the source bucket.

So if these are actually the files it's missing, why does it find them if I call them directly?

(An example folder is Stain2_Batch2_Confocal.)
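For anyone trying to reproduce this, one way to check the two failing keys without going through restore_intelligent is to look at what head_object reports for them. The sketch below is only an illustration, not how restore_intelligent works: the function names and the three labels ("restoring", "archived", "accessible") are made up here, and it assumes the objects are in Intelligent-Tiering, where archived objects carry an ArchiveStatus field and an in-flight restore shows up in the Restore field.

```python
def classify_head_response(head: dict) -> str:
    """Label a boto3 head_object response (labels are hypothetical, for illustration)."""
    restore = head.get("Restore", "")
    # An in-flight restore is reported as ongoing-request="true".
    if 'ongoing-request="true"' in restore:
        return "restoring"
    # Assumption: archived Intelligent-Tiering objects report an ArchiveStatus.
    if head.get("ArchiveStatus") in ("ARCHIVE_ACCESS", "DEEP_ARCHIVE_ACCESS"):
        return "archived"
    return "accessible"


def check_keys(bucket: str, keys, client=None) -> dict:
    """Return {key: label} for each key, using head_object on the given bucket."""
    if client is None:
        import boto3  # imported lazily so the classifier can be used without boto3

        client = boto3.client("s3")
    return {
        key: classify_head_response(client.head_object(Bucket=bucket, Key=key))
        for key in keys
    }
```

Calling check_keys with the source bucket name and the two keys named in the aws s3 sync error would show whether they are still archived or mid-restore. Any key that comes back "archived" or "restoring" would be expected to fail GetObject and UploadPartCopy with InvalidObjectState.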

ErinWeisbart commented 2 years ago

I've replicated the behavior on an additional folder (e.g. Stain2_Batch2_MitoCompare, 1 file difference)

ErinWeisbart commented 2 years ago

Similarly, I'm finding examples where restore_intelligent lists a different number of files without aws s3 sync throwing errors on anything, e.g. QCImages, where the console and aws s3 ls show 2939 in source and 2937 in destination after sync. restore_intelligent on source returns 2930 total files found pre-filtering with RESTORED 2930.

bethac07 commented 2 years ago

My strong suspicion is that sometimes folders "count", and sometimes they don't. It seems to be that an uploaded folder is an object, but then once moved is not, IME. That probably explains the cases with no errors, though I'm not sure there's any way to confirm other than just diffing the two.

Are there any patterns you've discerned so far about what the object names are?
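The diff suggested above could be sketched roughly like this. It is a minimal sketch under a few assumptions: keys are identical in both buckets (no prefix rewrite during the sync), a "folder" object is taken to be a zero-byte key ending in "/", and all function names are hypothetical. It separates the keys missing from the destination into likely folder placeholders and real objects.

```python
def is_folder_marker(key: str, size: int) -> bool:
    """Assumption: a 'folder' object is a zero-byte key that ends with '/'."""
    return key.endswith("/") and size == 0


def diff_listings(source, destination) -> dict:
    """Compare two iterables of (key, size) pairs.

    Returns the keys present in source but absent from destination,
    split into likely folder placeholders and real objects.
    """
    src = dict(source)
    dst = dict(destination)
    missing = sorted(key for key in src if key not in dst)
    return {
        "folder_markers": [k for k in missing if is_folder_marker(k, src[k])],
        "real_objects": [k for k in missing if not is_folder_marker(k, src[k])],
    }


def list_objects(bucket: str, prefix: str = "", client=None):
    """Yield (key, size) for every object under prefix, following pagination."""
    if client is None:
        import boto3  # imported lazily so diff_listings works without boto3

        client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]
```

Passing list_objects output for the source and destination buckets into diff_listings gives the two lists. If everything missing lands in "folder_markers", that would support the folder explanation; anything in "real_objects" is worth checking individually.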

ErinWeisbart commented 2 years ago

I'm guessing you're right that the second set without errors does have something to do with folders being size 0 objects sometimes. The first set is still odd. I haven't replicated it beyond those 2 folders with 2/1 errors, respectively. I'll keep an eye out for more, but we can probably close this in the meantime. Those two folders are alphabetically almost at the beginning of the giant list of folders that you recently un-archived so perhaps they were unarchived with those errors before you finished all the awesome improvements to restore_intelligent and slipped through the cracks.