Azure / azure-storage-azcopy

The new Azure Storage data transfer utility - AzCopy v10

Filtering attributes/tags/access tier when transferring blobs between storage accounts #2621

Open catalin-micu opened 6 months ago

catalin-micu commented 6 months ago

AzCopy 10.23

Linux OS

azcopy copy "source_storage_account_container" "destination_storage_account_container" --recursive

Problem: Copying entire storage containers and using azcopy to filter some blobs

There is an unpredictable amount of data, scattered throughout the container, that we want to filter out; in total we are talking about petabytes of data. We can identify all the data that needs to be filtered. Due to internal policies, we cannot alter the data (we cannot rename it or add a prefix or anything of the sort, therefore we cannot use --exclude-pattern or --exclude-regex), nor can we archive it. These two options are out of the question.

What I want to do is filter data during a storage-account-to-storage-account transfer with azcopy copy, based on either a tag, the access tier (everything is currently in the hot tier, but unwanted data can be moved to cool or cold), or any other blob attribute that can be assigned to the data, without changing names or directory structure and without archiving.

Can this be done?
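For reference, the tier change itself (moving the unwanted blobs to cool) can be done in place, without renaming anything. A minimal sketch, assuming a recent AzCopy build that includes the azcopy set-properties command; the account, container, blob path, and SAS token are placeholders:

# Move one blob from hot to cool without touching its name or location.
azcopy set-properties "https://srcaccount.blob.core.windows.net/container/path/to/blob?<SAS>" --block-blob-tier=Cool

Whether a subsequent account-to-account copy can then be restricted to hot-tier blobs is exactly the open question of this issue.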

souravgupta-msft commented 6 months ago

Hi @catalin-micu, filtering blobs by tags or access tier is currently not supported. You can filter blobs during copy in other ways, for example by name pattern (--include-pattern, --exclude-pattern, --include-regex, --exclude-regex) or by last modified time (--include-before, --include-after).
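For illustration, a sketch of the name-pattern route; the URLs and SAS tokens are placeholders, and the flags are standard azcopy copy options:

# Service-to-service copy that skips blobs whose names match the patterns.
azcopy copy \
  "https://srcaccount.blob.core.windows.net/container?<src-SAS>" \
  "https://dstaccount.blob.core.windows.net/container?<dst-SAS>" \
  --recursive \
  --exclude-pattern "bad-prefix*;*.tmp"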

catalin-micu commented 6 months ago

All my data is Block blob at the moment. Is there a way to change that?

souravgupta-msft commented 6 months ago

Do you mean changing the blob type from Block Blob to Append Blob or Page Blob? If yes, then there is no direct way to do that. Can you use last modified time for filtering the blobs during copy?
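A sketch of the last-modified-time route, again with placeholder URLs; --include-after keeps only blobs modified on or after the given ISO 8601 timestamp (--include-before is the mirror image):

# Copy only blobs modified on or after the cutoff.
azcopy copy \
  "https://srcaccount.blob.core.windows.net/container?<src-SAS>" \
  "https://dstaccount.blob.core.windows.net/container?<dst-SAS>" \
  --recursive \
  --include-after "2024-01-01T00:00:00Z"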

catalin-micu commented 6 months ago

Yes, I meant changing the blob type; I understand that's not possible. I can't use the last modified timestamp either, because there is no pattern to when this data was uploaded. The situation is this: over the course of years, from time to time, wrong data was uploaded. Now I need to move the whole content of the storage account, preferably filtering out this wrong data. The only thing I can identify about the wrong data is the directory name. All directory names (for both good and bad data) are UUIDs, so no pattern filtering is possible there. We are talking hundreds to thousands of directories, so listing each name to filter in the AzCopy command is also not an option.

Is there anything else worth trying? I was leaning towards filtering based on blob tags or blob attributes, but that does not seem to be possible.

souravgupta-msft commented 6 months ago

What blob attribute do you want to use for filtering the wrong data (other than tags or access tier)?

catalin-micu commented 6 months ago

I don't have anything specific in mind; basically anything that I can set to a known value on all the wrong data and then pass to azcopy as a filter would work, be it a blob property, a directory property, anything.

catalin-micu commented 6 months ago

Alright, I see a feature-request label was added. To summarize: I would prefer to filter by access tier.

schoag-msft commented 6 months ago

Blob Inventory (https://learn.microsoft.com/azure/storage/blobs/blob-inventory) captures metadata/attributes on objects, such as access tier. You could use a Blob Inventory report as an input to AzCopy via the --list-of-files parameter (https://github.com/Azure/azure-storage-azcopy/wiki/Listing-specific-files-to-transfer).
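As a rough sketch of that pipeline, assuming a CSV-format inventory report whose rule includes the Name and AccessTier fields. The column positions below are assumptions (check your report's header), and the plain awk split will break on blob names that contain quoted commas, in which case a real CSV parser is needed:

# Keep the relative paths of Hot-tier blobs; strip the CSV quoting.
# The loose /Hot/ match tolerates quoted or unquoted tier values.
awk -F',' 'NR > 1 && $4 ~ /Hot/ { gsub(/"/, "", $1); print $1 }' inventory.csv > files-to-copy.txt

# Feed the list to azcopy; URLs and SAS tokens are placeholders.
azcopy copy \
  "https://srcaccount.blob.core.windows.net/container?<src-SAS>" \
  "https://dstaccount.blob.core.windows.net/container?<dst-SAS>" \
  --list-of-files files-to-copy.txt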

catalin-micu commented 6 months ago

Interesting solution, but sadly it won't work here because of performance concerns. The resulting list of files would have millions of entries, every time, for each of the multiple transfer jobs I will run (200+).