Azure / azureml-examples

Official community-driven Azure Machine Learning examples, tested with GitHub Actions.
https://docs.microsoft.com/azure/machine-learning
MIT License

file/folder upload Option using Datastore class of azure-ai-ml package (AZUREML Python SDK V2) #2071

Open ylnhari opened 1 year ago

ylnhari commented 1 year ago

Describe your suggestion

I am developing ML applications with the azure-ai-ml package (Azure ML Python SDK v2), where we frequently upload files to blob storage that is also attached to the ML workspace as a datastore. I know we could use the Azure Blob Storage SDKs to upload files, but how can this be done with the newer SDK? After going through the documentation, I noticed there is no such method in the newer SDK (azure-ai-ml package). In the Azure ML Python SDK v1 (azureml.core), a file-upload method (upload_files()) is provided on the Datastore class. Why is it omitted in the newer version of the package? Is that intentional?
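For context, a minimal sketch of the v1 pattern referred to above, assuming the workspace's default workspaceblobstore datastore and placeholder file and folder names:

from azureml.core import Datastore, Workspace

# v1 (azureml.core): the Datastore object exposes an upload method directly.
ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")
datastore.upload_files(
    files=["./file_name.csv"],
    target_path="parent_folder/child_folder/",
    overwrite=True,
    show_progress=True,
)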

Additional details

No response

iamramengirl commented 1 year ago

@ylnhari I'm also looking for the equivalent of upload_files or upload_folder in v2 but can't seem to find one. However, I found the following, which may be helpful. The first notebook example stores an output to a datastore upon submission of a training job: MachineLearningNotebooks/how-to-use-scriptrun.ipynb at master · Azure/MachineLearningNotebooks (github.com). The second sample shows an approach of using BlobClient to upload a file to the storage account: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?tabs=Python-SDK%2Cpandas%2Cdatastore#mltable-file-examples
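For what it's worth, a minimal sketch of that BlobClient route; the account URL, container name, and blob path below are placeholders, not values from this thread:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

# Placeholders: substitute your own storage account, container, and blob path.
blob_client = BlobClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    container_name="<container-name>",
    blob_name="parent_folder/child_folder/file_name.csv",
    credential=DefaultAzureCredential(),
)

# Upload a local file; overwrite=True replaces an existing blob with the same name.
with open("./file_name.csv", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)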

ylnhari commented 1 year ago

@iamramengirl, thanks for your help! I have already found a way to upload files as blobs using the Azure ML Python SDK v2 stack and have successfully used it in my applications.

iamramengirl commented 1 year ago

@ylnhari thank you for your reply. Could you share what method you are using in v2 in place of the v1 upload_files/upload_folder? Thank you in advance!

ylnhari commented 1 year ago

@iamramengirl, I wrote a custom function that is very similar to the approach used in the link you have shared. Here is the sample code.

from typing import Union

from azure.core.exceptions import ResourceExistsError
from azure.identity import InteractiveBrowserCredential
from azure.storage.blob import BlobServiceClient


def upload_file_to_blob():
    """Upload a file to a Blob Storage container using the azure-storage-blob SDK."""
    def show_file_progress(uploaded_size: Union[int, float], total_size: Union[int, float]):
        """Print a progress bar while uploading large files."""
        bar_total_length = 20
        percentage_uploaded = int((uploaded_size * 100) / total_size)
        current_bar_length = int(percentage_uploaded * bar_total_length / 100)
        progress_bar = '|' + '#' * current_bar_length + '|' + str(percentage_uploaded) + '% Completed'
        print('\r' + progress_bar, end='', flush=True)

    interactive_credential = InteractiveBrowserCredential()
    blob_service_client = BlobServiceClient('STORAGE_ACCOUNT_URL', credential=interactive_credential)
    container_client = blob_service_client.get_container_client(container='BLOB_STORAGE_CONTAINER_NAME')
    # Blobs have no real folders; the path segments in the blob name act as a virtual folder structure.
    file_name_with_folder_structure_on_blob = '{}/{}/{}'.format('parent_folder', 'child_folder', 'file_name.csv')
    file_path_to_read = "./filename.csv"
    try:
        blob_client = container_client.get_blob_client(file_name_with_folder_structure_on_blob)
        # upload blob/file
        with open(file_path_to_read, 'rb') as data:
            blob_client.upload_blob(data, progress_hook=show_file_progress)
    except ResourceExistsError:
        print("A blob with this name already exists in the container.")

https://github.com/ylnhari/AzurePythonSDKUtilities/blob/37365eba1281242360463ae2fea6c4ddebf80e8d/AzureStorageServices/azure_blob_storage.py#L92-L115

iamramengirl commented 1 year ago

@ylnhari Thank you for sharing! Very much appreciated. Yes, I agree that's a good approach. The credential could also be passed in as an argument for cases where a different identity has to be used, such as a service principal, a managed identity, or DefaultAzureCredential.
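A minimal sketch of that parameterized variant; the function name and defaults are illustrative, not part of the code shared above:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient


def upload_file_to_blob_with_credential(local_path, blob_name, account_url, container_name, credential=None):
    """Illustrative variant: the caller supplies any azure.identity credential
    (service principal via ClientSecretCredential, managed identity, DefaultAzureCredential, ...)."""
    credential = credential or DefaultAzureCredential()
    blob_service_client = BlobServiceClient(account_url, credential=credential)
    container_client = blob_service_client.get_container_client(container=container_name)
    # Overwrite an existing blob instead of raising ResourceExistsError.
    with open(local_path, 'rb') as data:
        container_client.get_blob_client(blob_name).upload_blob(data, overwrite=True)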

ylnhari commented 1 year ago

@iamramengirl, in your opinion, do you believe that there should be more comprehensive examples in Microsoft's documentation that detail the process of migrating from V1 SDK to V2 SDK? Including code snippets that compare how to perform specific tasks in both versions could be particularly useful for developers. I would be interested in contributing to such an effort and would appreciate any guidance you can provide on how to get started and ensure that our contributions are accepted.

iamramengirl commented 1 year ago

@ylnhari Thank you for the message! That's a great idea. Azure has always been open to feedback and suggestions. You may submit your suggestions or requests through the following approaches:

  1. Submit feedback on a specific documentation page: most docs pages have a Feedback link in the footer where you can submit and view feedback for that page or for the product itself.
  2. Visit an Azure tutorial repository, for example: https://github.com/Azure/azureml-examples. Add /contribute at the end of the URL to see the list of feature or enhancement requests. Pick a topic of interest and contribute. Your contribution will be reviewed by the team managing the repository.
  3. Post an idea through the ideation page for a specific product. For example, for Azure Machine Learning: https://feedback.azure.com/d365community/search/?q=azure+machine+learning

Hope you find the above helpful and thanks much again for the interest!

Matthew0x commented 5 months ago

I don't want to sound rude, but how is it possible to release "version 2" of an SDK while simultaneously cutting the majority of data manipulation/ML ops features out of it (it sounds like a regression in this context)? The blob/container/datastore feature is missing, along with e.g. Environment building jobs (not registration) or a "wait_for_completion" flag for normal jobs.

ylnhari's approach is great, but it relies on a separate library/SDK. So it's an architectural issue: the newest "ML SDK" is not sufficient for... ML operations. People just fall back to the old (working) version and the code becomes a mess.

On top of that (slightly off-topic): the blob library runs into issues when used with the authentication methods from azureml.core (401, missing token), which is ironic, because it invalidates the whole purpose of building an authentication class separate from azure.identity. I did try DefaultAzureCredential to see if it would pick up a token automatically after the initial interactive login, but none was found and the call failed with a 401.

I tested the azureml.core approach with InteractiveLoginAuthentication (AD/Entra based, which should be supported for blobs according to the docs), and the server can't find the token in the request (even though this method does work with the v1 and v2 SDK objects!). With the azure.identity methods, e.g. InteractiveBrowserCredential (the approach ylnhari suggested/found), it works just fine, and on top of that the SDK objects/methods are compatible with it.
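For reference, a minimal sketch of the combination that worked for me; the subscription, resource group, workspace, and storage account values below are placeholders:

from azure.ai.ml import MLClient
from azure.identity import InteractiveBrowserCredential
from azure.storage.blob import BlobServiceClient

# One azure.identity credential shared by both SDKs.
credential = InteractiveBrowserCredential()

# Works with the v2 SDK objects...
ml_client = MLClient(credential, "<subscription-id>", "<resource-group>", "<workspace-name>")

# ...and with the storage SDK. azureml.core's InteractiveLoginAuthentication is not an
# azure.core TokenCredential, so BlobServiceClient cannot pull a token from it (hence the 401s).
blob_service_client = BlobServiceClient(
    "https://<storage-account>.blob.core.windows.net", credential=credential
)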

And finally, even more ironically, the examples themselves suggest using azure.identity instead of the methods inside azureml.core... Why? https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?view=azureml-api-2&tabs=sdk#use-interactive-authentication

bhatsuchi08 commented 2 months ago

Totally agree with Matthew A.'s comments. SDK v2 seems non-intuitive and painful! I end up wasting hours trying to figure out how to code basic tasks that were much easier with v1.