Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

AzureML SDK V2 unable to load data from onelake datastore #36575

Open mh-hassan18 opened 1 month ago

mh-hassan18 commented 1 month ago

Problem Description
I have created a OneLake datastore in AzureML by following this documentation. Specifically, I created the datastore for the Files section of my lakehouse in Fabric. The datastore was created successfully, and I can see all of the data in AzureML through the datastore's browse mode, as shown below:

[screenshot: datastore_browse]

Now when I try to load a CSV file from the datastore using the usage code already provided in the portal (datastore browse -> CSV file -> copy usage code), the code does not work.

[screenshot: datastore_browse_copy_usage_code]

Here is the usage code:

import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

uri = "uri_here"
df = pd.read_csv(uri)

The above code does not work and gives the following error:

Message: rslex failed, falling back to clex. Payload: {"pid": 4312, "rslex_version": "2.18.3", "version": "4.11.3"} ExecutionError: Error Code: ScriptExecution.DatastoreResolution.Unexpected Failed Step: 782d9174-dcf9-420c-bbbc-c8c28dc9b01e Error Message: ScriptExecutionException was caused by DatastoreResolutionException. DatastoreResolutionException was caused by UnexpectedException. Unexpected failure while fetching info for Datastore 'devops_onelake_poc_datascience_files' in subscription: 'subs_id_here', resource group: 'resource_group_here', workspace: 'workspace_name_here'. Using base service url: https://northeurope.experiments.azureml.net./ Unable to deserialize the response. | session_id=l_c67e09b7-6567-4371-84ab-e49d32322aa6
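For reference, the `Payload` fragment in the error above is plain JSON, so the engine versions it reports can be extracted programmatically. This is a standalone diagnostic sketch; the log line is copied from the error above, and the regex and variable names are illustrative, not part of any SDK.

```python
import json
import re

# Log line copied (abbreviated) from the error message above.
log_line = (
    'Message: rslex failed, falling back to clex. '
    'Payload: {"pid": 4312, "rslex_version": "2.18.3", "version": "4.11.3"}'
)

# Pull the embedded JSON payload out of the log line and parse it,
# so the local engine versions can be inspected or compared.
match = re.search(r"Payload: (\{.*\})", log_line)
payload = json.loads(match.group(1))
print(payload["rslex_version"])  # -> 2.18.3
print(payload["version"])        # -> 4.11.3
```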

I even tried registering the CSV file as a data asset from the portal and then used the consume code from the registered asset, but that also did not work.

Here is where I registered the CSV file as a data asset: [screenshot: register_as_data_asset]

Here is where I copied the consume code of the registered data asset: [screenshot: consume_registered_data_asset]

Here is the code that I copied from the "Consume" tab of the registered data asset:

import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("FABRIC_BRAZIL_PROMOTION_2", version="1")

df = pd.read_csv(data_asset.path)

But the above code also does not work and gives the following error:

Message: rslex failed, falling back to clex. Payload: {"pid": 4312, "rslex_version": "2.18.3", "version": "4.11.3"} ExecutionError: Error Code: ScriptExecution.DatastoreResolution.Unexpected Failed Step: 9c0f36e5-086c-45fd-b35c-bde43380ff04 Error Message: ScriptExecutionException was caused by DatastoreResolutionException. DatastoreResolutionException was caused by UnexpectedException. Unexpected failure while fetching info for Datastore 'devops_onelake_poc_datascience_files' in subscription: 'sub_id_here', resource group: 'resource_group_here', workspace: 'workspace_name_here'. Using base service url: https://northeurope.experiments.azureml.net./ Unable to deserialize the response. | session_id=l_c67e09b7-6567-4371-84ab-e49d32322aa6

To Reproduce
Steps to reproduce the behavior:

  1. Create a OneLake datastore for the Files section of a Fabric lakehouse with AzureML.
  2. Try to read files from the datastore using the AzureML SDK v2 (1.18.0).

Expected behavior
We should be able to load data from a OneLake datastore by using the "consume/usage" code provided in the portal. This works for other types of datastores and should work for OneLake datastores as well.

Additional context
We also created a datastore for the Tables section of our lakehouse in Fabric. The datastore was created successfully and we were able to browse the data from the portal, but when it comes to consuming or loading the data, we faced the same issues as with the datastore for the Files section of the lakehouse. The documentation says: "At this time, Machine Learning supports connection to Microsoft Fabric lakehouse artifacts in "Files" folder that include folders or files and Amazon S3 shortcuts." But we were able to create a datastore for the Tables section as well, so is the Tables section also supported now, or not?

github-actions[bot] commented 1 month ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

swathipil commented 1 month ago

Hi @mh-hassan18 - Thanks for the detailed info. We'll take a look asap!

achauhan-scc commented 1 month ago

@mh-hassan18 - So far, I am unable to see a problem in your code; using the same code, I am able to fetch the data. A few things to recheck:

In the first method, your URI format should look like the following:

uri = "azureml://subscriptions/sub_guid (xxxxxxx)/resourcegroups/achauxxxx/workspaces/workspace-name/datastores/datastore-name/paths/filename"
df = pd.read_csv(uri)
df

In the second method, print data_asset.path; it should match the same path format as above:

ml_client = MLClient(credential, subscription_id, resource_group, workspace_name=workspace_name)
data_asset = ml_client.data.get("one_data_set", version="1")

df1 = pd.read_table(data_asset.path)
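As a sanity check on the URI format described above, a small helper can verify that a string follows the long-form azureml:// datastore layout. This is an illustrative sketch only: the segment names come from the comment above, the helper is not an SDK API, and the sample subscription/workspace values are placeholders.

```python
def looks_like_datastore_uri(uri: str) -> bool:
    """Check that a URI has the azureml:// long-form datastore shape."""
    required = ["subscriptions/", "resourcegroups/", "workspaces/",
                "datastores/", "paths/"]
    if not uri.startswith("azureml://"):
        return False
    rest = uri[len("azureml://"):]
    # Each segment must appear, in this order.
    pos = 0
    for seg in required:
        idx = rest.find(seg, pos)
        if idx == -1:
            return False
        pos = idx + len(seg)
    return True

# Placeholder values, matching the format shown in the comment above.
uri = ("azureml://subscriptions/00000000-0000-0000-0000-000000000000/"
       "resourcegroups/my-rg/workspaces/my-ws/"
       "datastores/my-datastore/paths/data/sample.csv")
print(looks_like_datastore_uri(uri))  # -> True
```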

mh-hassan18 commented 1 month ago

@achauhan-scc

Thank you for your response.

I confirm that my URI is in exactly the same format as you specified (I am using the same code from "copy usage code" in the portal UI, as mentioned above).

In the second method, when I print data_asset.path, it matches the above URI.

But I am still getting the same issue as described in my original post.

mh-hassan18 commented 1 month ago

@achauhan-scc

Here is another update. Earlier, I was using the Python 3.10 - SDK V2 kernel. I have now tried the same code after switching to the Python 3.8 - AzureML kernel, and it works fine for the datastore I created for the Files section of my Fabric lakehouse.

But with the same Python 3.8 - AzureML kernel, when loading anything from the datastore I created for the Tables section of my Fabric lakehouse, I get the following error:

"ValueError: No objects to concatenate"

So to summarize:

  1. The OneLake datastore created for the Files section of the Fabric lakehouse works fine with the Python 3.8 - AzureML kernel but gives issues with the Python 3.10 - SDK V2 kernel (issue details are listed above).
  2. The OneLake datastore created for the Tables section of the Fabric lakehouse does not work with either kernel. With the Python 3.10 - SDK V2 kernel, it gives "ScriptExecutionException was caused by DatastoreResolutionException." (as listed above); with the Python 3.8 - AzureML kernel, it gives "ValueError: No objects to concatenate".

Note: in both kernels, I am using azure-ai-ml 1.18.0.
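For reference, "ValueError: No objects to concatenate" is the error pandas raises when pd.concat is handed an empty sequence, which typically happens when a datastore path or glob resolves to no files. A minimal reproduction, assuming only that pandas is installed:

```python
import pandas as pd

# pandas raises this exact ValueError when asked to concatenate zero
# frames, which is what happens when a path matches no files.
try:
    pd.concat([])
except ValueError as exc:
    print(exc)  # -> No objects to concatenate
```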

achauhan-scc commented 1 month ago

@mh-hassan18 - you need to update the versions to azureml-dataprep-rslex >= 2.19.2 and azureml-dataprep >= 4.12.1, as the versions you are using (rslex 2.18.3 and dataprep 4.11.3) do not contain the changes needed to handle OneLake datastores when materializing them into a pandas dataframe.
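The minimum versions mentioned above can be gated with a small comparison helper. This is a simplified sketch that handles plain dotted version strings only (the `packaging` library's Version class is the robust alternative); the version strings below come from the error payload and the fix described above.

```python
def at_least(installed: str, minimum: str) -> bool:
    """Compare plain dotted version strings numerically, part by part."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(minimum)

# Versions reported in the error payload vs. the required minimums:
print(at_least("2.18.3", "2.19.2"))  # -> False (rslex needs upgrade)
print(at_least("4.11.3", "4.12.1"))  # -> False (dataprep needs upgrade)
print(at_least("2.19.2", "2.19.2"))  # -> True
```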

mh-hassan18 commented 1 month ago

Hi @achauhan-scc thank you for the support.

I updated the versions of the above two libraries, and everything worked fine for the Python 3.10 - SDK V2 kernel.

But I have run into one more issue. Surprisingly, when I tested the same script today with the Python 3.8 - AzureML kernel, it is not working, even though the same script worked fine the other day with this kernel, as described in my last comment. When executing the following script with the Python 3.8 - AzureML kernel:


import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
uri = "uri_here"
df = pd.read_csv(uri)
df

I get the following error:

ImportError: Unable to load filesystem from EntryPoint(name='azureml', value='azureml.fsspec.AzureMachineLearningFileSystem', group='fsspec.specs')
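The ImportError above indicates that fsspec found the azureml entry point but could not import its target class, azureml.fsspec.AzureMachineLearningFileSystem, which usually points to a broken or mismatched install in that kernel. As a standard-library-only diagnostic sketch (the function name is illustrative), the filesystems registered under the fsspec.specs entry-point group can be listed like this:

```python
from importlib.metadata import entry_points

def fsspec_entrypoint_names() -> list:
    """Return the names of all entry points in the fsspec.specs group."""
    eps = entry_points()
    # Python 3.10+ exposes .select(); older interpreters return a dict.
    if hasattr(eps, "select"):
        group = eps.select(group="fsspec.specs")
    else:
        group = eps.get("fsspec.specs", [])
    return sorted(ep.name for ep in group)

# On the failing kernel, 'azureml' should appear in this list even though
# importing its target module fails.
print(fsspec_entrypoint_names())
```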

I get the same issue with the following code as well (again with the Python 3.8 - AzureML kernel):

import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("FABRIC_BRAZIL_PROMOTION_2", version="1")
print(data_asset.path)

df = pd.read_csv(data_asset.path)
df

Note: I have not changed anything in the Python 3.8 - AzureML kernel. Here are the library versions in this kernel: azureml-dataprep = 4.12.1, azureml-dataprep-rslex = 2.19.2, azure-ai-ml = 1.18.0.

mh-hassan18 commented 1 month ago

@achauhan-scc Any update on this?