MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International
10.2k stars 21.35k forks source link

How to create Data Asset using Datastore connecting to SQL Server #118516

Open SouravMalliK opened 8 months ago

SouravMalliK commented 8 months ago

Hello,

How do we create a data asset of Tabular type which could store data retrieved from a Azure SQL Database Query used in a Datastore using Azure SDK V2.

I see we could create such dataset using SDK V1 -- https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py

Is there any Class implemented on SDK V2?

BR, Sourav.


Document Details

Do not edit this section. It is required for learn.microsoft.com ➟ GitHub issue linking.

AjayBathini-MSFT commented 8 months ago

@SouravMalliK Thanks for your feedback! We will investigate and update as appropriate.

Naveenommi-MSFT commented 8 months ago

Hi @SouravMalliK To create a data asset of Tabular type which could store data retrieved from an Azure SQL Database Query used in a Datastore using Azure SDK V2, you can use the TabularDatasetFactory class from the azureml.core.dataset_factory module. Here is an example code snippet:

from azureml.core import Workspace, Dataset
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.data.dataset_factory import TabularDatasetFactory

# Authenticate with Azure
interactive_auth = InteractiveLoginAuthentication(tenant_id="<your-tenant-id>")
ws = Workspace(subscription_id="<your-subscription-id>",
               resource_group="<your-resource-group>",
               workspace_name="<your-workspace-name>",
               auth=interactive_auth)

# Define SQL query and datastore
sql_query = "SELECT * FROM <your-table-name>"
datastore_name = "<your-datastore-name>"

# Create TabularDatasetFactory object
tabular_dataset = TabularDatasetFactory.from_sql_query(
    datastore=ws.datastores[datastore_name],
    query=sql_query,
    validate=True
)

# Register the dataset
tabular_dataset.register(workspace=ws, name="<your-dataset-name>", create_new_version=True)

In this example, we first authenticate with Azure using the InteractiveLoginAuthentication class. Then, we define the SQL query and the name of the datastore where the data resides. We create a TabularDatasetFactory object using the from_sql_query method, passing in the datastore and the SQL query. Finally, we register the dataset with the workspace using the register method.

I hope this helps! If there are any further questions regarding the documentation, please tag me in your reply and we will be happy to continue the conversation.

SouravMalliK commented 8 months ago

Hi @Naveenommi-MSFT,

Thank you for the reply, but I was looking for something in SDK V2 (the feedback I created was on the SDK v2 documentation). I'm familiar with the snippet you shared, in-fact I already mentioned the same in my comment.

I'm consuming data from different data sources - like from flat files (mostly parquet) & Azure SQL Server and storing them as Data assets in AML workspace using azure.ai.ml.entities.Data class. I'm able to access these data as MLTables hence would like to stick to azure.ai.ml libraries for consuming data from Azure SQL server too.

I found a ARM template which creates a tabular dataset from SQL query.

{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"resources": [
{
"type": "Microsoft.MachineLearningServices/workspaces/datasets",
"name": "[concat(parameters('workspaceName'), '/', parameters('datasetName'))]",
"apiVersion": "2020-05-01-preview",
"location": "[parameters('location')]",
"properties": {
"SkipValidation": "[parameters('skipValidation')]",
"DatasetType": "tabular",
"Parameters": {
    "Query": {
        "Query": "[parameters('sqlQuery')]",
        "DatastoreName": "[parameters('datastoreName')]"
    },
    "SourceType": "sql_query"
},
"Registration": {
    "Description": "[parameters('datasetDescription')]",
    "Tags": "[parameters('tags')]"
},
"TimeSeries": {
    "FineGrainTimestamp": "[parameters('fineGrainTimestamp')]",
    "CoarseGrainTimestamp": "[parameters('coarseGrainTimestamp')]"
}}]}

I'm looking for a similar option in Python SDK v2 which would help me to create a Dataset using SQL query and then consume the dataset using a MLTable, like this --

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("dataset-name", version="1")

tbl = mltable.load(f'azureml:/{data_asset.id}')

df = tbl.to_pandas_dataframe()
Naveenommi-MSFT commented 8 months ago

@SouravMalliK Thank you for bringing this to our attention. I've delegated this to content author @AmarBadal, who will review it and offer their insightful opinions.