Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Custom preprocessing for Monitoring in AML Example in GitHub #33910

Closed anaarmas-sys closed 7 months ago

anaarmas-sys commented 7 months ago

Hello guys! Could you help me, please? I'm trying to set up monitoring by bringing my own production data to Azure Machine Learning, using the YAML and Python scripts suggested on GitHub. When running the Python script that creates the preprocessing component with my own data, I hit this error:

```
An error occurred while calling o1100.parquet. : org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:651)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:284)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:187)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    ...etc.
```

I have been trying to understand where the error comes from. I think it comes from the line `data_as_df = spark.createDataFrame(pd.read_json(first_data_row["data"]))`:

Here is the relevant part of the Python script:

```python
import pandas as pd

# Output MLTable
first_data_row = df.select("data").rdd.map(lambda x: x).first()

spark = init_spark()
# Infer the target schema from the first record's "data" payload.
data_as_df = spark.createDataFrame(pd.read_json(first_data_row["data"]))

def transform_df_function(iterator):
    # Expand every "data" payload (a JSON string) into rows.
    for df in iterator:
        yield pd.concat(pd.read_json(entry) for entry in df["data"])

transformed_df = df.select("data").mapInPandas(
    transform_df_function, schema=data_as_df.schema
)

save_spark_df_as_mltable(transformed_df, preprocessed_input_data)
```
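A quick way to check that hypothesis in isolation, assuming `df` is the Spark DataFrame already loaded from the collected JSONL, might be:

```python
import pandas as pd

# Pull the raw "data" payload from the first row and try the pandas parse on
# its own. If pd.read_json fails here, or returns an unexpected shape, the
# schema passed to mapInPandas will not match the real data and the final
# parquet write will abort.
first_row = df.select("data").first()
payload = first_row["data"]
print(type(payload))            # expect a JSON string (or a list of strings)
sample = pd.read_json(payload)
print(sample.dtypes)            # this is the schema Spark will infer
```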

Could you provide the JSONL file used in the GitHub example?

Here is the GitHub link: https://github.com/Azure/azureml-examples/blob/main/cli/monitoring/components/custom_preprocessing/src/run.py

kashifkhan commented 7 months ago

Thank you for the feedback @anaarmas-sys. We will investigate and get back to you ASAP.

cc @azureml-github

github-actions[bot] commented 7 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

yunjie-hub commented 7 months ago

I want to clarify: when you refer to bringing your own production data, do you mean that you don't have a model deployment in AML and that you don't use MDC (https://learn.microsoft.com/en-us/azure/machine-learning/concept-data-collection?view=azureml-api-2) to collect the data, but are instead uploading model data that you collected yourself?

If that's the case, you likely need to write your own custom preprocessing component, so that you can process your data into the MLTable with your own logic.

You don't need to follow the processing logic from the example at https://github.com/Azure/azureml-examples/blob/main/cli/monitoring/components/custom_preprocessing/src/run.py; instead, you can write your own logic.
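For reference, a minimal sketch of such a component entry script, assuming the input/output signature described in the monitoring documentation (`data_window_start`, `data_window_end`, and `input_data` in; `preprocessed_input_data` out); `save_spark_df_as_mltable` here stands for the MLTable-writing helper from the example script:

```python
# Sketch of a custom preprocessing entry script; the argument names follow
# the signature the model-monitoring docs require for such components.
import argparse

from pyspark.sql import SparkSession


def run():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_window_start", type=str, required=True)  # ISO 8601
    parser.add_argument("--data_window_end", type=str, required=True)    # ISO 8601
    parser.add_argument("--input_data", type=str, required=True)         # uri_folder input
    parser.add_argument("--preprocessed_input_data", type=str, required=True)  # mltable output
    args = parser.parse_args()

    spark = SparkSession.builder.appName("custom_preprocessing").getOrCreate()

    # Your own logic goes here: read your collected data, restrict it to the
    # given time window, and flatten it into one row per inference record.
    df = spark.read.json(args.input_data)

    # save_spark_df_as_mltable is the helper from the example's run.py; reuse
    # it, or write the MLTable folder yourself.
    save_spark_df_as_mltable(df, args.preprocessed_input_data)


if __name__ == "__main__":
    run()
```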

The example there processes JSONL data where each record looks like this:

```json
{
  "specversion": "1.0",
  "id": "d9e4aa9e-8033-4793-a889-0e0d67778392",
  "source": "/subscriptions/xxxx",
  "type": "azureml.inference.model_inputs",
  "datacontenttype": "application/json",
  "time": "2023-12-14T05:50:03Z",
  "data": [
    {
      "question": "what is feature store in azure machine learning",
      "chat_history": []
    }
  ],
  "contentrange": "bytes 0-81/82",
  "correlationid": "960dfd3d-2b8a-446b-b2cb-48b36bcccc48",
  "xrequestid": "960dfd3d-2b8a-446b-b2cb-48b36bcccc48",
  "modelversion": "default",
  "collectdatatype": "pandas.core.frame.DataFrame",
  "agent": "azureml-ai-monitoring/0.1.0b4"
}
```

anaarmas-sys commented 7 months ago

Hello @yunjie-hub, thanks for asking for clarification; this was not so clear to me in the documentation at https://learn.microsoft.com/en-us/azure/machine-learning/how-to-monitor-model-performance?view=azureml-api-2&tabs=azure-cli. Let's see:

You say:

> I want to clarify: when you refer to bringing your own production data, do you mean that you don't have a model deployment in AML ...

I am working with a model deployed in AML, but without the auto-collector added in the deployment script (that is deliberate). I had interpreted it that way because the documentation contains things that seem contradictory (I think):

Let's see. One part of the document says:

> You can also set up model monitoring for models deployed to Azure Machine Learning batch endpoints or deployed outside of Azure Machine Learning. If you have production data but no deployment, you can use the data to perform continuous model monitoring. To monitor these models, you must meet the following requirements: ...etc.

and the related YAML says:

```yaml
# ...
create_monitor:
  compute:
    instance_type: standard_e4s_v3
    runtime_version: 3.2
  monitoring_target:
    ml_task: classification
    endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-detection-deployment  # <----- this
# ...etc.
```

I mean, the code includes a deployment. So the interpretation was difficult; I thought: OK, then we have at least three scenarios:

a) Monitoring with a deployment plus the auto-collector added (related to the part of the documentation titled "Set up advanced model monitoring")

b) Monitoring with a deployment but without the collector (therefore I have to bring my own production data, as a batch)

c) Monitoring a model registered in AML but not deployed (therefore I have to bring my own production data)

So, my question in this issue #33910 focuses on case b). Does this case exist?

Thanks a lot in advance!

ahughes-msft commented 7 months ago

Hi @anaarmas-sys ,

The case (b) you detailed above is one we support. We support monitoring deployments on AzureML where inference data collection is not enabled. In this case, you're responsible for collecting your own inferencing data (model inputs, outputs, etc.) and storing it in Blob storage. You can then create data assets and reference these assets as part of your monitoring configuration.

Here is the documentation for the scenario: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-monitor-model-performance?view=azureml-api-2&tabs=azure-cli#set-up-model-monitoring-by-bringing-your-own-production-data-to-azure-machine-learning

Please note that the YAML file provided in the example in the documentation above has a deployment on AzureML, but this property, endpoint_deployment_id, is not required. You can omit it if you are deploying outside of AzureML.
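For illustration, a sketch of a `monitoring_target` without that property (check the schema reference for which fields your scenario requires):

```yaml
monitoring_target:
  ml_task: classification
  # endpoint_deployment_id omitted: the model is not deployed on AzureML
```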

After you have collected your data, you will need to create a custom preprocessing component so we know how to process it and compute the monitoring signals. As Yunjie mentioned above, example logic for this preprocessing component can be found here: https://github.com/Azure/azureml-examples/blob/main/cli/monitoring/components/custom_preprocessing/src/run.py

Please adapt this example to your data scenario, depending on the schema your data is collected in. The preprocessing component must have the input and output signatures detailed in the documentation. After you have created and registered the preprocessing component, reference it as part of the data configuration in your monitoring configuration. Here is an example of how to do so in the CLI YAML, starting from the `production_data` section:
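Filled in with placeholder asset names and window size (the field structure matches the documentation and the configuration shared in the next reply):

```yaml
production_data:
  input_data:
    path: azureml:my_model_inputs_asset:1     # placeholder data asset
    type: uri_folder
  data_context: model_inputs
  data_window_size: P7D                       # placeholder window
  pre_processing_component: azureml:my_preprocessing_component:1  # placeholder
```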

anaarmas-sys commented 7 months ago

Hi @ahughes-msft! Thanks for continuing to help me.

Indeed, for case b) I have been using the YAML and run.py from the link you mentioned above, i.e. https://github.com/Azure/azureml-examples/blob/main/cli/monitoring/components/custom_preprocessing/src/run.py, and yes, in the CLI YAML I have the same structure:

```yaml
production_data:
  input_data:
    path: azureml:asset-autodataset-target-jsonl:5
    type: uri_folder
  data_context: model_inputs
  data_window_size: P00DT23H59M
  pre_processing_component: azureml:prep_data_6:38
```

And the error mentioned at the top of this issue (#33910) occurred while using the structure in run.py.

Please, could you tell me which data you are using, or where it is? I couldn't find it.

ahughes-msft commented 7 months ago

Hi @anaarmas-sys ,

The data provided in our sample is MDC-generated. You will need to adapt the run.py in the example to preprocess your data into the component output structure required by the model monitoring system. Have you adapted the run.py to preprocess your specific data? https://learn.microsoft.com/en-us/azure/machine-learning/how-to-monitor-model-performance?view=azureml-api-2&tabs=azure-cli#set-up-model-monitoring-by-bringing-your-own-production-data-to-azure-machine-learning
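For example, if your own collection writes plain JSONL (flat records, without the MDC envelope shown earlier in this thread), the adaptation can be much simpler than the sample's mapInPandas logic. A sketch, where the `timestamp` column is an assumption about your schema and `init_spark`, `save_spark_df_as_mltable`, and `args` come from the example script's structure:

```python
# Sketch for plain JSONL input: each line is already one flat inference record.
spark = init_spark()
df = spark.read.json(args.input_data)  # folder of .jsonl files

# Keep only rows inside the monitoring window handed to the component; the
# "timestamp" column name is an assumption about your collected schema.
df = df.filter(
    (df.timestamp >= args.data_window_start) & (df.timestamp <= args.data_window_end)
)

save_spark_df_as_mltable(df, args.preprocessed_input_data)
```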

If you are deploying with AzureML managed online endpoints or Kubernetes online endpoints, we highly recommend using MDC to seamlessly collect the production inference data for you. Documentation for it can be found here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-collect-production-data?view=azureml-api-2&tabs=azure-cli
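For reference, enabling MDC is a small addition to the online deployment YAML; a sketch based on the linked data-collection documentation:

```yaml
# Added to the online deployment YAML; collects inputs and outputs to Blob
# storage for monitoring (see the data-collection doc for the full schema).
data_collector:
  collections:
    model_inputs:
      enabled: 'True'
    model_outputs:
      enabled: 'True'
```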

anaarmas-sys commented 7 months ago

Hi @ahughes-msft! Thanks for the suggestion. I have successfully finished my monitoring with my own data for case b), monitoring with a deployment but without the collector. I'm working with an AzureML managed online endpoint. I think we can close this issue (#33910). TY!

And I have opened another question in ticket #34196: adding dependencies between components without needing to pass data through them. Thanks in advance!