Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

azureml.core.Run.log_*() logs are not working in child jobs #1922

Open pezosanta opened 1 year ago

pezosanta commented 1 year ago

Hi everyone,

I am trying to build an AML pipeline for object detection/instance segmentation, where the last component would be used for training and model evaluation.

The pipeline is defined via the YAML format/schema (see below) and is run with az ml job create --file pipeline.yaml:

I want to visualize a lot of metrics in the Metrics tab of the component, such as time-series metrics (loss, F1, etc.), X/Y graphs, a confusion matrix, etc. As the MLflow API only supports time-series-like metric logging (a single metric value per iteration/epoch), I am trying to use the azureml.core.Run.log_* interface for the more advanced metrics. The problem is that these logs only end up as JSON files under Outputs + logs, if they are logged at all, and not as metrics/graphs in the Metrics tab. Here are the problematic metric logs:

The code used for these logs is as follows:

from azureml.core import Run
...

run = Run.get_context(allow_offline=False)
run.log_table("Y over X", {"x":[1, 2, 3], "y":[0.6, 0.7, 0.89]})
run.log_confusion_matrix(
        name="Confusion matrix",
        value={
            "schema_type": "confusion_matrix",
            "schema_version": "1.0.0",
            "data": {
                "class_labels": ["class1", "class2", "class3", "class4"],
                "matrix": [
                    [4, 0, 1, 9],
                    [0, 0, 0, 1],
                    [6, 0, 5, 0],
                    [0, 0, 0, 1]
                ]
            }
        }
    )
run.log_accuracy_table(
        name="Accuracy Table",
        value={
            "schema_type": "accuracy_table",
            "schema_version": "1.0.1",
            "data": {
                "probability_tables": [
                    [
                        [82, 118, 0, 0],
                        [75, 31, 87, 7],
                        [66, 9, 109, 16],
                        [46, 2, 116, 36],
                        [0, 0, 118, 82]
                    ],
                    [
                        [60, 140, 0, 0],
                        [56, 20, 120, 4],
                        [47, 4, 136, 13],
                        [28, 0, 140, 32],
                        [0, 0, 140, 60]
                    ],
                    [
                        [58, 142, 0, 0],
                        [53, 29, 113, 5],
                        [40, 10, 132, 18],
                        [24, 1, 141, 34],
                        [0, 0, 142, 58]
                    ]
                ],
                "percentile_tables": [
                    [
                        [82, 118, 0, 0],
                        [82, 67, 51, 0],
                        [75, 26, 92, 7],
                        [48, 3, 115, 34],
                        [3, 0, 118, 79]
                    ],
                    [
                        [60, 140, 0, 0],
                        [60, 89, 51, 0],
                        [60, 41, 99, 0],
                        [46, 5, 135, 14],
                        [3, 0, 140, 57]
                    ],
                    [
                        [58, 142, 0, 0],
                        [56, 93, 49, 2],
                        [54, 47, 95, 4],
                        [41, 10, 132, 17],
                        [3, 0, 142, 55]
                    ]
                ],
                "probability_thresholds": [0.0, 0.25, 0.5, 0.75, 1.0],
                "percentile_thresholds": [0.0, 0.01, 0.24, 0.98, 1.0],
                "class_labels": ["class1", "class2", "class3"]
            }
        },
        description="Some description."
    )
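
For context, a payload of the shape passed to run.log_confusion_matrix() above can be assembled from lists of true and predicted labels roughly as follows. This is only a minimal sketch: build_confusion_matrix_payload is a hypothetical helper, not part of azureml.core, and the orientation (rows are true labels, columns are predicted labels) is an assumption. The schema keys are copied from the snippet above.

```python
from typing import Dict, List, Sequence


def build_confusion_matrix_payload(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    class_labels: List[str],
) -> Dict:
    """Hypothetical helper: count (true, predicted) pairs into a square matrix.

    Assumption: rows index the true label and columns the predicted label.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Map each class label to its row/column position.
    index = {label: i for i, label in enumerate(class_labels)}
    n = len(class_labels)
    matrix = [[0] * n for _ in range(n)]

    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1

    # Same top-level keys as the dictionary logged in the snippet above.
    return {
        "schema_type": "confusion_matrix",
        "schema_version": "1.0.0",
        "data": {"class_labels": list(class_labels), "matrix": matrix},
    }


payload = build_confusion_matrix_payload(
    y_true=["class1", "class1", "class2", "class3"],
    y_pred=["class1", "class2", "class2", "class3"],
    class_labels=["class1", "class2", "class3"],
)
```

The resulting payload could then be passed as value to run.log_confusion_matrix(name="Confusion matrix", value=payload), which is the call that does not show up in the Metrics tab of the child job.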

Here are some screenshots of the Azure ML dashboard.

IMPORTANT

If I run a simple Python script as a job (so no pipeline definition etc.), the run.log_accuracy_table(), run.log_confusion_matrix() and run.log_table() metrics are logged properly.

[Screenshot: aml-simple-job]

Is this behaviour just a bug related to child jobs?