Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

AML: When using ParallelRun with a Tabular Dataset, is the delimiter always a `space`? Clarify documentation/example #1486

Open hamelsmu opened 3 years ago

hamelsmu commented 3 years ago

@keijik @cody-dkdc @gregce

In this example notebook, the delimiter in the output file is shown as a space. Using a space as a delimiter seems like a really dangerous choice. Can you change what the delimiter is? If so, how? From this example, it looks like a space is the default delimiter for tabular datasets, which seems problematic.


The scoring script for this example is below. As you can see, a space is not specified anywhere in the scoring script, so where does this delimiter come from? If this is the default delimiter, I think it is worth explaining.

iris_score.py

import io
import pickle
import argparse
import numpy as np

from azureml.core.model import Model
from sklearn.linear_model import LogisticRegression

from azureml_user.parallel_run import EntryScript

def init():
    global iris_model

    logger = EntryScript().logger
    logger.info("init() is called.")

    parser = argparse.ArgumentParser(description="Iris model serving")
    parser.add_argument('--model_name', dest="model_name", required=True)
    args, unknown_args = parser.parse_known_args()

    model_path = Model.get_model_path(args.model_name)
    with open(model_path, 'rb') as model_file:
        iris_model = pickle.load(model_file)

def run(input_data):
    logger = EntryScript().logger
    logger.info("run() is called with: {}.".format(input_data))

    # make inference
    num_rows, num_cols = input_data.shape
    pred = iris_model.predict(input_data).reshape((num_rows, 1))

    # cleanup output
    result = input_data.drop(input_data.columns[4:], axis=1)
    result['variety'] = pred

    return result

benchwang commented 3 years ago

@hamelsmu, you're right that the default delimiter is a space.

If you want to change it, you can create a file named parallel_run_step.settings.json in the source directory (i.e., beside your iris_score.py here) and add the following setting to the file:

{
    "append_row": {
        "pandas.DataFrame.to_csv": {
            "sep": ","
        }
    }
}

The above sample instructs the output CSV to use "," as the separator. You can change it to whatever you want, as long as it is a valid separator for pandas.
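As a rough illustration of the setting above, the sketch below writes the same JSON to a settings file and then shows what the `sep` argument does when passed to `pandas.DataFrame.to_csv` directly. The temporary directory stands in for the real source directory, and the sample frame and the `header=False, index=False` arguments are assumptions made for the illustration; they are not taken from how ParallelRunStep calls pandas internally.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Same structure as the JSON shown above.
settings = {"append_row": {"pandas.DataFrame.to_csv": {"sep": ","}}}

# Hypothetical source directory; in a real project this would be the
# folder that contains iris_score.py.
source_dir = Path(tempfile.mkdtemp())
settings_path = source_dir / "parallel_run_step.settings.json"
settings_path.write_text(json.dumps(settings, indent=4))

# Illustration only: pass the configured arguments straight to to_csv
# to see the effect of "sep" on a small, made-up frame.
to_csv_kwargs = settings["append_row"]["pandas.DataFrame.to_csv"]
sample = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "variety": ["Setosa", "Setosa"]}
)

# header=False and index=False are assumptions chosen for readability.
space_separated = sample.to_csv(sep=" ", header=False, index=False)
comma_separated = sample.to_csv(header=False, index=False, **to_csv_kwargs)

print(space_separated)
print(comma_separated)
```

Any other value that `to_csv` accepts for `sep`, such as a tab, could be placed in the settings file in the same way.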

Thanks!