aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

predict_fn in "Inference Pipeline with Scikit-learn and Linear Learner" nb.insert #1621

Open nabti opened 3 years ago

nabti commented 3 years ago

Hi,

Can anyone please help me work around this issue? I'm trying to use the preprocessing pipeline that relies on this function:

```python
def predict_fn(input_data, model):
    """Preprocess input data.

    We implement this because the default predict_fn uses .predict(), but our model is a
    preprocessor, so we want to use .transform().

    The output is returned in the following order:

        rest of features either one hot encoded or standardized
    """
    features = model.transform(input_data)

    if label_column in input_data:
        # Return the label (as the first column) and the set of features.
        return np.insert(features, 0, input_data[label_column], axis=1)
    else:
        # Return only the set of features.
        return features
```

Of course there are the other functions, but this is the one I'm having a problem with. When I test my preprocessing on a small dataset it works fine, but when I try the same thing on a larger amount of data I get this error:

```
axis 1 is out of bounds for array of dimension 0
```

I get the error on the line `np.insert(features, 0, input_data[label_column], axis=1)` when I run the batch transform step:

```python
transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv')

transformer.transform(train_input, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path
```

The dataset has about 800,000 rows, and some of its features go through one-hot encoding, which results in a much wider shape. The first time I tested the exact same preprocessing, the only difference was the size (400 rows instead of 800,000).

Any help? The problem is that even 800,000 rows is not the final dataset; I will have to do the same thing with around 30 million rows.

Thank you

peltapr commented 3 years ago

Hi nabti,

I have encountered the same issue and it has been driving me crazy all day. I just found a workaround, but I'm not sure how correct it is.

The issue is that for large datasets `features = model.transform(input_data)` returns a `scipy.sparse.csr.csr_matrix`, so you cannot use `np.insert` on it.
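
For reference, here is a quick reproduction outside SageMaker that shows what I think is happening (my own sketch, assuming only numpy and scipy; behavior as observed with the versions I have installed):

```python
import numpy as np
from scipy import sparse

# A sparse matrix, like the one the preprocessor returns for wide one-hot output.
features = sparse.csr_matrix(np.eye(3))
labels = np.array([0.0, 1.0, 2.0])

# np.insert starts with np.asarray(features); since the sparse matrix is not a
# regular ndarray, numpy wraps it in a 0-dimensional object array, and axis=1
# is then out of bounds:
#   AxisError: axis 1 is out of bounds for array of dimension 0
np.insert(features, 0, labels, axis=1)
```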

The workaround I have is to convert it to a NumPy array: `features = model.transform(input_data).toarray()`
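
In the notebook's `predict_fn` that looks something like this (just a sketch of my workaround, not tested beyond my own case; `label_column` is whatever the script already defines, and the `issparse` check only guards against densifying data that is already dense):

```python
import numpy as np
from scipy.sparse import issparse

def predict_fn(input_data, model):
    """Transform the input and prepend the label column when it is present."""
    features = model.transform(input_data)

    # Wide one-hot-encoded batches can come back as a scipy sparse matrix;
    # densify it so np.insert can operate on it. Keep an eye on memory for
    # very large batches.
    if issparse(features):
        features = features.toarray()

    if label_column in input_data:
        # Return the label (as the first column) and the set of features.
        return np.insert(features, 0, input_data[label_column], axis=1)
    else:
        # Return only the set of features.
        return features
```

If your preprocessor is a `ColumnTransformer`, setting `sparse_threshold=0` when you build it should force dense output in the first place, although that gives up the memory savings of the sparse representation.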

Did you find any other solution?

Cheers

nabti commented 3 years ago

Hi @peltapr, yes, I went with `.toarray()` as you suggested. I didn't find a better solution.