Passing the prediction dataframe to Serializer along with the pipeline model

jeffsaremi commented 5 years ago

I don't understand the connection between the prediction results (from a call to model.transform()) and serialization of a model (created from a call to pipeline.fit()).

Is this prediction set saved and used later when I deserialize my model as an MLeap pipeline? (see the code below)

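import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import PipelineModel
spark_prediction = model.transform(test_data)
model.serializeToBundle(model_zip_url, spark_prediction)
mleap_pipeline = PipelineModel.deserializeFromBundle(model_zip_url)
mleap_prediction = mleap_pipeline.transform(test_data)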

Does mleap_prediction actually use the saved spark_prediction?

Can these two be different: the test_data passed to create spark_prediction and the test_data passed in the call to mleap_pipeline.transform()?

In the call to serializeToBundle() can I just pass one single record as the test_data?

What is the significance of the predicted data and what does MLeap do with that data?

Can I serialize a model without passing any predicted data?

thanks

ancasarb commented 5 years ago

Hey @jeffsaremi, see my answers to your questions below. Let me know if they make sense.

Does mleap_prediction actually use the saved spark_prediction?

No, spark_prediction is used strictly for the pipeline serialization.

Can these two be different: the test_data passed to create spark_prediction and the test_data passed in the call to mleap_pipeline.transform()?

Yes. You can transform any dataset with the deserialized pipeline, mleap_pipeline.

In the call to serializeToBundle() can I just pass one single record as the test_data?

The transformed data frame is mostly used to extract data types and some metadata, so you could try with just one record and see how it goes.
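A minimal sketch of the one-record idea, assuming the MLeap PySpark imports have already been run so that serializeToBundle is available on the fitted model. The helper name serialize_with_sample and its n_rows parameter are hypothetical, not part of MLeap; whether a single row is enough for every pipeline stage is something to verify on your own pipeline.

```python
# Assumption: the caller has already executed the MLeap PySpark imports,
# which attach serializeToBundle to fitted Spark models:
#   import mleap.pyspark
#   from mleap.pyspark.spark_support import SimpleSparkSerializer


def serialize_with_sample(model, data, bundle_uri, n_rows=1):
    """Serialize a fitted PipelineModel using only a few transformed rows.

    Hypothetical helper: `model` is a fitted pyspark.ml PipelineModel,
    `data` is any DataFrame with the pipeline's input columns, and
    `bundle_uri` is the bundle location passed to serializeToBundle.
    """
    # Keep only n_rows input rows; the transformed frame is passed to the
    # serializer for its data types and metadata, not as stored predictions.
    sample_prediction = model.transform(data.limit(n_rows))
    model.serializeToBundle(bundle_uri, sample_prediction)
    return sample_prediction
```

Calling serialize_with_sample(model, test_data, model_zip_url) would replace the first two lines of the snippet in the question; the deserialized pipeline can then be applied to the full test_data.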

What is the significance of the predicted data and what does MLeap do with that data?

The transformed data frame is used to extract data types and other metadata required for execution so that they can be serialized.

Can I serialize a model without passing any predicted data?

No, see above.

tovbinm commented 5 years ago

I looked into the MLeap code, and it seems that the transformed DataFrame is only used to get its schema (a StructType). I propose replacing the requirement to pass a DataFrame with passing just the schema StructType.
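To see what that schema carries, here is a small sketch that lists the columns of a transformed DataFrame without reading any rows. The helper name describe_schema is hypothetical; it only uses the standard PySpark schema attributes (df.schema.fields, and each field's name, dataType, nullable, and metadata).

```python
def describe_schema(df):
    """Summarize a DataFrame's schema as plain tuples.

    Returns one (name, type, nullable, metadata keys) tuple per column.
    Only the StructType attached to the DataFrame is inspected, so no
    Spark job is triggered.
    """
    return [
        (
            field.name,
            field.dataType.simpleString(),  # e.g. "double", "vector"
            field.nullable,
            sorted(field.metadata or {}),   # ML attribute metadata keys, if any
        )
        for field in df.schema.fields
    ]
```

For example, describe_schema(spark_prediction) on the frame from the original question would show the input columns plus whatever output columns the pipeline stages add.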

WDYT? @ancasarb

combust / mleap #488