Add API to save/load models with their input schema

yaeldekel commented 5 years ago

Reasons for this are listed in issue #2663.

Currently, ModelOperationsCatalog offers the following API:

public void Save(ITransformer model, Stream stream) 
public ITransformer Load(Stream stream)

So when using a loaded model, users have to create the IDataView to be passed to the ITransformer themselves by creating a new TextLoader, (or another way?). I suggest adding these new APIs to ModelOperationsCatalog:

public void Save<TSource>(IDataReader<TSource> model, Stream stream);
public void Save<TSource>(IDataReader<TSource> reader, ITransformer model, Stream stream);
public IDataReader<TSource> Load<TSource>(Stream stream);

The last one would return a CompositeDataReader containing the loader and the ITransformer chain, so we could also add new APIs to DataOperationsCatalog to only load the reader:

public TextLoader CreateTextLoader(Stream stream);

Another option is to add an API that creates a PredictionEngine from a Stream, or an API that creates a SchemaDefinition from a Stream (that way users can use the existing API to create a PredictionEngine).

@TomFinley, what do you think?

TomFinley commented 5 years ago

Hi @yaeldekel, this this seems like a good first step. We will need at least what you've proposed here I think, so adding these would certainly not be harmful. I'd also add that saving the input schema itself in the case where the loaded data is, say, programmatically defined may also be necessary. (Sometimes you aren't loading using a loader at all, but this does not mean preserving the input schema is any less important.)

There's also a few more interesting things. You'll note the presence of this IDataReader<TSource> reader. Yet, IDataReader<in TSource> does not depend on ICanSaveModel (we made transformers descend from it, per @artidoro's #2431. It seems like it probably should.

It also seems to me that the presence of this <in TSource> is an interesting wrinkle; back when we had pipelines based on the old IDataLoader, there was no confusion about what the type of that was -- always an IMultiStreamSource. Yet in the new world, it could be practically anything. Or it could be nothing at all, but we might still want to be able to load the schema. (But this seems to me to have devolved into the dual of the point I made in the first paragraph. It seems like we'll need a way to load a model file and get only what had been the GetOutputSchema() on the reader if specified, or the data view schema if that was what was saved with the pipeline.)

Anyway: think the work you've proposed is a positive first step, and I think we should give it a shot, But it seems to me we need to develop this idea more fully. Those are just the most obvious holes in the idea I see right off the bat, there may be more, or solutions might become more obvious once we start practically working on it, as I find is often the case.

eerhardt commented 5 years ago

If this is strictly "adding" APIs, I don't think this is "Project 13" work. We can add those APIs after v1.

Do you view this as something that cannot be fixed after v1?

TomFinley commented 5 years ago

If this is strictly "adding" APIs, I don't think this is "Project 13" work. We can add those APIs after v1.

Do you view this as something that cannot be fixed after v1?

I consider the APIs need to change, since they are saving "incomplete" models. So I'd like to remove and rework the APIs in their current form, since they are leading people into "pits of failure."

dotnet / machinelearning

Add API to save/load models with their input schema #2735