dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.89k forks source link

SchemaShape.Column constructor is internal??? #3243

Closed zeahmed closed 5 years ago

zeahmed commented 5 years ago

IEstimator is a public interface that allows to create new estimators in ML.NET using NuGet only. The interface is public which defines two methods Fit and GetOutputSchema.

https://github.com/dotnet/machinelearning/blob/1724da898d8cff88543fcf9e7b356ef0989b7bf7/src/Microsoft.ML.Core/Data/IEstimator.cs#L301

The GetOutputSchema method takes SchemaShape as input. The only way to create SchemaShape is through its constructor that takes SchemaShape.Column as input.

https://github.com/dotnet/machinelearning/blob/1724da898d8cff88543fcf9e7b356ef0989b7bf7/src/Microsoft.ML.Core/Data/IEstimator.cs#L129

However, the only constructor of SchemaShape.Column is internal which blocks external user to create new estimators.

https://github.com/dotnet/machinelearning/blob/1724da898d8cff88543fcf9e7b356ef0989b7bf7/src/Microsoft.ML.Core/Data/IEstimator.cs#L62

How can I create a new estimator using NuGet only?

TomFinley commented 5 years ago

That is intentional, since we did not seem to have a scenario that required it, and it is a fairly large API to reveal its public surface (on the order of the same size of DataViewSchema).

The scenario you've outlined, creating a new estimator from scratch, is not something we decided to support for v1. (Even if this issue would be solved, you'd face another problem once you had the job of creating an ITransformer implementation (assuming you also wanted to implement one of those from scratch), is that while the interface ICanSaveModel is a public type, and ModelSaveContext is a public class, no methods on that class are accessible. (This done deliberately.)

This lack of extensibility at the IEstimator and ITransformer level is deliberate since these mechanisms (of which SchemaShape is only one of several, there's as mentioned model saving, model loading, which is a whole other kettle of fish) is not something we had time to refine prior to v1.

You could I guess build an IEstimator that returns some existing ITransformer instance rather than write one of those from scratch, but then in such a case as that you wouldn't need to create novel transform estimator instances anyway.

I wonder if the custom transformer, which is our current "extensibility" story, works for whatever you have in mind?

jeremytieman commented 5 years ago

No, the custom transformer does not work for our story. We are specifically looking to extend ML.NET for third-party customers to easily incorporate a new piece of functionality into their pipelines. We want to be able to tell people to just place the new Estimator/Transformer into their pipeline and have it work.

wschin commented 5 years ago

@jeremytieman, With our new type register system, the custom transformer can handle aribitrary C# classes now. Please take a look at this example.

Since we intentionally hide SchemaShape.Column, I think this issue can be closed. Please feel free to reopen it if you still have concerns.

baruchiro commented 5 years ago

From this discussion, I don't understand how I can use GetOutputSchema?

Let's say I want to convert all int columns to single, for example. If I'm not in the first estimator, I'm required to remember all the transformations I concatenated, to remember which columns are currently int.

Or I can get the outputSchema and choose the relevant columns from it.

I suggest at least make the following method public:

https://github.com/dotnet/machinelearning/blob/fed45bb7bed4eaf7c7b0e0bd43dde4f6e334f6e0/src/Microsoft.ML.Core/Data/IEstimator.cs#L174-L201