dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.89k forks source link

GetColumn method is extension for IDataView but not mlContext. #2473

Closed Ivanidzo4ka closed 5 years ago

Ivanidzo4ka commented 5 years ago

https://github.com/dotnet/machinelearning/issues/1708 moved CreateEnumerable from extension on top of DataView to become extension on mlContext object.

We still have GetColumn extension https://github.com/dotnet/machinelearning/blob/7fc7e50ce6f8fed24fc0b9528839a0ac8d0ed320/src/Microsoft.ML.Data/Utilities/ColumnCursor.cs#L25 which works on top of DataView. Shall we move it to mlContext?

var enumerable = mlContext.CreateEnumerable<SamplesUtils.DatasetUtils.SampleTemperatureData>(filteredData, reuseRowObject: true); var originalColumnBack = transformedData_default.GetColumn<VBuffer<ReadOnlyMemory<char>>>(mlContext, defaultColumnName);

They look silly together.

Ivanidzo4ka commented 5 years ago

Can I have @TomFinley approval of this issue?

TomFinley commented 5 years ago

I think so. What is our intent with this API? I see a couple problems:

Usage of IHostEnvironment which you have identified. One thing that I'd like to explore here is whether IHostEnvironment is really necessary at all? I understand the infrastructure underlying it thinks it needs it, but maybe it kind of doesn't. In which case it could continue to be extension method over IDataView, just not one that takes an IHostEnvironment.

Generally IHostEnvironment is something useful for component authors, and other "deep" scenarios. This however seems like a very "casual use" scenario where the benefits of plumbing thorugh an IHostEnvironment are less clear. (The IDataView implementation itself surely already has an IHost.)

What do we think about that?

TomFinley commented 5 years ago

Also, I think this extension like other accessors on columns should take the Schema.Column.

Ivanidzo4ka commented 5 years ago

My intent is to have consistency in across our APIs. Main reason why I wrote this issue is this https://github.com/dotnet/machinelearning/issues/2466 issue. It feels weird to have one way to extract data be part of mlContext and another is part of IDataView. Also I remember we had "mlcontext to rule them all" issue, but I can't find it anymore.

From what I see in code IHostEnvironment used purely as IExceptionContext. I doubt I can get to IHost in IDataView (as implementation we probably have it, but not as interface, unless I would run reflection magic to inspect it runtime, but no).

So from what I'm hearing from you, you more leaning towards leave it as part of IDataView extension, and remove IHostEnvironment requirements from it (and change from columnName to Schema.Column) am I correct?

TomFinley commented 5 years ago

My intent is to have consistency in across our APIs. Main reason why I wrote this issue is this #2466 issue. It feels weird to have one way to extract data be part of mlContext and another is part of IDataView. Also I remember we had "mlcontext to rule them all" issue, but I can't find it anymore.

You are thinking of #1098. But that is creation of components. But what is being created? Usually such things as estimators, some conveniences to make data views. Yet, we do not involve MLContext when it comes to creation of ITransformer implementors. That is what the estimators (created with the MLContext) are for.

Likewise, and most relevant ot this point, we do not use MLContext to get data out of an existing IDataView. Here the IDataView already exists. We use cursors and whatnot, directly accessible off IDataView itself for that. And so I think we should continue to do if possible.

Indeed, I have observed a disturbing misconception about the intent of MLContext, as we see in #2672. We do not use MLContext because we want to embrace the "factory-of-everything" antipattern, where literally everything to do anything goes through this thing. We use it because of the things that need to have IHostEnvironment, we don't want to annoy the user with the IHostEnvironment. That is the purpose of the object, and all the purpose of that object. Otherwise, we should just make our API act like normal objects.

As far as I can tell, there is no genuine, real need to have an IHostEnvironment here, any more than there is a need for IDataView.GetRowCursor to accept an IHostEnvironment. (It may be that the code as its stands right now thinks it wants an IHostEnvironment. I might make it think otherwise.) I think just getting data out of a data view in a structured way ought not to involve too much appeal to separate IHostEnvironments.

Let me know if you feel differently, but I have some conviction on this point.