dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

Consider making DataView and IEstimator Preview methods return DataTable #2912

Open eerhardt opened 5 years ago

eerhardt commented 5 years ago

Visual Studio has the ability to visualize a data set if the object is a System.Data.DataTable.

We currently have some debugger extension methods:

https://github.com/dotnet/machinelearning/blob/b861b5d64841cbe0f2c866ee7586872aac450a51/src/Microsoft.ML.Data/DebuggerExtensions.cs#L15-L62

We should consider making these return DataTable, so VS can visualize them. Or potentially we could add ancillary methods that return DataTable, and keep the current methods returning DataDebuggerPreview, if we think the current experience has value over the visualization in VS.

eerhardt commented 5 years ago

Marking as Project 13, since these are public APIs. And if we want to change the return type, we will need to do it before v1.0.

shauheen commented 5 years ago

@Zruty0 do you know why this was done without DataTables originally?

shauheen commented 5 years ago

@asthana86

TomFinley commented 5 years ago

@Zruty0 do you know why this was done without DataTables originally?

I'm not @Zruty0 , but if I read the documentation on supported types, there are some questions I have. Most notable at least at first glance is the lack of anything resembling vector valued types with the exception of byte arrays. There are user defined types. I wonder how well they display?

Speaking of types, there seems to be a correspondence with the .NET type system, which for our purposes we found insufficient. Knowing the DataViewType of a column strikes me as somewhat important information to know. (E.g., what is the vector size encoded in this type.)

I also see the Add method can potentially throw a DuplicateNameException. This makes me wonder whether the aspect of "column hiding" as we implement it is even possible. Also I see a troubling note here about the comparisons of column names being case insensitive. Neither of these restrictions are compatible with data view.

What about metadata? I guess we could have put things into ExtendedProperties, but that seems imperfect at best.

I guess these points are more or less duplicating the same ground for why we (by which I mean Shon and everyone) found the types in System.Data unfit for our purposes and designed IDataView in the first place. If what we needed to do actually mapped naturally to those abstractions, we would have just used those abstractions and saved ourselves a bunch of time.

eerhardt commented 5 years ago

Would it be possible to keep DataDebuggerPreview as is today, but with the addition of some way of turning it into a DataTable (through another Preview method, a ToDataTable off of DataDebuggerPreview , or some DebuggerDisplayAttribute)?

I agree with your analysis above that we need to keep the current debugging mechanisms for a "full, unabridged view" of an IDataView. But being able to visualize the IDataView in a grid, while it has the shortcomings you list above, still seems valuable to a set of users.

eerhardt commented 5 years ago

@shauheen mentioned an interesting idea to me:

We could put a IDataView.PreviewDataTable() or IDataView.Preview().ToDataTable() extension method in the Microsoft.ML.Experimental nuget package. That way we could experiment with this API, but not be locked into it if it turns out to be something we don't want to support long term.

That would unblock the data visualizer - a customer could grab the ML.Experimental nuget package, call this method in the Watch window, and see the IDataView in a UI grid.

danmoseley commented 5 years ago

If you put [System.ObsoleteAttribute("Do not call directly. For Visual Studio use only", error:true)] on the method, it will be difficult for anyone to compile code against it (by default, compilation will fail). However to my knowledge the debugger will not care about it.

If you also add [EditorBrowsable(EditorBrowsableState.Never)] it will not appear in intellisense, but again, the debugger will not care.

Finally you could add _DebuggerUseOnly to the method name.

eerhardt commented 5 years ago

[System.ObsoleteAttribute("Do not call directly. For Visual Studio use only", error:true)] on the method,

Here is what I see in the Watch window intellisense when you do this: image

Notice it says [deprecated], which makes it seem like I shouldn't use it, even in the Watch window.

If you also add [EditorBrowsable(EditorBrowsableState.Never)] it will not appear in intellisense, but again, the debugger will not care.

Here is what I see in the Watch window intellisense when I have this:

    public static class SchemaExtensions
    {
        [EditorBrowsable(EditorBrowsableState.Never)]
        public static List<int> Preview(this Schema schema)
        {
            return new List<int>(new[] { 1, 2, 3 });
        }

        public static List<int> Preview2(this Schema schema)
        {
            return new List<int>(new[] { 1, 2, 3 });
        }
    }

image

Notice that I don't see the Preview method that is marked with [EditorBrowsable(EditorBrowsableState.Never)]. So users really won't be able to discover this if it doesn't show in the Watch window intellisense.