Interpreting a pipeline's resulting Schema and/or .Preview()

nganju98 commented 5 years ago

I've been going over this article to learn how to inspect data after the preprocessing pipeline: https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/inspect-intermediate-data-ml-net

Coming from Python, where you have a dataframe and you can just do dataframe.show(), this is quite an ordeal. CreateEnumerable() is pretty impractical because your original POCO won't fit the schema any more if you've done one-hot encoding etc. One-hot encoding creates multiple columns with the same name. How can that ever map back to a POCO?

And there's a mandatory Features column at the end of the row, that repeats all the values in the row. And if you then NormalizeMinMax the Features column, you get ANOTHER Features column with the same name. Then you're trying to look at that in a Preview() and it's very confusing to see what's going on.

Then there's DataViewRowCursor, where you need to specify reflection-style getters for each column. So each time you tweak the pipeline, you have to rewrite the code that lets you see the results of your pipeline? That defeats the ability to quickly tweak and look, tweak and look, the way Python does it so simply with dataframe.show().

So assuming this is how we have to inspect our data, I've got two questions:

1) When a Transform (like OneHotEncoding) creates multiple columns with the same name, what's going on? Is the training algorithm going to look at all of them? Is IsHidden how it decides what to use? If so, can there be some official documentation on how all of this works?

2) Why is it our responsibility to create a Features column at all? Can't the algorithms just run on the IDataView we created, like in Python? It seems like a complicated and unnecessary step. Also, is it good OO design to have a Features field at the end of a row that repeats all the values in that row? If you need to build this Features vector for performance reasons, why not create it once Fit() is called, keep it out of our data table, and hide this implementation detail from the user?

acrigney commented 5 years ago

Excellent question. Are the old feature columns going to be ignored during processing, or do we have to remove them? Also, if you remove the old feature columns, then when you do a prediction you get an error saying that the removed feature columns are not there. I am trying to dig through the code now.

antoniovs1029 commented 4 years ago

Hi @nick-ganju, answering your questions:

Yes, the IsHidden property of the column is used to decide which column to use. When a Transformer adds a new column with the same name as an existing column, the existing column is "hidden": the IsHidden property of its DataViewSchema.Column is set to true (link to code). Hidden here means that any later transform only uses the non-hidden columns. Calling .Preview() on a DataView, however, shows all the columns, including the hidden ones, because its purpose is debugging, and the hidden columns are usually intermediate steps toward the final result in the non-hidden columns. Also, since the hidden columns are actually needed for those intermediate steps, removing them might cause an exception (as attempted by @acrigney). Hiding columns is briefly explained in the docs inside the repository (link to explanation), but I agree that it isn't a very visible place and that the topic is somewhat confusing, so I will open an issue with our docs team to see if this can be added in other places.
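To see the hiding for yourself, here is a minimal sketch that lists every column of the output schema together with its IsHidden flag. The `Row` class, its sample values, and the three-step pipeline are made up for illustration (they mirror the one-hot encode, concatenate, normalize sequence discussed in this thread), and the sketch assumes the Microsoft.ML NuGet package is referenced.

```csharp
using System;
using Microsoft.ML;

// Hypothetical input type, used only for this illustration.
public class Row
{
    public string Color { get; set; } = string.Empty;
    public float Size { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up sample data.
        IDataView data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Color = "red", Size = 1f },
            new Row { Color = "green", Size = 2f },
            new Row { Color = "red", Size = 3f },
        });

        // Each step writes to a column name that already exists,
        // so the earlier column of that name should become hidden.
        var pipeline = mlContext.Transforms.Categorical.OneHotEncoding("Color")
            .Append(mlContext.Transforms.Concatenate("Features", "Color", "Size"))
            .Append(mlContext.Transforms.NormalizeMinMax("Features"));

        IDataView transformed = pipeline.Fit(data).Transform(data);

        // The schema enumerates all columns, hidden or not.
        foreach (DataViewSchema.Column column in transformed.Schema)
        {
            Console.WriteLine(
                $"{column.Index}: {column.Name} ({column.Type}) hidden={column.IsHidden}");
        }
    }
}
```

If the explanation above holds, each repeated name (Color, Features) should appear several times in the listing, with only the last column of each name reporting hidden=False. Because the loop only reads the schema, it does not need to change when the pipeline changes.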
Unfortunately, there's no current plan to change the API, but we'll consider your suggestions for future changes, since I think they're valid concerns about the usability of ML.NET. However, notice that IDataViews are lazily evaluated (as stated in the tutorial you've linked to), so when cursoring over them you only compute one row at a time. There's no actual "data table" stored in memory containing all the rows. Instead, the DataView holds the information on how to compute all columns (hidden or not) of a given row, and the values are only computed if the user tries to access them (e.g., through a prediction engine, a cursor.GetGetter, or .Preview()). Since it only computes and stores values when asked to, it wouldn't really keep all of the hidden columns of a given row in memory, and so it doesn't actually have "a Features field at the end of a row that repeats all the values in said row" (because, again, DataViews don't store fields with values). It's only when using .Preview() that the values of hidden columns are actually saved in memory for different rows.
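As a rough illustration of the lazy evaluation described above, the sketch below opens a cursor that requests only the visible Features column and reads it one row at a time. As before, the `Row` class, the sample values, and the pipeline are hypothetical and exist only to make the example self-contained; it assumes the Microsoft.ML NuGet package is referenced.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input type, used only for this illustration.
public class Row
{
    public string Color { get; set; } = string.Empty;
    public float Size { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up sample data.
        IDataView data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Color = "red", Size = 1f },
            new Row { Color = "green", Size = 2f },
            new Row { Color = "red", Size = 3f },
        });

        var pipeline = mlContext.Transforms.Categorical.OneHotEncoding("Color")
            .Append(mlContext.Transforms.Concatenate("Features", "Color", "Size"))
            .Append(mlContext.Transforms.NormalizeMinMax("Features"));

        IDataView transformed = pipeline.Fit(data).Transform(data);

        // Looking a column up by name returns the visible (non-hidden) one.
        DataViewSchema.Column features = transformed.Schema["Features"];

        // Ask the cursor for this one column only; nothing is read
        // until MoveNext() and the getter are called.
        using (DataViewRowCursor cursor = transformed.GetRowCursor(new[] { features }))
        {
            ValueGetter<VBuffer<float>> getter = cursor.GetGetter<VBuffer<float>>(features);
            VBuffer<float> value = default;

            while (cursor.MoveNext())
            {
                getter(ref value); // fills the buffer for the current row
                Console.WriteLine(string.Join(", ", value.DenseValues()));
            }
        }
    }
}
```

Only the current row's Features vector is held in `value` at any time, which is the sense in which there is no stored "data table" behind the DataView.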

dotnet / machinelearning

Interpreting a pipeline's resulting Schema and/or .Preview() #4023