
Estimator arguments should take output column name as first parameter, any inputs as subsequent parameters #2064

Closed TomFinley closed 5 years ago

TomFinley commented 5 years ago

So, a somewhat embarrassing blunder to report... not quite sure how to say this.

Throughout our codebase, for years we've always specified the name of the output column of a transform first, then the source(s).

That's good. But when estimators were introduced, somehow, nearly all of them came out in the reverse order: they specify the inputs, then the outputs. This was probably an unconscious mistake, but it's one with fairly wide consequences, since that mistaken pattern was copied again and again as people did the estimator conversion work, to the point where most (not all!) estimators are now inconsistent with the whole rest of the codebase!

This should be corrected: it is at the very least inconsistent, and even apart from consistency it is rather obnoxious practically, because specifying the name of the output first has practical benefits, and makes a lot more sense, since if you're specifying a transformation the most important information someone will want to know is what you're calling your output!

The Story Until a Few Months Ago

So, throughout our codebase, for years, it has been our practice that when specifying a transform, you specify the name of the output, then the name of the input(s) (if any). The reason for this is practical: the outputs are usually the result of a single well defined calculation (the application of the transform), whereas what is taken as an "input" to a transform can have various meanings, since a transform is more often multivariate in its inputs than in its outputs. (E.g., concatenate can have multiple inputs, text can have multiple inputs, MI feature selection takes multiple heterogeneous inputs, but all produce single output columns.)
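For illustration, here is a minimal hypothetical sketch of that convention (the type and member names below are invented for this example, not actual ML.NET API):

// Hypothetical, output-first: the single required output name leads, and
// the variable number of source columns trails naturally as params.
public sealed class ConcatSketch
{
    public string Name { get; }       // the one output column
    public string[] Sources { get; }  // any number of input columns

    public ConcatSketch(string name, params string[] sources)
    {
        Name = name;
        Sources = sources;
    }
}

// Usage: the output "Features" is the headline; the inputs follow.
// var concat = new ConcatSketch("Features", "Age", "Height", "Weight");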

This was our practice certainly when this code was a tool, as we see in the command line help string, specifying name then source.

https://github.com/dotnet/machinelearning/blob/bb46fdf405828134de9feebce74f21b4aacb15ed/src/Microsoft.ML.Data/Transforms/ColumnCopying.cs#L109-L110

This trend continued during our initial attempts at an API, as seen in this early discussion in PR 405, and as seen here:

https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/ConvertTransform.cs#L116

and here

https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/DropSlotsTransform.cs#L226

and numerous other places.

This is done for the practical reason that, when a transform produces an output, what outputs it has are usually finite and well defined, whereas it can take multiple inputs. The most conspicuous and widely used example of this is the concatenation transform. Also included are things like the text featurizing transform, and other such things.

So far so good...

But Then...

Now, somehow, through some mechanism that wasn't quite clear to me, as IEstimator implementations were being created, someone committed the fateful mistake of reversing inputs and outputs. Suddenly, instead of being typically parameterized as a name and a sometimes optional source, we instead have a required input and a sometimes optional output. What a disaster! And I did not catch it in review. An early example:

https://github.com/dotnet/machinelearning/blob/c8de311476eb04dd4c8be1ad5b898487d66c9ef5/src/Microsoft.ML.Transforms/Text/TextTransform.cs#L278-L279
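To make the inversion concrete, a hypothetical side-by-side (invented signatures, not the real ones):

public static class OrderingSketch
{
    // Longstanding transform convention: output name first, then source.
    public static void OldStyle(string name, string source = null) { }

    // The pattern the estimators picked up instead: input first, output
    // second, and often optional (defaulting to the input name).
    public static void NewStyle(string inputColumn, string outputColumn = null) { }
}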

Then, as people used this as a template for their own estimator conversion work, nearly all estimators copied this mistake, until practically all of them had this problem. This includes much of the extensions.

https://github.com/dotnet/machinelearning/blob/d9270c9c42da70817e8a71e39a069d2339f6972d/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs#L30-L31

Now then, the reason why I know any of this is that @stephentoub wrote and said, "hey, how come you have your column concatenation operation specify output then inputs? That's different from everywhere else! I know it's less convenient, but it's better to be consistent."

https://github.com/dotnet/machinelearning/blob/d9270c9c42da70817e8a71e39a069d2339f6972d/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs#L39

And, after thinking that he must be very confused (since, again, the pattern of name then source is very, very well established throughout the codebase), I looked and found he is absolutely correct: a thoroughly massive blunder had somehow made it throughout the entire codebase under our very noses, including, it must be said, mine. :) :)

So, this is bad, but happily this was caught before v1. But, needless to say this must be fixed immediately.

justinormont commented 5 years ago

Semi-related: We may also want to take a pass at ensuring the parameter names are consistent, as I expect this to be difficult to correct after v1.0. I assume the work would be creating a large spreadsheet with rows of components & columns of parameter names (and ensuring we don't create a new column needlessly).

Ivanidzo4ka commented 5 years ago

I would like to disagree. (With the idea of output first, inputs later.)

The reason for this is practical: the outputs are usually the result of a single well defined calculation (the application of the transform), whereas what is taken as an "input" to a transform can have various meanings, since a transform is more often multivariate in its inputs than in its outputs.

I tend to believe the reason behind name -> source is the command line, which forces you to express yourself in a kind of whiteboard way, so it's somewhat natural to write y = func(x1,x2,x3), where you specify y first and then you specify the rest.

For a programmatic API in C#, I find the idea of writing the output parameter first, and then everything else, bizarre and against nature. I look at transformers as functions on top of data. https://docs.microsoft.com/en-us/dotnet/api/system.func-6?view=netframework-4.7.2 Func<in T1,in T2,in T3,in T4,in T5,out TResult> I don't see this idiom of putting the output first in the way C# views functions in general, and I would prefer the current order of most estimators to remain.

TomFinley commented 5 years ago

For a programmatic API in C#, I find the idea of writing the output parameter first, and then everything else, bizarre and against nature. I look at transformers as functions on top of data. https://docs.microsoft.com/en-us/dotnet/api/system.func-6?view=netframework-4.7.2 Func<in T1,in T2,in T3,in T4,in T5,out TResult> I don't see this idiom of putting the output first in the way C# views functions in general, and I would prefer the current order of most estimators to remain.

Hi @Ivanidzo4ka ! Thanks for your feedback. I am not sure exactly what you mean. Are we talking about the same language? Indeed, let's take your precise example. The signature of that is very clear.

public delegate TResult Func<in T1,in T2,in T3,in T4,in T5,out TResult>(T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5);

I see a TResult first, then all these T1 and so on defined after that in the actual language syntax (whatever they chose to do with the generic parameters). I'm not trying to put you on the spot or anything, but I mean, it's right in the link you gave. Then I see the inputs. Indeed, every function declaration is like this in C#. The return type is first. Variable assignment is the same. In C#, Java, C++, C, Python, Ruby, it is something like this:

int a = b + 2;

not this

b + 2 => int a;

Indeed, most languages I can think of when assigning variables... we have F# with its let a = b + 2, other similar things elsewhere. I mean, I guess back when I was a kid I used HyperTalk which had syntax like put 5 * 4 into theResult, but I can't say too many other examples come to mind.

This is of course assuming I accept the idea that we have to be analogous to a programming language, which I'm not sure I do; but if I did, it certainly appears at first glance to be utterly disastrous to your case. But in any case, generally I think the most important information should go first. What I want to know out of every transform is "what are you making, and how are you making it," not "how are you making something, and what is it called?"

This is in addition to the advantages already enumerated originally, the consistency argument, etc., plus the fact that when transforms take multiple inputs, many times those auxiliary helper inputs are optional (so should have null defaults), but the name of the output never is. So, I'm not sure what you're talking about. Without assigning blame, and without belaboring the point, this was clearly a blunder. A massive error. Let's fix it, please.
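To illustrate that last point with a hypothetical signature (invented for this example): with output-first, every required parameter naturally precedes every defaulted one, which is exactly what C# optional parameters require.

public static class SelectionSketch
{
    public static void MutualInfoSelect(
        string name,                   // required: the output always exists
        string labelColumn = "Label",  // optional auxiliary input
        string source = null)          // optional: null means "same as name"
    { }
}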

Semi-related: We may also want to take a pass at ensuring the parameter names are consistent, as I expect this to be difficult to correct after v1.0. I assume the work would be creating a large spreadsheet with rows of components & columns of parameter names (and ensuring we don't create a new column needlessly).

@justinormont that's a good point actually, even for this. (I mean, you're right in the broader sense, but I'm just going to stay on topic.) Not only did we screw up the ordering of the column names, we even screwed up their names. For years, even with one-to-one transforms, these have been named name (for the name of the output) followed by source (for the name of the single source, assuming there was a single source, which there often was not).

We have changed this to input and output, which is just confusing, because "input" and "output" in the case of an IEstimator and ITransformer are already very well defined, intuitive concepts. The input to an estimator is the IDataView provided to fit, and the output is the ITransformer. The input to every ITransformer is the data to be transformed, and the output is the transformed data. To reuse the same terms for this other thing when we already had a perfectly serviceable name is bad.

So, in addition to changing the order: these parameters should be named, as they are literally everywhere else, name and source, for the sake of sanity and consistency.

stephentoub commented 5 years ago

Indeed, every function declaration is like this in C#

It's not about return values, it's about parameters and the order used in .NET APIs in general when an API has some kind of source parameter and some kind of destination parameter.

For example:

TomFinley commented 5 years ago

But this is my point. It's not always about a "source" and a "destination." There is almost always a destination, but the "source" is sometimes indeterminate, multiple, etc. We did not settle on this convention by accident.

You'll note also that these are all about filling in values in an already allocated buffer. This is not that. It's more like, "we're declaring this new slot named name and please fill it in with an operation parameterized in the following way with many parameters, some of which may be source columns from which values are created."

Ivanidzo4ka commented 5 years ago

but the "source" is sometimes indeterminate, multiple, etc.

This is why I tried to bring up Func as an example, but I guess I wasn't clear. Func is the exact same case, right? You have multiple inputs and one output, but if you look, all of them are defined as Func<in T1, out TResult>, Func<in T1, in T2, out TResult>, and not Func<out TResult, in T1, in T2>. And in general I find the idea of Func(destination, source1, source2, source, etc.) counterintuitive. And I see the same words in the mail which sparked this issue. Why do we need to change all the other code in our codebase instead of having a consistent, clean pattern of input first, output second?

TomFinley commented 5 years ago

but the "source" is sometimes indeterminate, multiple, etc.

This is why I tried to bring up Func as an example, but I guess I wasn't clear. Func is the exact same case, right? You have multiple inputs and one output, but if you look, all of them are defined as Func<in T1, out TResult>, Func<in T1, in T2, out TResult>, and not Func<out TResult, in T1, in T2>

Thank you @Ivanidzo4ka for trying to explain to me again. I think I do understand the example, I am trying to explain gently that it points in a different direction than you suppose. I'll go over a separate point from my original one, in case it helps illustrate why.

Let's actually look at a few of these extension methods; in particular, the string-input, string-output ones that you claim were done to be consistent with this more "intuitive" understanding based on Func<>. I will do so by searching the codebase for the string "(this TransformsCatalog" (in case you want to repeat the experiment), then looking through the results and finding the ones with the string inputColumn, string outputColumn pattern, or something similar.

Since this involves an enumeration and examination, this response will be somewhat long, I'm sorry to say. I titled the examples so they're easy to browse, and put my conclusion at the thing titled "Conclusion" more towards the end.

Simple Example 1

The first one I see is Hash. It is this.

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs#L30-L31

I do not see the input parameters, followed by an output. Instead, I see one input parameter, followed by the output, followed by two more input parameters.

Simple Example 2

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs#L66-L67

One input parameter, the output parameter, followed by one more input parameter.

Complex Example 1

Actually, now that I'm looking, there are a few more here that are worthy of inspection in this file; despite not directly filling my pattern, they indirectly do if we look at them. They also indirectly illustrate the practical reasons why we years ago arrived at the pattern that we did. These are, if anything, even more confused, since they have neither input nor output first!

This is one. You will see we have the column definition first, which happens to be a ColumnInfo. It is parameterized in this way.

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ValueToKeyMappingTransformer.cs#L184-L188

So this again has the pattern of input, followed by output, followed by four more inputs. The same confused mixing.

But let's return again to the originally linked code... we have more parameters. What are they? The name of the input file, followed by the name of the input column from that file, followed by how to load this file. So we have more input "columns" being calculated. So not only are we drawing an artificial distinction between the input column and other inputs to the estimator, we are not even being consistent with the input columns themselves.

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs#L122-L124

Complex Example 2

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs#L140-L144

Here we've arrived at something else. This actually fulfills the pattern that you claimed you wanted, in which the output is actually specified last!! Let me ask you this. Do you like it? 😄 Is it consistent with anything? Would we want anything to be consistent with it?

Simple Example 3

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs#L21

This one actually fills your pattern, but only by virtue of having no parameterization except the name of the input, which is far from typical.

There are a few more in here, but they're somewhat similar and confused (keep and drop which are kind of special in the sense that they describe only inputs, or only outputs, depending on how you look at it. Also actually concatenate, which is one of the few conversions that actually wound up being correct, and which sparked this conversation.)

Simple Examples Batch 4

Next up, normalizers. Only one here that directly fulfills the pattern.

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/NormalizerCatalog.cs#L30-L32

Same confused pattern. Input, output, input.

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Transforms/Text/TextCatalog.cs#L45-L48

Ooo. Inputs (as IEnumerable), output, more input (or more like a collection of input).

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Transforms/Text/TextCatalog.cs#L59-L62

Same confused pattern. Input, output, input.

Conclusion

I hope you don't mind; I had intended to keep going, but it's the kids' bedtime, and I think by this point it's getting fairly redundant and I'm starting to belabor the point. So, this is what you said:

And in general I find the idea of Func(destination, source1, source2, source, etc.) counterintuitive.

That's fine. But if I were to actually put the destination last, past all other things we could reasonably call inputs, then the result would not be what we really want, since many of these "functional inputs" have default values, etc. You seem to favor neither approach. Instead, what you favor, to judge from your vigorous defense, is something chimerical: Func<TIn1, TResult, TIn2, TIn3> or Func<TIn, TResult, TIn2>. I mean, we could continue -- maybe we want to argue that some inputs go before the result, some inputs go after, but, let's please not. It's been a long enough day already.

And I see the same words in the mail which sparked this issue. Why do we need to change all the other code in our codebase instead of having a consistent, clean pattern of input first, output second?

As we see in the original text of the issue, this is somewhat reversed. This new pattern introduced a few months ago is inconsistent with our longstanding practices. The mail is informative in pointing out the discrepancy and other issues (which we must fix), but we must decide how. And, sometimes (in fact, always) specific suggestions made on the spur of the moment have to be carefully considered to see whether they are in fact wise. That this blunder got repeated so many times that it somehow got established as "the way" by some people is unfortunate, but, again, it must be fixed.

I hope it's obvious that I really value your input @Ivanidzo4ka and @stephentoub. You are enormously helpful. I generally find myself in agreement with you on most things, and I hope one disagreement doesn't put you off. But, unpleasant as I find it to do so, I can't agree in this case. I'm sorry.

Ivanidzo4ka commented 5 years ago

If you look at all those cases as input, output, input, I completely agree with you, it's an awful way to design an API. But I view them as input, output, params, and in that case I find it a nice and convenient pattern to use in code.

I've checked the examples you linked and all of them follow input, output, params (where params is optional), except Complex Example 2, which looks awful.

I've looked into the links provided by @stephentoub and they either use the same pattern of input, output, optional params, or input, params for inputs, output, params for output. I have trouble finding methods which would accept the output first, and then the input, but that's probably my lack of searching skills.

You seem to favor neither approach. Instead, what you favor, to judge from your vigorous defense, is something chimerical: Func<TIn1, TResult, TIn2, TIn3> or Func<TIn, TResult, TIn2>.

I was just trying to point out that I don't see a pattern where people specify the output result first.

This new pattern introduced a few months ago is inconsistent with our longstanding practices. The mail is informative in pointing out the discrepancy and other issues (which we must fix), but we must decide how.

Again, let me disagree: we don't have longstanding practices of developing a good public API. I agree we should have a consistent API; I just disagree with the output-first, everything-later approach.

But in any case, generally I think the most important information should go first. What I want to know out of every transform is "what are you making, and how are you making it," not "how are you making something, and what is it called?"

I hope I have correctly depicted your main reasoning behind having the output column first and the input columns second. You find it more practical, and this is why you want the output to be first. I think that's a subjective statement, and I struggle to find any examples of good C# code where that approach is implemented. (That's again subjective of me, to say "good C# code".)

So basically we are in a holy war state, little-endians vs. big-endians. How do civilized humans decide such conflicts?

TomFinley commented 5 years ago

If you look at all those cases as input, output, input, I completely agree with you, it's an awful way to design an API. But I view them as input, output, params, and in that case I find it a nice and convenient pattern to use in code.

So in your first version of the argument there are "inputs" and "outputs." Then I point out it is not so. Now there are "inputs" and "outputs" and "parameters." So let me ask: what is an "input"? What is a "parameter"? Can you define them for me, precisely and clearly, in a way I can apply and explain in all cases? I repeat again the lesson taught in my prior "Complex Example 1"...

https://github.com/dotnet/machinelearning/blob/c8755a20783ef9a424d1545ab706881c598c237d/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs#L122-L126

So, in this case, the split between "inputs" and "parameters," I guess, is that the things defining the terms, with their file and column names, are "parameters." How is this decided? What is the correct split, then? Is the termsColumn an "input" or a "parameter"? It looks kind of like an "input" to me. So should it go first? How about the file name? It seems odd that this should be separated from it, so is that also an input? What about the loader factory? An input, since it should probably not be separated from those?

You see, I hope, why I don't find this idea of "inputs" and "parameters" convincing.

It is certainly true that sometimes the API must differ from the tool that's been around for years. I have argued this myself, when it makes sense, on many occasions. An API is a different environment, after all; some problems have different natural solutions given that we're not writing a GUI or command line tool. But usually, you'll note that when we have a disparity, we actually provide reasons. In this case, it is unclear to me that the reasons why we did things the way we've done them for years have changed. Certainly I do not find merely mentioning that we sometimes do it differently to be sufficient justification.

Also, not to put too fine a point on it, but I did explicitly already talk about how futile it was to try to draw a sharp line between "some inputs go before the result, some inputs go after."

I've looked into the links provided by @stephentoub ... I was just trying to point out that I don't see a pattern where people specify the output result first.

Putting stuff into an already established (and, yes, named) buffer was not an analogy I found compelling. It doesn't reflect what's going on at all. Far closer is creating something new and giving it a name. In APIs like that, I almost always see that the name of the created thing comes before anything related to its value, or parameterization.

So: you're not providing an output. The output does not exist yet. You're creating it, right then. You're naming it. In any such API I can find, where you're naming a new item and providing how it is defined, it is usually the name of the new item that comes first, then the content.

I mean, even in the most basic data structures. Most people call them "key-value" and not "value-key" stores for a reason. Right? Plus of course, as I already pointed out, every C-derived language puts the name before the value as part of its basic syntax.
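For instance, these everyday C# idioms all lead with the name (nothing ML.NET-specific here):

using System.Collections.Generic;

public static class NameFirstIdioms
{
    public static void Demo()
    {
        // Key-value store: the key (the name) precedes the value, in both
        // the Add call and the indexer.
        var columns = new Dictionary<string, string>();
        columns.Add("Features", "concatenation of Age, Height, Weight");
        columns["Label"] = "Survived";

        // Variable declaration: the new name is introduced before its value.
        int a = 2 + 3;
        System.Console.WriteLine(a);
    }
}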

(Please, I know that I write somewhat voluminously, but, every last thing you're bringing up I've already covered. I know, I know, "tl;dr", but, please do. I'm spending so much time on this because I take it seriously.)

So basically we are in a holy war state, little-endians vs. big-endians. How do civilized humans decide such conflicts?

First you try to avoid the conflict in the first place, by trying to reach consensus. This happens of course 99% of the time. In the remaining 1% where someone is insistent in the face of all argument, ultimately, in this case as in all cases we must make a single decision, since the code cannot be in superposition. A choice must be made. So, I try to be as gentle and polite as I can, explaining the decision, continue to try to reach consensus. (Again, I invite you to read my arguments.) If it still doesn't work, well... we obviously can't continue forever, so the hard decision is made.

We've done this on some other occasions: with the entry-point based API, the "predicting model" of the v0.1 API, the idea that mutable estimators are a good idea. This one thing isn't the first, and I'm sure won't be the last. I always find it deeply unpleasant, but, what can you do.

glebuk commented 5 years ago

Leaving the order of arguments aside, let's discuss column argument names. I've always found the current default of source/name to be quite confusing.

Let's look at the current state.

   ColumnCopyingEstimator(string input, string output)
   ColumnCopyingEstimator(params (string source, string name)[] columns)
   NormalizingEstimator Normalize(… params (string input, string output)[] columns)
   NormalizingEstimator Normalize(… params NormalizingEstimator.ColumnBase[] columns)
     /* where the ctor is: */ ColumnBase(string input, string output)

We have at least three ways to declare inputs and outputs: via strings, via tuples, and via Column classes. We should come up with a solution that is readable and consistent across all those usages.

However, what are the alternatives? There are at least the following options to consider, where the same name will be used in all possible ways to define columns:

1. {name, source} - "name" is ambiguous and asymmetrical with "source"
2. {input, output} - concise and clear, but can be confused for IDV input/output - but is this an issue in this case? There are no input/output IDVs as constructor args anywhere...
3. {source, destination}
4. {inputColumn, outputColumn} - most clear but most verbose
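Rendered as hypothetical signatures for one and the same operation (invented here purely for comparison, with ordering left aside):

// The four candidate naming schemes applied to a single copy operation.
public interface INamingOptions
{
    void Copy1(string name, string source);               // option 1
    void Copy2(string input, string output);              // option 2
    void Copy3(string source, string destination);        // option 3
    void Copy4(string inputColumn, string outputColumn);  // option 4
}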

Thoughts? My personal ordered preference is: #3, #2, #4, #1.

TomFinley commented 5 years ago

Let's look at the current state.

   ColumnCopyingEstimator(string input, string output)
   ColumnCopyingEstimator(params (string source, string name)[] columns)
   NormalizingEstimator Normalize(… params (string input, string output)[] columns)
   NormalizingEstimator Normalize(… params NormalizingEstimator.ColumnBase[] columns)
     /* where the ctor is: */ ColumnBase(string input, string output)

We have at least three ways to declare inputs and outputs: via strings, via tuples, and via Column classes. We should come up with a solution that is readable and consistent across all those usages.

You will note that all of these things are, every last one of them, estimators. The fact that estimators diverged from our already established pattern is the subject of this issue. We already have a solution ready to hand: make this new IEstimator idiom introduced a few months ago consistent with what we have done for years.

However, what are the alternatives? There are at least the following options to consider, where the same name will be used in all possible ways to define columns:

1. {name, source} - "name" is ambiguous and asymmetrical with "source"
2. {input, output} - concise and clear, but can be confused for IDV input/output - but is this an issue in this case? There are no input/output IDVs as args anywhere...

Again, this discussion concerns IEstimators primarily. I will quote it in its entirety:

https://github.com/dotnet/machinelearning/blob/1b6b3c3ab22a79e593913fbc9397cef93256aa9c/src/Microsoft.ML.Core/Data/IEstimator.cs#L300-L313

You will note two methods: one for schema propagation prior to fitting, and one for actually fitting. One takes the input's schema (or, more precisely, the more provisional parallel structure, a "schema shape") to perform preliminary schema validation and propagation -- that input is named inputSchema, I hope you observe. The second fits and returns a transformer given data. That data is given the name, again, input. You will note a striking similarity in the definition of ITransformer. It is in the same file.
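For reference, the shape of that interface, paraphrased from the linked file (see the link above for the authoritative definition):

public interface IEstimator<out TTransformer>
    where TTransformer : ITransformer
{
    // Schema propagation prior to fitting; note the argument named inputSchema.
    SchemaShape GetOutputSchema(SchemaShape inputSchema);

    // Fitting: the data is, again, named input.
    TTransformer Fit(IDataView input);
}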

I mean, it's right there in the interface signature of the structure that is the subject of this discussion, right?

Thoughts? My personal ordered preference is: #3, #2, #4, #1.

Am I to understand that 3 is your first choice and 1 is your most disfavored choice?

You would favor renaming what we currently call "name" to "destination." In fact, that one change alone would shift your opinion from liking the solution we are pursuing least to liking it best; that one change would make it absolutely, perfectly clear in your head.

Let's be clear: these estimators are, in some sense, creating new columns. Everywhere that we talk about columns, what is the primary identifier for a column? Always "name." In Schema.Column, in Schema when accessing columns, in loaders when specifying columns, the array data view builders, the attributes for creating a data view out of a C# type, the data view design principles that I hope we're all very familiar with...

Let's not even talk about our code, but all similar code like SQL and data frames and Parquet files. What do they call the string identifier they use to reference a column? The column's name. Not only are we internally consistent, but we are consistent with longstanding usage of this sort of structure.

But now, after years of calling this thing "name" everywhere, we have suddenly had an epiphany that in estimators specifically (and only there, apparently) calling it name is incorrect, and it should be called destination. How can you possibly account for this disparity with our usage of this structure? Why would we introduce such an inconsistency?

You also asked for thoughts, but I will express them privately.

glebuk commented 5 years ago

We have several places where the concepts of "input column names" and "output column names" are used. Namely:

   // simple-typed estimator ctor args:
   EstimatorFactoryMethodFoo(string name, string source)
   // tuple-typed ctor args:
   EstimatorFactoryMethodFoo(params (string name, string source)[] columns)
   // column info-typed ctor args:
   EstimatorFactoryMethodFoo(Foo.ColumnBase[] columns)
        // where the Column type has fields
        class ColumnBase { string Name; string Source; }
   // readers:
   CreateTextReader(... TextLoader.Column[] columns)
        // where the Column ctor is defined as
        public Column(string name, DataKind? type, int index)
   // and a few others such as command line args and entrypoints

If we want the variable/field naming to be consistent among all of those usages, we will have to stay with the original naming convention of {name, source}, at the expense of fluency in the case of simple-typed ctor usage.

@KrzysztofCwalina @eerhardt @terrajobst - can you please weigh in?

KrzysztofCwalina commented 5 years ago

In new .NET APIs (Span transformations) we use source/destination. I think I started to use the names (in the transformation APIs) because in many existing APIs where we actually use input/output (or in/out) the direction of data flow is actually not very intuitive. For example, you might be surprised to discover that Console.In is actually something that returns TextReader that the callers read (duh).

eerhardt commented 5 years ago

Obviously there are multiple topics being discussed in this issue. And, as far as I can tell, we probably won't be able to reach a consensus without some sort of real-time meeting. However, I wanted to weigh in with my opinions:

  1. I don't think the parameter name name fits correctly with Estimators. If you are creating a new Estimator, and the parameter into the Estimator's constructor is name, then that would indicate to me that this is the name of the Estimator. Yes, when referring to Columns, the name Name is used because it is the name of the Column. Somehow, users are supposed to infer that this isn't the name of the Estimator, but instead the name of the column being created by the Estimator.
    • I do think that inputColumn/outputColumn or sourceColumn/destinationColumn would be a better choice.
  2. On the ordering topic, I also agree that the order "input columns" then "output columns" makes more sense.

since if you're specifying a transformation the most important information someone will want to know is what you're calling your output!

I don't agree that's where you'd logically start, or even how you'd want to read the code. Looking at Apache Spark's Estimators (which we've publicly said we modeled our principles and naming from), they have their samples/docs with the input column then the output column:

Scala:
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
Python:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

(Note they also refer to these things as "inputCol" and "outputCol".)

In this case, this is exact prior art for naming and ordering. This is the same scenario we are solving. Of course we shouldn't just make the decision based on what others are doing. But it does give more weight to which approach to take if others have already taken this path.

One other, minor, reason for having input first, then output, is that it reads more like a book (top to bottom, left to right) if you use that ordering.

input-first, output-second

.Append(new Estimator1("Column1", "Column2"))
.Append(new Estimator2("Column2", "Column3"))
.Append(new Estimator3("Column3", "Column4"))
etc.

See how Column2 is the last thing on the first line, and the first thing on the second line. And Column3 is the last thing on the second line, and first thing on the third.

Whereas, if you flip it, it doesn't read as nicely. The "flow" is interrupted.

output-first, input-second

.Append(new Estimator1("Column2", "Column1"))
.Append(new Estimator2("Column3", "Column2"))
.Append(new Estimator3("Column4", "Column3"))
etc.
TomFinley commented 5 years ago

@eerhardt, you've summarized, I think, the reasons why @Zruty0 thought that the arrangement might be a more intuitive one, which, again, I don't dispute; certainly people find it intuitive. Thank you for that, then. So, I'm going to explain as best I can how this conversation could become more useful. (Not with respect to you specifically.)

I get the sense people are thinking about this radically differently from how I am. Perhaps the best way to start is to summarize how we got here. (This is according to my best understanding at this moment. As near as I can tell, the current state was reached by a single line item of a part of the proposal in #1318; which is to say, this inconsistency was not accidental as I originally thought, but a purposeful act.)

If they take input or output columns, they should be 'inputColumn' and 'outputColumn', and the outputColumn should be nullable.

Fast forward a few months. The practical effects of putting that proposal into effect seem to result in a series of problems, which I've attempted to summarize both in terms of general trends and specific examples. I also have a specific proposal, which is: "hey, looks like this new way didn't work out, the old way existed for a reason and results in a more logically consistent API surface, let's go back to that." OK so far?

So, now we come to why I am not finding this discussion as productive as I might like: merely saying "I like the proposal in #1318," or "I like having inputs then outputs," or "it's like Spark," does not really help. Those observations, correct or incorrect though they may be, by themselves do nothing to help solve the problems with consistency that seem to arise out of that choice.

Indeed, I have not seen that anyone is even paying attention to, or attempting to address, the problems with consistency that have been identified. This is not to say that I think my points are unassailable or perfect. I'd be surprised if they were; I've been wrong too often to suppose that. But literally no one is bothering to dispute them. Or telling me anything I'd like to hear, like, "oh, yes, I see how it is problematic in this case... how about we solve it via the following scheme," and then laying out a specific proposal that assuages my concerns in any fashion.

Instead, this is what is happening: we just rehash the suggestion in #1318 again and again and again in one way or another (by direct appeals to intuitiveness, through analogies to this or that other API), which by itself does not help. Now, intuition and analogy are great supporting arguments for a specific proposal, but no one is giving a specific proposal except "the original idea was perfect," and I already know it is not. I'm being invited repeatedly to look at how pretty it is when there's one input and one output -- yes, that's great, but can we move to the next level of the discussion, please? Or, if unwilling or unable to do so, at least please acknowledge that maybe there's a deeper issue being discussed than merely that? Please?

To give just one example, it's been about a week since I asked for a clear definition of what is a "parameter" vs. an "input." I still haven't seen anything. But this is not unique. We also see above yet another iteration of the implicit assumption that estimators are things that map columns to columns, which spares, as usual, not one word for the cases where they don't: what specifically to do about them, the problems arising from the current schemes, etc.

So here's what I'd like. Many of them can be viewed in this way; what do we do about the ones that can't? Can we come up with a logically consistent scheme for them? What about not just estimators specifically, but other ways of "creating" columns? If you can do so in a way that addresses the problems, do so. I literally cannot do anything without that. Further, try to argue against my proposal. If we have a proposal that views them logically consistently, not just with other estimators, but with other ways of creating columns in an IDataView, what is wrong with that proposal? Is there a problem with it that I have not identified?

Again, I'm not saying these questions are not answerable. But, no one is bothering to answer them. So I am, again, not finding this discussion productive.

So, displeased though you may be, look at it from my perspective. I have to solve a practical problem that seems to arise out of a proposal that was acted upon. I have a plan to do so, and I have attempted in other posts to explain why it is appealing to me. I am sympathetic to the idea that some people like the original proposal, but that sympathy does not draw me any closer to a solution with logical consistency. By itself it just makes my job that much more unpleasant.

eerhardt commented 5 years ago

There is literally a wall of text in this issue, and trying to find the succinct reasons why name and source (and in that order) is the best approach would take someone at least 30-40 minutes. Can you succinctly list them in the original post of the issue? From the original post, this is what I can tell your arguments are:

  1. For transforms, there are a finite number of "output" columns (almost always 1). There can be 0 to many "input" columns.
  2. It has always been done that way.
  3. It is done that way in the command-line.

Are there others?

I can maybe be convinced of argument (1) above as being a valid reason for the order. But I don't see any valid reasons for naming the parameters name and source. That makes no sense to me at all.

TomFinley commented 5 years ago

Because elsewhere, when naming new columns, that is called its name, and that argument is always first, as far as I see. It's not clear to me that estimators are exceptional, and I see by direct observation that treating them as if they were exceptional leads to problems.

Regarding the wall of text that takes 30 or 40 minutes to discern: it may be that, when it comes to the consistency of one hundred components, sometimes appreciating the issue in detail is required to hold a conversation that can go anywhere.

Also, we again see the problem that I am having with this discussion. We have dissatisfaction expressed, but no alternative is being proposed. What am I supposed to do with that?

eerhardt commented 5 years ago

We have dissatisfaction expressed, but no alternative is being proposed.

I'm trying to understand all the issues being raised before proposing an alternative.

What am I supposed to do with that?

I would expect someone to read and understand the feedback, and think about any merits the feedback has.

Because elsewhere, when naming new columns, that is called its name, and that argument is always first, as far as I see.

I really want to understand why you are so insistent that the name of the output/destination column parameter be simply name, and the name of the input/source column parameter be source. You say that's the way it was everywhere; but is that regardless of the context in which it is being used?

I totally understand when you have a class/struct that is named something something Column, and it has a set of properties, like in the OneToOneColumn and ManyToOneColumn classes:

https://github.com/dotnet/machinelearning/blob/a570da14a41f2870eb8f61d84496a58422398253/src/Microsoft.ML.Data/Transforms/ColumnBindingsBase.cs#L122-L128

This makes sense because of the context in which the properties are being used - they are in a Column class. The usage of the word name is obvious - it refers to the name of the column.

However, I am not convinced that it is obvious, or even intuitive, that when I am creating an Estimator and asked to provide a name that I am really being asked for the name of the output column. I am no longer in the context of a Column.

Is the only reason you are rejecting the parameter names destinationColumn and sourceColumn(s) (or output/input) that you feel they are inconsistent with the properties Name and Source on classes like the ManyToOneColumn class above? Or are there other reasons?

Zruty0 commented 5 years ago

OK, I guess I need to weigh in as well. First of all, definitions:

@TomFinley is right: it was my suggestion to change to the current scheme ((string inputColumn, string outputColumn = null, all_other_params)), because I found it very 'natural' for the majority case.

The majority case is that we have 'one-to-one column mapping estimators.' So, in the ideal world, the only factory method would be mlContext.FooEstimator(string inputColumn, string outputColumn = null). Uncontroversial so far?

Even in this simple case, there is complexity:

So, for the regular rank-and-file 'one-to-one mappings', we already have a matrix of possible factory methods:

|                       | No extra params                                | Has extra params                                            |
|-----------------------|------------------------------------------------|-------------------------------------------------------------|
| One column pair       | string inputColumn, string outputColumn = null | string inputColumn, string outputColumn = null, otherParams |
| Multiple column pairs | (string inputColumn, string outputColumn)[]    | ColumnInfo[] columns                                        |

In hindsight, I think this 4-way matrix was not necessary. I would suggest trimming it down to only 2 options:

|                       | No extra params                                | Has extra params                                            |
|-----------------------|------------------------------------------------|-------------------------------------------------------------|
| One column pair       | string inputColumn, string outputColumn = null | string inputColumn, string outputColumn = null, otherParams |
| Multiple column pairs | ColumnInfo[] columns                           | ColumnInfo[] columns                                        |

Arguably, maybe we can take it even further? We could declare that the ONLY special treatment applies to SINGLE column pair with NO extra params:

|                       | No extra params                                | Has extra params            |
|-----------------------|------------------------------------------------|-----------------------------|
| One column pair       | string inputColumn, string outputColumn = null | params ColumnInfo[] columns |
| Multiple column pairs | params ColumnInfo[] columns                    | params ColumnInfo[] columns |

In any case, let us move from the simple majority case to the exceptions. I can identify three kinds of exceptions:

In this case, it looks hard to define a signature consistent with the above (where we specify inputs, then outputs, then other params). However, we can still do it:

mlContext.ManyToOneEstimator(string[] inputColumns, string outputColumn, extraParams);

Not exactly identical to the above, but the pattern is the same. More than one input column - good. Non-nullable output - also good (since we no longer assume that the output column name defaults to the input name).

Again, in the case of any multiple-mapping scenarios, we can back off to the 'standard' column-object approach:

mlContext.ManyToOneEstimator(params ColumnInfo[] columns);

I believe it's tricky to have a consistent interface for all these cases, but I think we can strive to adopt some form of unification: <0 or more input columns> <0 or more output columns> <all other parameters>. Essentially, we could just prohibit output columns from appearing before inputs, and parameters from appearing before inputs or outputs.
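A hypothetical family of signatures conforming to that shape (names invented for illustration):

// All of these follow <inputs> <outputs> <all other parameters>.
public interface IUnifiedShapeSketch
{
    void OneToOne(string inputColumn, string outputColumn = null, int hashBits = 31);
    void ManyToOne(string[] inputColumns, string outputColumn, int hashBits = 31);
    void MultiPair(params ColumnInfo[] columns);  // the standard fallback
}

// Stand-in for the per-transform column descriptors mentioned above.
public sealed class ColumnInfo { }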

Again, when applicable, we can back off to the good old params ColumnInfo[] columns.

Zruty0 commented 5 years ago

So, TL;DR version of the above is: let's put all inputs first, then all outputs, then all other params.

Now let me go over Tom's examples and try to come up with a unified conversion.

Simple 1

 public static HashingEstimator Hash(this TransformsCatalog.ConversionTransforms catalog, string inputColumn, string outputColumn = null, 
     int hashBits = HashDefaults.HashBits, int invertHash = HashDefaults.InvertHash) 

Simple 2

public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, string inputColumn)

This looks just like a bug to me. Should have the output column.

Complex 2

        public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
            this TransformsCatalog.ConversionTransforms catalog,
            IEnumerable<TInputType> keys,
            IEnumerable<TOutputType> values,
            params (string source, string name)[] columns)

This looks really like an oddball: parameters before everything. I am not sure what the thinking was behind it.

Simple 3

public static ColumnCopyingEstimator CopyColumns(this TransformsCatalog catalog, string inputColumn, string outputColumn)

Text

 public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog, 
     IEnumerable<string> inputColumns, 
     string outputColumn, 
     Action<TextFeaturizingEstimator.Settings> advancedSettings = null) 

Keep, Drop

These are the only estimators that drop columns, but we can still make them mostly fit the pattern:

Keep(params string[] outputColumns);
Keep(IEnumerable<string> outputColumns, bool keepHidden);
Drop(params string[] inputColumns);
Drop(IEnumerable<string> inputColumns, bool keepHidden);

Concat

Lastly, concat, I think, will suffer the most in terms of verbosity. But having params string[] inputColumns, while concise, goes against the proposed pattern, so I'm OK with this.

Concat(IEnumerable<string> inputColumns, string outputColumn);
Concat(params ColumnInfo[] columns);
Zruty0 commented 5 years ago

Reading through the rest of the comments.

  1. I second @Ivanidzo4ka's and @glebuk's sentiment that the 'destination before source' pattern is really bizarre and we should not have it.

  2. I agree with @KrzysztofCwalina's suggestion of input/output -> source/destination. sourceColumn and destColumn sound like a good pair to me (consistent with File.Copy again).

sfilipi commented 5 years ago

Moving my old comment from the PR to the issue, so all discussion is in one place, and adding a note about defaults.

1- I also prefer sourceColumn, destinationColumn (anything but the word "name" for the generated column). "name" as an API parameter is not meaningful. I agree that we are not going to be consistent with our command line usage and entry point usage, but I think the user experience is worth it.

2- After having worked through the changes adapting the new proposal, my biggest problem is that we are assigning default values to the source column, rather than making it mandatory. So by calling this:

ml.Transforms.Normalize("Induced");

on the api with the new signature:

public static NormalizingEstimator Normalize(this TransformsCatalog catalog,
           string name,
           string source = null,
            NormalizingEstimator.NormalizerMode mode = NormalizingEstimator.NormalizerMode.MinMax)

The user needs to know that there exists an input column with the name "Induced", and that if they assign that string to the output/name column param, everything will be fine.

TomFinley commented 5 years ago

Here, we are not 100% consistent (output column is not nullable, although we'd like it to be). This is because the params are required; there is no default behavior. I think this is a small price to pay for the streamlined order of 'inputs, outputs, params'.

Let's stop right there, because this is the root of where we've gone down different paths, and if we can't agree on this, we can't agree on anything. This is the root of why I consider your scheme dead wrong. You gloss over it, but this is precisely why we decided a while ago to go in the opposite direction, back when @zeahmed was writing the convenience constructors for transforms.

Look at this discussion. People have the idea of "input" and "output" so baked into their heads that, when I asked them to consider alternate cases where this wasn't quite so simple, most of them simply could not continue the discussion. How many weeks passed before someone besides me even bothered to mention this other case? That's how basic and fundamental this idea is, that people didn't even think about the alternatives.

With that in mind, let's now think about what you've just said: you've observed, as I did, that as soon as this "other required arguments" thing pops up, the scheme falls to pieces. We're talking about inconsistency in the most basic part of estimators. And your response to such a fundamental inconsistency is "small price to pay"? Really? No, no, no, no... that is a huge price to pay. This is literally the most basic thing we have to get right. My premise is that we must be consistent there, and if our scheme does not allow us to be consistent, then it is discarded. Hence this issue, where I tried to undo all this, since I view it as a gigantic blunder.

But, you know what, that's all fine. Since everyone agrees this pattern is great, maybe I'm wrong. I disagree and think you're committing an error that will haunt this API, but it is simply impossible for me to continue this discussion, however strongly I think you're all making a huge blunder. So, do what you like here.

Just, some of you, if you're at least willing to consider the possibility that this inconsistency represents a fatal flaw... for your sake at least, I'll retrace the steps that I took with @zeahmed this past summer. If you continue on this path, what I hope you'll realize is that it is useless to talk about "input" then "output" parameters and say one necessarily comes first and the other second, because due to the basic syntax of C# there are only required and unrequired parameters. If we want either one of the "input" or "output" [sic] column names to be non-required, then you will find, based on an examination of actual cases, that it is the "inputs" that tend to be most polymorphic w.r.t. their behavior (in the sense that they are sometimes absent, optional, or whatever), and so they most naturally group among the unrequired. Then among the required you will arrive at the scheme of the "output" column name being first, followed by any other required parameters (if any). But, you haven't, and that's fine. I'm only one person and I can't fight against everything.

markusweimer commented 5 years ago

Regarding the ordering of inputs and outputs, I believe there is other important related work to consider: SQL. There, the output is specified before the input: SELECT something FROM somewhere. I believe this is precisely because the output is "new", and usually short, while the input can be hideously complex with structure in its own right.

@interesaaat could be in a position to (dis)prove this perspective.

eerhardt commented 5 years ago

Here, we are not 100% consistent (output column is not nullable, although we'd like it to be)

In general, the proposal of having the output column be nullable is a bit strange to me. Instead of having a pattern:

FooEstimator(string inputColumn, string outputColumn = null, ...)

What if we split these methods into two overloads:

FooEstimator(string column, ...)
FooEstimator(string inputColumn, string outputColumn, ...)

If we have a 1-1 transform, and we want the ease of just specifying a single column name, and not the same name for both the input and output columns, then use overloads instead of nullable/default parameters for one of the columns.

Would that make everything consistent?

interesaaat commented 5 years ago

Replying to Markus: you are right, the output goes first, and I think this is because it eases composability. SQL queries (as well as Datalog, for that matter) are in fact composable, and it is much easier to get the output schema if it is specified first than if it is instead baked into the query (or last). Nevertheless, when you write queries, you always start from the input schema (it is difficult to derive the output schema without knowing what you get as input). So in practice, the output schema goes first because it is easier to use the query afterwards, but when you write it, you first look at the input schema.

This being said, output first and input later looks odd to me in a "functional API", and since apparently this is true for the majority of people, if I can put in my two cents, I would rather go with an inconsistent API (my understanding---I really tried to read everything but I failed---is that the bigger inconsistency is that the output is sometimes not nullable?) than with something that people have problems understanding/using.

TomFinley commented 5 years ago

Would that make everything consistent?

I think that falls apart for reasons of confusion, and maybe consistency, once we consider that some of these parameters are very often strings themselves. That is, your ... contains strings. (Which I know will happen lots of times, of course.)

So, without checking, let me ask you a question about .NET. We will use a hypothetical simple example to guide our thinking.

void Foo(string input, string optionalInput = null);
void Foo(string input, string output, string optionalInput = null);

Then imagine I have some other code further down that calls this method.

Foo("hello", "friend");

I'll tell you frankly, I don't know what will happen here. We might say I'm just ignorant, but I've been using .NET professionally for a decade now. I know a fair amount about its behavior, but this just seems confusing to me. What does the C# compiler do here? Does it complain? Does it favor the second one because that one is the one with two required arguments and I provided two arguments?

If it prefers the second, let us imagine a user. They write this.

Foo("hello");

They read the documentation. They see the first overload. Then they say, "hey, this optional parameter does something I want, and it is optional." They naturally change it to be this.

Foo("hello", "friend");

But lurking in the shadows is the other overload you made me add.

To be honest, I have absolutely no idea what C# does in the face of a situation like this. (I am about to find out now, but I wanted to write the above to make the point that I, a 10-year C# veteran working with it day in, day out, have absolutely no idea what will happen in this hypothetical API.)

So let me find out...

public static class Program
{
    public interface Bubba
    {
        void Foo(string input, string optionalInput = null);
        void Foo(string input, string output, string optionalInput = null);
    }

    private static void Main(string[] args)
    {
        // Note: `default` for an interface type is null, so actually running
        // this would throw; the experiment is about which overload the
        // compiler binds the call to.
        Bubba bubs = default;
        bubs.Foo("hello", "friend");
    }
}

OK. Apparently, against all my guesses, it did not complain, but it also did not prefer the second; it preferred the first. (As it turns out, overload resolution prefers the candidate that needs no default argument filled in.) That, I'd argue, is even worse from this specific perspective. If I am trying to map an input to an output column, it's not even clear to me at that point how I'd write that code. (I mean, I guess I could write bubs.Foo("hello", "friend", null); or bubs.Foo("hello", output: "friend"); to be explicit.)

But I'd argue that your proposal has, in this common case, led to a worse, not a better, world. So I'm afraid I'm not enthusiastic about this "overload" idea.

(There's also the secondary point about it entailing a lot of work, etc. etc. etc., which we ought not to be blind to, but that's a secondary point.)

This being said, output first and input later looks odd to me in a "functional API", and since apparently this is true for the majority of people, if I can put in my two cents, I would rather go with an inconsistent API (my understanding---I really tried to read everything but I failed---is that the bigger inconsistency is that the output is sometimes not nullable?) than with something that people have problems understanding/using.

I have a different view. Adjustment of initial prejudices to "get" a consistent API, even if this hypothetical person initially has trouble "getting" it, is a cost paid once. Being inconsistent is a cost paid forever. I have a clear preference in this case.

But, if all the .NET people are screaming at me that I'm wrong, they know API design better than I do, so... whatevs. Do what you want.

sfilipi commented 5 years ago

After discussing this with @Ivanidzo4ka @TomFinley @eerhardt @markusweimer the decision is to:

@stephentoub @Zruty0 @glebuk @zeahmed @justinormont @KrzysztofCwalina @interesaaat thank you all for your input.

cc @terrajobst in case he wants to discuss it further on the API review meetings. cc @CESARDELATORRE

sfilipi commented 5 years ago

Fixed with #2239