Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Integration with magrittr #1208

Closed: my-R-help closed 4 years ago

my-R-help commented 9 years ago

This is a feature request following the discussion on the mailing list.

I think it would be useful to have something like this as a short-hand form:

DT[, a %<>% some.function] 

So far one has to type

DT[, a := a %>% some.function]

or without magrittr

DT[, a := some.function(a)]

This is particularly important if a is replaced with a variable that has a long name, which is then difficult to type and read. I think there are significant savings in (programmer) efficiency to be made here, especially with longish variable names.

DavidArenburg commented 9 years ago

DT[, a := some.function(a)]

Works perfectly fine

nr0cinu commented 9 years ago

But imho

Complicated_data_table_variable_name[, a_very_very_very_very_long_variable_name := some.function(a_very_very_very_very_long_variable_name)]

isn’t perfectly fine. I like the idea of adding this convenience function.

But maybe %:>% would be better than %<>%?

DavidArenburg commented 9 years ago

You shouldn't have such strange names in your data set. It is both inconvenient and hardly maintainable. Other than that, you can store the column name in some variable and then do:

shortname <- "a_very_very_very_very_long_variable_name"
DT[, (shortname) := some.function(get(shortname))]

my-R-help commented 9 years ago

You're right, but even with variables that have intermediate length I still find the magrittr syntax much more convenient to read and write. Anyway, this is just my personal opinion.
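
For reference, the workaround of storing the column name in a variable can be sketched as follows. This is a minimal example that assumes the data.table package is installed; the table contents are made up for illustration, and `sqrt` stands in for `some.function`.

```r
library(data.table)

# Illustrative table with a deliberately long column name
DT <- data.table(a_very_very_very_very_long_variable_name = c(1, 4, 9))

# Store the column name once; `shortname` is just an illustrative variable
shortname <- "a_very_very_very_very_long_variable_name"

# (shortname) on the left makes := use the *value* of shortname as the
# column name; get(shortname) on the right fetches that column's values
DT[, (shortname) := sqrt(get(shortname))]

DT[[shortname]]
```

The long name is typed once, at the cost of an extra variable and a `get()` call, which is the trade-off the thread is weighing against a dedicated operator.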

nr0cinu commented 9 years ago

I find that it’s sometimes better to have long variable names in complex data sets to make it clear what is saved in a variable. It is a matter of personal preference. Convenience functions are by definition not required to perform a task; they just make it faster to code and often easier to understand. I have no doubt that this function would be of use to many users. But I also understand if the data.table devs don’t want to implement/maintain (too many) convenience functions, you have to draw the line somewhere ;)

geneorama commented 9 years ago

For those of you who are subscribed to this thread, please disregard the last comment (now deleted). It was silly.

mattdowle commented 9 years ago

Building on @and3k's comment, I see some value in:

DT[, a %:=>% some.function]

Think that reads better (i.e. a := and %>% together). It's a 'happy pipe'? I'm a fan of efforts to reduce variable name repetition, as written here: http://stackoverflow.com/a/10758086/403310

jangorecki commented 9 years ago

The => part of the :=> operator has some extra meaning, maybe :=: ?

DT[, a %:=:% some.function]

or :=. which directly maps as := followed by . passed to fun

mattdowle commented 9 years ago

What extra meaning does => have? The > is nice because it conveys passing the LHS as an argument to RHS. Which is why Hadley changed from the original %.% to %>%.

eantonya commented 9 years ago

My understanding was that a major part of the motivation for moving to %>% was that it's much easier to type than %.% (I'm guessing a lot of the time trying to type %.% would accidentally result in %>%).

jangorecki commented 9 years ago

I mean the greater-or-equal operator. And what about %:>% ? This would be easier to type than %:=>% or %:=:%. and3k already mentioned that one above.

franknarf1 commented 9 years ago

My vote is for %:>% or just :>.

The %'s are only there because R doesn't allow infix operators in the wild, right? Might as well keep the operators inside DT[] parsimonious.

mattdowle commented 9 years ago

Hadn't considered the typing aspect i.e. holding down shift for all characters in the operator is easier I assume. Makes sense. :> doesn't parse unfortunately. What's inside [...] still has to be valid R syntax (all arguments are parsed always before being passed unevaluated to the function) so we can't make up new operators inside [...], still have to wrap with %'s. Ok then %:>% looks good to me as well. Not like it's a huge priority but it wouldn't be hard to implement and good to have discussed.

my-R-help commented 9 years ago

Thank, %:>% looks good to me.

Just curious, why does :> not parse while := parses inside [....]? := isn't valid R syntax either, is it?

jangorecki commented 9 years ago

@my-R-help it is valid syntax, see this: "Why is := allowed as an infix operator?"
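
The parsing point made in the last few comments can be checked directly in base R. The sketch below only parses text and never evaluates it, so no `DT` object is needed; `parses` is a hypothetical helper name introduced here for illustration.

```r
# Does a piece of text parse as R code? Nothing is evaluated.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses("DT[, a := f(a)]")   # TRUE: := is a token the R parser accepts
parses("DT[, a %:>% f]")    # TRUE: any %...% operator parses
parses("DT[, a :> f]")      # FALSE: ':' followed by '>' is a syntax error

# The parsed := expression is an ordinary call whose function is `:=`
as.character(quote(a := f(a))[[1]])   # ":="
```

This is why a new operator inside `[...]` has to be wrapped in `%` signs, while `:=` could be adopted without them.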

ctbrown commented 7 years ago

+1 I agree with the OP feature request and use of the magrittr syntax. It is the best and most obvious choice for several reasons.

I strongly encourage you to not overthink this FR by introducing a new operator whose choice is just as arbitrary as magrittr's choice was.

franknarf1 commented 7 years ago

> I strongly encourage you to not overthink this FR by introducing a new operator whose choice is just as arbitrary as magrittr's choice was.

@ctbrown The proposal is for a pipe operator that does something different from the vanilla %>% from magrittr, a package that also has several other pipe operators. As long as it doesn't conflict with any of those, what's the problem?

ctbrown commented 7 years ago

I think the OP's FR was sufficiently clear, i.e. to specifically use %<>% as the combined-forward-pipe-and-assignment operator. Presumably this is because magrittr already defines %<>% for precisely this purpose. Since magrittr seems to be the dominant pipe implementation and many people seem to be using %<>%, my students and colleagues among them, it does not make sense to introduce another operator for the exact same purpose. It makes much more sense to choose a syntax that aligns with what the community has been exposed to or has already adopted. See my point about ubiquity.

Let me ask you, what do you hope to gain by introducing another operator that performs the exact same function in a different context? I can't see any benefit. Any operator you choose will be as arbitrary as magrittr's. So does it not make sense to make the whole system less arbitrary by simply following magrittr's lead here rather than make still another arbitrary syntactic decision?

MichaelChirico commented 7 years ago

Does %<>% assign by reference? If not, then they are in fact not doing the exact same thing.

ctbrown commented 7 years ago

Technically you are correct: magrittr's %<>% does not assign by reference, but this is beside the point. Within users' expectations, there is no difference. Whether the assignment is by-reference or by-value is an implementation issue, not an interface one. The OP suggested adopting magrittr's interface, not necessarily its implementation. I see the merit in the OP's suggestion; see the reasons above. I do not see the rationale for adopting something like `%:>%` or anything as arbitrary. The merit of this has not been articulated. The `%<>%` operator already exists and is actively promoted by magrittr (12th most popular package according to METACRAN). As much as anything this seems to be the standard (within the R community). The nice thing about following established practice is reducing user confusion and the need for comprehensive documentation. You get: "Oh, this is the same as magrittr, I know this: it forward-pipes and does an assignment", instead of "what is this strange %:>%? Is that a new clown emoticon?"
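
To make the by-value behaviour of magrittr's operator concrete, here is a small sketch that assumes the magrittr package is installed. The vectors `x` and `y` are illustrative names only.

```r
library(magrittr)

x <- c(1, 4, 9)
y <- x          # a second name for the same values

x %<>% sqrt     # shorthand for x <- x %>% sqrt

x               # 1 2 3
y               # still 1 4 9: only the name x was re-bound
```

Whether this difference from `:=` matters at the interface level is exactly what the rest of the thread disputes; the sketch only shows what `%<>%` does on its own.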

franknarf1 commented 7 years ago

> Within users' expectations, there is no difference.

First, I do not think that is true, or that you speak for all users; I am a user, for example. Second, if it is true, then these users should learn about the difference as they learn to use data.table.

Your "reasons above" do not hold water with me. It's not somehow going against magrittr to implement a non-overlapping pipe operator to do a distinct but related thing. To me -- and this is just my impression, just as much as everything you've been saying is yours -- this seems perfectly consistent with the "established practice" of magrittr (which I use almost as often as I use data.table).

It's perfectly possible that the use in this context would assign to multiple objects (columns) at once, which surely you would agree is quite distinct from %<>% ..? I mean

DT[, (cols) %:>% lapply(as.character) ]

And, besides modifying by reference and potentially modifying several things at once, we have the fact that we are modifying part of a thing (the data.table), which is quite different from %<>%.
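
For comparison, the multi-column update that the proposed `(cols) %:>% lapply(as.character)` would abbreviate can already be written with existing data.table syntax. A minimal sketch, assuming data.table is installed and using a made-up table:

```r
library(data.table)

DT <- data.table(id = 1:3, x = c(1.5, 2.5, 3.5), y = c(10L, 20L, 30L))
cols <- c("x", "y")

# .SD is the subset of columns named in .SDcols; lapply returns one
# converted column per name, and (cols) := assigns them back in place
DT[, (cols) := lapply(.SD, as.character), .SDcols = cols]

sapply(DT, class)
```

Here `cols` appears twice, which is the repetition a combined pipe-and-assign operator would remove.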

Anyways, since the developers have not shown any sign of doing this task any time soon (by marking this FR with a priority or milestone), how about revisiting this if it actually moves forward?

jangorecki commented 7 years ago

@ctbrown by-reference is not just a difference in implementation; it needs to be differentiated from by-value functions in the user interface. That's the whole point of the set* functions and the := operator in data.table: to clearly communicate to users what actually modifies the input of a function. It is hard to judge a standard within the R community after such a short while. R applications have been written for decades, and it is too early to declare a "new standard", which in the end is AFAIK about code formatting (nesting/unnesting); please correct me if I'm wrong. As I have said many times before, I find magrittr pipes really nice for interactive use when I want to present a chunk of code, but not really necessary when writing R packages, where the main focus is functionality. IMO if something can be modified in-place, it has to have a different operator than the one that won't modify in-place.

ctbrown commented 7 years ago

@franknarf1,

First, there was no claim as to speaking for ALL R users. That is a ridiculous assertion. The reference to "users' expectations" was to my own, presumably the OP's and several of my students who have already tried dt[ , var %<>% ... ] and asked why it did not work. Further, forcing users to "learn about the difference as they learn to use data.table" is dangerous if DT can work as one might expect. This leads to bad software design.

Second, as you point out, it is up to the individual to accept or reject arguments in favor of following the OP's suggestion of following the magrittr syntax. There have been some arguments offered as to why this would be beneficial, but few cogent arguments have been offered as to why an alternative would be superior or even beneficial. There seems to be a minor argument that the implementation is different, but that is a rather weak argument.

Additionally,

If there are arguments for/against the OP's suggestion, I would love to hear them. But the only thing I have heard is that, "because it is different". Maybe some will see this as valid, but weighed against the OP original suggestion, the alternative does not seem better.

MichaelChirico commented 7 years ago

Everyone who has ever used data.table has had to learn at some time or another about using := (probably very soon after starting). Oh no, spooky! Why isn't it <- or =?

The answer to this is one of the first things anybody learns about using data.table. It's the topic of the second intro vignette.

%<>% vs. %<:>% (or whatever it may end up being) is exactly the same distinction. So the answer is covered by Matt here:

http://stackoverflow.com/questions/7033106/why-has-data-table-defined-rather-than-overloading

ctbrown commented 7 years ago

@jangorecki

First, most users do not need to know the difference between by-reference and by-value. It is not a prerequisite of using DT that you know this. Presumably, this is why the DT syntax is so close to DF. @mattdowle could have clearly designed DT with a purely functional interface. He didn't. Presumably, one of the reasons was that DT could function as a drop-in replacement.

With respect to the set* functions, these may indicate by reference, but it is curious to note they were not named set*ByRef which would have been more clear. The functions seem to exist mostly for performing an efficient operation, turning a DF into DT and setting a key. That they can be taken to indicate a by-reference operation seems secondary.

As to :=, I think I recall @mattdowle being asked at useR UCLA why he used := instead of =. IIRC, he said he couldn't use = and := was available. IIRC, he would have preferred using =.

WRT the standard in the R community -- notorious for its lack of standards -- magrittr is as good as it gets: ubiquitously used and discussed. The OP suggests interoperability with it would be a nice feature. I agree. If you have any doubts about this, take a look at its CRAN page. Developers are using magrittr in their own packages. Moreover, package writers are not the majority of R users. But this is really a digression from the topic.

The argument you offer falls under: "DT is different from magrittr since the assignment is by reference so the syntax is different". To which the response is still: the implementation is different, true, but the interface should be the same, since it is effectively the same operation for most users, conforms to user expectation, and its true operation can be inferred from context.

jangorecki commented 7 years ago

> @mattdowle could have clearly designed DT with a purely functional interface. He didn't.

I'm glad he didn't. Locking into "purely functional" simply translates to dropping some important features that users can now use in order to write faster and more memory-efficient code. I have projects (i.e. anchormodeling) which would basically be impractical to use in a "purely functional" framework.

ctbrown commented 7 years ago

@jangorecki I totally agree. But we are starting to digress from the original proposal to the merits of DT.

ctbrown commented 7 years ago

@MichaelChirico

Thanks for bringing a sense of enlightenment to the discussion. The references stray a bit from the original proposal, but they help illustrate the points in favor of the OP's proposal, namely,

franknarf1 commented 7 years ago

> First, there was no claim as to speaking for ALL R users. That is a ridiculous assertion. The reference to "users' expectations" was to my own, presumably the OP's and several of my students

It is a counterproductive and distracting rhetorical device, I'd say, to refer to "users" when you really just mean yourself. You may also have noticed that the OP said "%:>% looks good to me."

> The action or inaction of the developers is another red herring and does not relate to the merits of the OP's proposal / request. This is a rather poor appeal-to-authority argument that has not really offered an opinion either way.

It is not an appeal to authority since I am not arguing a point there. It is an appeal to you to calm down. This may never even be implemented, so can't you defer the fuss? I imagine it will be a trivial matter to switch the name of the function after it's implemented (if it ever is), and we'll have a better sense of what exact functionality we're looking at at that point.

As far as the substantive arguments go:

I look forward to seeing your FRs for these features on https://github.com/tidyverse/magrittr/issues and hope they go through, because I would certainly use that functionality.

ctbrown commented 7 years ago

@franknarf1,

Point taken; I had missed that the OP said that "%:>% looks good to me."

Notwithstanding, it is not just me. The OP first suggested the magrittr syntax. Presumably, he thought it a good idea despite conceding to an alternative later. I had also thought it a good idea; that is what brought me here, and this was prompted by several students who have tried it. Presumably, there are others. Dismissing this as a lone viewpoint is kinda beside the point, anyhow.

Second, the argument was, in fact, an appeal-to-authority. It may also be "an appeal for me to calm down", though I am perfectly calm. In any event, the point seems off topic; it does not address the merits of the OP's suggestion. Also, the fact that this is very unlikely to be implemented does not seem to be relevant to the merits of the proposal.

I must further concede that you are correct. It will be trivial to change the name of the function once implemented. However, such a change could and likely will break any code that is developed using the feature. It makes perfect sense to spend time discussing the interface before implementing, rather than burdening the users with an incompatible change later. It is unclear what useful purpose shutting down discussion would serve.

As to the substantive arguments, they seem to advocate more for increased functionality of magrittr than address the proposed %<>% syntax. (On a personal note, I agree with you that the magrittr folks should implement your suggestions, especially the first. I am not sure if I would use the second that much.) Regardless of the proposal to the magrittr folks, there is nothing inconsistent in DT adopting your enhancements and using the magrittr %<>% operator. And I have yet to read a cogent argument for the superiority of %:=% (or an alternative) over %<>%.

ctbrown commented 7 years ago

In an effort to get back on topic, I thought it might be useful to summarize the relevant arguments.

Argument in favor of %:>%:

Arguments in favor of %<>%:

Tensibai commented 7 years ago

> Moving code to-from dplyr/magrittr <--> DT would be simpler and easier, since in some cases the syntax may be similar.

You're completely missing how modification by reference would badly break this kind of ported code. Places where you knew your original object wouldn't change will suddenly change it because the assignment method has changed. This is really why it is important to distinguish the operators (and to keep the possibility of assigning by value (copy) in the same line of code as well).

That's the same problem as when you copy a data.table vs copy a data.frame (dt2 <- dt): suddenly you scratch your head about why your original dt has been updated when you only worked on the second.

This exact precaution also invalidates your first point, as it calls for precise documentation of what the operator does; using a different operator will make it easier to find the correct documentation.
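
The `dt2 <- dt` situation described in this comment can be reproduced in a few lines. This is a minimal sketch assuming data.table is installed; all object names are illustrative.

```r
library(data.table)

# data.table: plain assignment does not copy, so := through either name
# changes the one underlying table
DT  <- data.table(a = 1:3)
DT2 <- DT
DT2[, a := a * 10L]
DT$a                  # 10 20 30: the "original" changed too

# An explicit copy() gives an independent table
DT3 <- copy(DT)
DT3[, a := a + 1L]
DT$a                  # still 10 20 30

# data.frame: modifying the second name triggers a copy
DF  <- data.frame(a = 1:3)
DF2 <- DF
DF2$a <- DF2$a * 10L
DF$a                  # 1 2 3: the original is untouched
```

Code ported from a by-value idiom to a by-reference one would silently move from the third behaviour to the first, which is the breakage being warned about.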

ctbrown commented 7 years ago

@tensibai,

Understood. Thus the "may be" part of the assertion.

my-R-help commented 7 years ago

Thanks for all your comments and feedback.

Just a minor thing (maybe I'm missing something): Does it in this specific case even make a difference whether it's assigning by reference or by value? What we want to do is to update a column (or several columns) inside the data.table. The user knows that the old column will get overwritten either way. There's no room for misunderstanding, is there? In contrast, this is very different than DTa=DTb vs. DTa=copy(DTb) (which I'm not talking about in this feature request), where we're dealing with the data.table itself, and where it does make a difference whether we assign by reference or by value.

ctbrown commented 7 years ago

@my-R-help,

Your intuition is correct.

It does not make a difference to the user how this operation is implemented. From users' perspective the results are the same -- values in the column are reassigned. There have been some arguments stating that there should be some differentiation, but there hasn't been a cogent explanation as to why.

Your proposal of adopting the %<>% syntax is a sound one. It correctly assumes the implementation is distinct from the interface, and since there is a popular and extant practice for performing the operation, it should be adopted. This, in fact, follows good software design practice.

( As a side note, I was a little disheartened when you stated, "Thank (SIC), %:>% looks good to me." and did not more forcefully advocate for your initial intuition and proposal. In any event, thanks for the proposal. It is brilliant whether it is implemented in DT or not.)

my-R-help commented 7 years ago

Thanks for your reply, Christopher.

Just to clarify, I personally have a preference for %<>% because it's more consistent with magrittr and a lot of people seem to be using it. However, if the data.table devs prefer another operator (e.g. %:>%), I can also live with that (although I personally prefer the magrittr way).

Maybe I should have phrased it that way. Sorry if it caused any confusion.

Tensibai commented 7 years ago

> The user knows that the old column will get overwritten either way. There's no room for misunderstanding, is there?

> It does not make a difference to the user how this operation is implemented. From users' perspective the results are the same -- values in the column are reassigned.

I still feel there's room for a foot-gun with joins.

Having two operators with the same name that behave a little differently in their side effects is error-prone and will lead to confusion. I can't argue better than that, but there's a reason why R warns you when a package masks a base function or when loading a package overloads another package's function.

In my opinion, it does make a difference for at least some users to have specific operators when the side effects will be different.

Bonus: searching for the operator, you'll end up on the DT page explaining its caveats/limitations with no doubt, instead of having two choices in the help.

Here we're talking about a language, not a user interface. While I agree that a final software user shouldn't care about the implementation behind X button, I highly disagree a programmer should not care about the implementation behind a function.

Major objection being: someone thinking %<>% will behave the same as outside a DT will go crazy when it overwrites their DT columns unintentionally.

TL;DR: Programming is not a UX, you have to be specific about what you want, hence reusing well-known names should not happen.

ctbrown commented 7 years ago

@Tensibai,

The claim that the side-effects are somehow different is dubious. In each case, a variable reassignment is being performed. They are both side-effects. The implementation (by-ref or by-value) doesn't truly distinguish these, since the comparative end states of both systems have changed in analogous ways.

Even if the side-effects are different, the distinction is rather unimportant. This point has been raised repeatedly in the above discussion. If the distinction were important, it should be possible to provide an example where it would make a difference to the user. The lack of a counterexample, while not conclusive, is a strong indication that there is no distinction.

With respect to:

> you have to be specific about what you want, hence reusing well-known names should not happen.

This is just wrong. Reuse of common, well-known names not only should happen, it is very common and is considered good programming practice. This is called polymorphism. It is perfectly acceptable to have methods with the same name that are implemented differently:

person.speak()   # "hello, world"
dog.speak()      # "woof"

speak was used in each case. Is this bad practice? No. In fact, if polymorphism were not adopted it would be a disaster; every function and method would have its own name. While this is a fairly generic example based on OO languages, R is no different. R has S3 methods and generic functions that work in similar ways.
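
The S3 analogue of the `speak` example might look like the following base R sketch. The classes `person` and `dog` and the generic `speak` are hypothetical names taken from the pseudocode above, not part of any package.

```r
# One generic name, dispatched on the class of its first argument
speak <- function(x, ...) UseMethod("speak")

speak.person <- function(x, ...) "hello, world"
speak.dog    <- function(x, ...) "woof"

person <- structure(list(), class = "person")
dog    <- structure(list(), class = "dog")

speak(person)   # "hello, world"
speak(dog)      # "woof"
```

Note that both methods here are free of side effects; the disagreement in the thread is over whether the same reuse of a name is acceptable when one implementation modifies its input in place.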

The suggestion that:

> I highly disagree a programmer should not care about the implementation behind a function.

is similarly flawed and is counter to most users' experience. Most programmers probably use hundreds of functions/methods. They do so without knowing their implementation details. The user does need to know the input and the output/side-effects for the functions to be useful, but how it gets there is most often irrelevant. Granted, users sometimes need to know details in order to tweak or debug a function, but it can be argued that this is a small minority of cases. Consider a world where users had to know how each and every function worked at all levels. The cognitive load would be immense; programming anything of complexity would be an impossible task.

With respect to:

> Bonus: searching for the operator, you'll end up on the DT page explaining its caveats/limitations with no doubt, instead of having two choices in the help.

This is not a bonus but a liability, by a) introducing confusion (how is this different from the very popular magrittr package, exactly?) and b) creating the need for additional unneeded documentation in the first place. If the magrittr syntax is adopted, DT devs can say: "go there and read their docs and vignette; DT supports what they are doing there." This cooperation and cross-package borrowing raises the value of DT, magrittr and the R ecosystem.

Lastly, it might be inferred from the comments about "user interface", "X button" and "UX" that there was a specific UI implied. That is simply not the case. And, while it is abundantly clear we are speaking about a language, it is erroneous to say that the language lacks an interface. The interface is its syntax and it is important.

jangorecki commented 4 years ago

To summarize, so the issue can be eventually resolved.

All we need is to handle the following translation.

DT[, a %<:>% fun] ## or "%:>%"

DT[, a := fun(a)]

Is that right?

How should it behave if a is not a symbol but a character variable?

DT[, "a" %<:>% fun]

DT[, "a" := fun(a)]   ## this?
DT[, "a" := fun("a")] ## or this?

What if its length is not 1?

DT[, c("a","b") %<:>% fun]

DT[, c("a","b") %<:>% fun(a, b)]
DT[, c("a","b") %<:>% fun("a","b")]
DT[, c("a","b") %<:>% lapply(list(a, b), fun)]
DT[, c("a","b") %<:>% lapply(c("a", "b"), fun)]

Personally speaking I would close it as won't fix because it adds quite a lot of complexity and does not solve any new problem. I see agreement on that, thus closing; we can always re-open if really needed.
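
For readers who still want the shorthand, one way to get it without a new operator is a small user-level helper. The sketch below assumes data.table is installed; `modify_cols` is a hypothetical function, not part of data.table, and it settles the ambiguity raised above by one arbitrary choice: `fun` is applied to each named column separately.

```r
library(data.table)

# Hypothetical helper: apply `fun` to each named column and write the
# result back by reference using data.table's set()
modify_cols <- function(DT, cols, fun, ...) {
  stopifnot(is.data.table(DT), is.character(cols), all(cols %in% names(DT)))
  for (col in cols) {
    set(DT, j = col, value = fun(DT[[col]], ...))
  }
  invisible(DT)
}

DT <- data.table(a = c(1, 4, 9), b = c(16, 25, 36))
modify_cols(DT, c("a", "b"), sqrt)
DT
```

Because the helper is an ordinary function with an explicit name, it sidesteps the operator-naming debate, at the cost of living outside the `DT[...]` syntax.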