Open OfekShilon opened 1 year ago
I hesitate to comment here, because I'm a newbie to data.table
, to R's S3 class system, and to RStudio. But... maybe my perspective will be helpful to the core devteam of this package?
I wholeheartedly agree with #OfekShilon: I too "... don't expect methods of one class to operate on another - on the contrary, I expect the class system to help me by isolating their respective functionalities."
I tentatively suggest that the data.table
development effort adopt, as a long-term goal, that it become impossible to silently promote a data.frame
object to a data.table
object -- except by using DT
and perhaps fread()
. I'm reluctantly including fread()
here, because any future CRAN package which defines an fread()
method will become hazardous for use in conjunction with data.table
. For backwards compatibility, all existing automagic promotions of data.frame
objects to data.table
objects should (I suppose!) be allowed with a warning.
I'd also suggest the Introduction to data.table vignette be amended, so that it include a warning against manipulating data.table
objects in interpretive environments which don't include the data.table
library. I'm moderately confident that such careless manipulation was how I had once managed to create a data.table
object with a corrupted set of object-level attributes: my best guess is that I had done this by mistakenly invoking an undecorated save()
method (via command shell in RStudio) on a data.table
object that my codebase had created. Fortunately -- and I'm very grateful for this -- my subsequent load()
of this serialised object threw a warning which informed me of a dangling reference in its object-level attribute list.
I hope it is clear that the above are only suggestions from a newbie to R who isn't your usual sort of newbie -- because I have extensive experience with other languages, I prefer to develop in a strongly-typed language, I am happy to "hack" throwaway code in a weakly-typed scripting language, and my only prior experience with statistical programming was way back in the early 1990s -- when I used S and emacs! I'm hugely impressed with the current status of R. After my initial foray into emacs/ess, RStudio was a huge relief... although I'm still struggling with the basics e.g. I can't be bothered to figure out how to ensure thatdata.table
is loaded by default in my project environment. (Surely I shouldn't be hacking .Renviron but instead should be invoking some method in usethis
... but what is it called? I'm confident I could figure it out within a half-hour, maybe even faster -- but instead I find myself writing this overlong comment ;-)
Thanks to the lovely documentation by Hadley Wickham, and so many years of well-directed behind-the-scenes work by core R developers over the years, I'm now making great progress on my rev of vote_2.3.2!
@cthombor If I understand your intention correctly there is already an attempt to give something like "a warning against manipulating data.table objects in interpretive environments which don't include the data.table library" - check cedta, which is checked in the beginning of most (all?) data.table method calls.
This is a different matter but I do feel there is some commonality - maybe the underlying bias is to try and help everyone with everything they try to do with data.tables. Which is very generous, but often leads to fragile design.
Anyway you'd probably draw better attention for your problem if you open a new issue with concrete examples. Can you give examples of silent promotions of data.frames to data.tables?
@OfekShilon thanks for your response. I have no idea whether there are any silent promotions of data.frames to data.tables; but their possible existence is one of the (many!) gaps in my knowledge which makes it difficult for me to discover the root cause(s) of the invalid .internal.selfref
warnings I have triggered while developing.
Another gap in my knowledge: does data.table
export a method for testing the .internal.selfref
sentinel? AFAIK the answer is "no". Section 5.13 of Writing R Extensions tells me that it's not something that can be written in R. I'm asking because it'd not be very onerous for me to find a root cause for a warning -- and to develop an MRE if it turns out to be a novel cause -- by pepper-potting my code with stopifnot(valid.internal.selfref(DT))
calls along the path to the operation on DT
which produced the warning message.
And yes I believe we're on the same page with the difficulties (both to users and to its devs!) of supporting use-cases for data.table
that are outside what (I now, belatedly) understand to be its central set of use-cases -- the analysis of very large datasets of well-established provenance. I am now almost certain that a data.table
is not, and can never be, an appropriate structure for the collection and archival storage of experimental data. The secondary factors of an experimental dataset must be reliably recorded and preserved in archival storage. During analysis -- especially if there's a conveniently-accessible archival store of experimental datasets -- there's no strong data-integrity requirement on the secondary factors which have been copied into the object-level attributes of a data.table
which contains a copy of the archival record. That's my current thinking anyway: that the tidyverse might "someday" include a data-collection package that's distinct from data.table
. Rather orthogonal really: for analysis of experimental data in the tidyverse you want the experimental units indexing the rows, and the primary factors indexing the columns. Given R's column-major storage: for the collection of experimental data you want the primary factors indexing the rows and the experimental units indexing the columns!
@OfekShilon thanks for the pointer to cedta. However I won't expect every base method which could copy (or corrupt) a data.table
to include a cedta()
call; so the existence of cedta()
doesn't reduce the hazard of my foolishly manipulating a data.table
with a base method which copies it and therefore puts the integrity of my DT at risk. But... returning to the issue at hand -- yes I strongly support your idea of removing (what I understand to be) the silent (even if only temporary!) promotion of a data.frame
object to a data.table
object.
The cheat sheet for DT lists only two ways to create a data.table
. Converting from a data.frame
is easy-as... so I guess the body of that DT
test could "easily" be revised to EVAL("D |> setDT() |> DT(,.SD)"), mtcars)
but... I'm back to wondering about the feasibility of a transition in which the silent promotion you had identified in [.data.table
, and of any other silent promotions which may exist in the data.table codebase, are deprecated with a warning.
Warnings can be very confusing to a newbie like me -- indeed quite distressing, if it's unclear what one can do to reliably avoid triggering such warnings in the future -- but I doubt any of my stumblings with .SD syntax and semantics would have included an attempt to use a data.frame as a value for .SD, so deprecating this particular promotion with a warning wouldn't have bothered me.
Hmmm... maybe a search for is.data.table
in the data.table
codebase would put a quick upper-bound on the number of its methods which can -- at least under some conditions -- silently promote an R object to a data.table
?
@cthombor
Another gap in my knowledge: does
data.table
export a method for testing the.internal.selfref
sentinel? AFAIK the answer is "no". Section 5.13 of Writing R Extensions tells me that it's not something that can be written in R. I'm asking because it'd not be very onerous for me to find a root cause for a warning -- and to develop an MRE if it turns out to be a novel cause -- by pepper-potting my code withstopifnot(valid.internal.selfref(DT))
calls along the path to the operation onDT
which produced the warning message.
You can use the non-exported data.table:::selfrefok
.
data.table
does many things that can't be done in vanilla R by invoking C code. The key one in this context is address
.
While experimenting with a potential fix for #5286 this test of
DT
surfaced:This line in
[.data.table
:Seems to indicate that the desired behavior is to return a data.table. However
ans
isn't the variable that is returned directly, later on the code restores the class to data.frame before returning:The returned object has class data.frame but still has data.table attributes (namely
.internal.selfref
).Before suggesting fixes I believe some design decisions need to be explicitly discussed:
DT
operates freely on non-data-tables. The only discussion I could find of it is this single comment, and I'm not sure this is the best design.[.data.table
, as the name implies, is a method of data.table. It might accidently work in some (even most) scenarios on data.frames, but is a hefty piece of code that was designed to use data.table attributes and trying to open it would mean an avalanche of bugs and a world of pain. In general, as a user I don't expect methods of one class to operate on another - on the contrary, I expect the class system to help me by isolating their respective functionalities.In particular, rejecting non-data.table's would make #5286 (and probably others, both current and future) moot.
[.data.table
and other package methods I'd suggest usingsetDF
orsetDT
instead of settingclass
directly - assetDF
cleans up data.table-only attributes, that manual-class-set misses.