Open arunsrinivasan opened 9 years ago
`fread` is at least worth mentioning. The above points relate mainly to data transformation; `fread` is more about data extraction, so it might be skipped in such a vignette, yet IMO it is worth mentioning these data.table capabilities.
edit: which one are you going to use: Rnw or Rmd?
Agreed, and updated.
I'm curious about what makes a `by` call faster than, say, `tapply`. One part of the answer is GForce, but what about user-written functions? I could not find anything about this. There's a nice post about pandas: http://wesmckinney.com/blog/?p=489
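For context, GForce optimization can be observed directly with `verbose = TRUE` (a minimal sketch with made-up data; a user-written function in `j` does not get this optimization):

```r
library(data.table)

DT <- data.table(g = rep(1:2, each = 3), x = 1:6)

# With verbose = TRUE, data.table reports when j is internally
# optimised (GForce) for grouped mean/sum/etc.
DT[, mean(x), by = g, verbose = TRUE]

# A user-written expression in j is evaluated per group instead:
DT[, {s <- sum(x); s / .N}, by = g]
```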
One could even compare it with `sapply`. For instance, suppose I start from a list of vectors. Is it ever worth it to append all the vectors into one column of a data.table and use `by` instead of `sapply`?
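A sketch of the comparison being asked about (toy data; timings will of course depend on sizes and the function applied):

```r
library(data.table)

# A list of numeric vectors
vecs <- list(a = rnorm(1e5), b = rnorm(1e5), c = rnorm(1e5))

# Approach 1: sapply over the list
res1 <- sapply(vecs, mean)

# Approach 2: stack into one column with a group id, aggregate with by
DT <- data.table(grp = rep(names(vecs), lengths(vecs)),
                 val = unlist(vecs, use.names = FALSE))
res2 <- DT[, mean(val), by = grp]
```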
@matthieugomez interesting question! Would be nice to cover this as well. Keep'em coming :-).
I would be interested to learn about IDateTime and some of the use cases for it.
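For anyone landing here, a minimal sketch of what `IDateTime()` does (toy timestamps):

```r
library(data.table)

ts <- as.POSIXct(c("2015-01-17 18:22:05", "2015-01-18 09:00:00"),
                 tz = "UTC")

# IDateTime() splits a timestamp into integer-backed date and time
# columns (idate, itime), which are cheap to group, sort and join on
IDateTime(ts)
```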
@gsee updated.
Being new to R and data.table (since March), I would say that there needs to be a basic outcome-oriented introduction as opposed to the current function-oriented one. In other words, it is one thing to read what each parameter in data.table does, but they often make little sense without having a use-case in mind. While there are examples of output, many people need to go the other direction. That is, they know what output they need, but they don't know what function/parameter/setting is most appropriate to use. It would be helpful to have a simple recipe approach to get them started.
- How do I create subsets of my data?
- How do I do an operation on subsets of my data to create a new or updated data set?
- How do I add a new column? How do I delete a column?
- How do I create a single variable? How do I create multiple variables?
- How do I do different operations on different subsets of my data? (.BY)
- How do I use data.table in a function and pass in data.table names and columns on which to operate?
- How do I do multiple sequential operations on the same data.table?
- Can I select a subset of data and do an operation on it at the same time?
- When do I need to be careful about creating/updating variables by reference?
- How do I select one observation per group (first, last)?
- How do I set a key and how is it different from setting an index?
- Under what conditions does my key get deleted when I do an operation on my data.table?
- Can I just use the regular "merge" syntax or do I need to use data.table syntax (Y[X])?
- How do I collapse a list of lists into one big data.table? What if the columns are in different order?
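Several of those questions have one-line data.table answers; a quick sketch on a made-up toy table (all names hypothetical):

```r
library(data.table)

DT <- data.table(id = c(1, 1, 2, 2), x = 1:4)

DT[id == 1]                        # subset rows
DT[, y := x * 2]                   # add a column by reference
DT[, y := NULL]                    # delete a column by reference
DT[, .(sum_x = sum(x)), by = id]   # operate on subsets of the data
DT[, .SD[1], by = id]              # first observation per group
DT[, .SD[.N], by = id]             # last observation per group
```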
There are probably a ton of other items all on SO that could be edited into a simple compilation of questions and answers.
@markdanese Thanks for your suggestions. These are all great to have, but probably as a separate wiki, as they're very particular about certain tasks. The objective of the vignette is to introduce to data.table syntax illustrating how flexible and powerful it can be, so that you are able to do these tasks yourself.
I'm writing the vignettes now (as fast as I can), and the format is more or less in this fashion (Q&A) and explaining the answer with an example. Once I've the first vignette polished, I intend to post it here to get some feedback.. It'd be great to know what you think as well.
Thanks again.
Further extension of the idea of the wiki page: The FAQ and Code Fragments (Advanced) links listed on http://www.ats.ucla.edu/stat/r/ might be a useful resource for contrasting traditional tasks in R with data.table way. I did something like this in a blog post (http://vijaylulla.com/wp/2014/11/12/grouping-in-r-using-data-table/) to show it to my colleague. Sorry for the shameless self promotion.
I've finished the Introduction to data.table vignette (see link on top). It'd be great to know what you think.
Thanks to @jangorecki and @brodieG for great feedback, and of course @mattdowle :-).
This is really great. Wish it existed a year ago when I started using data.table. A couple of small things below for your consideration:
You might want to mention in your summary at the end that you can sort (via `order`) in `i`. You might also want to mention this at the beginning. You could also mention that there are more sophisticated joins that can be done in `i` involving keys that are not covered. That allows you to mention the main functions of `i` so the reader can look for more advanced functions if they need them. And you can hyperlink to them later.
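For illustration, sorting in `i` looks like this (toy table; data.table uses its fast internal ordering here, so `-` works even on character columns):

```r
library(data.table)

DT <- data.table(a = c(2, 1, 3), b = c("y", "x", "z"))

# order() in i sorts the result: ascending on a, descending on b
DT[order(a, -b)]
```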
In the .SD section you write "that group" but it might be clearer to say "that group defined using `by`". This also happens a little later. I might have missed it, but it would be good to be a little clearer that `.SD` with `by` essentially limits the data to the .SD columns and then creates a set of data.tables for each unique combination of the variables in the `by`. It then processes these data.tables in the order of the `by` variables using the function(s) from `j`. You could even mention that there are special symbols that allow users to access some of the indexes generated as part of that processing, but that these are beyond the scope of the introduction vignette.
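The per-group behaviour described above can be seen directly (a toy sketch):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# For each group in `by`, .SD is a data.table holding that group's
# rows for all columns except the grouping column(s)
DT[, print(.SD), by = g]

# .SDcols limits which columns .SD carries
DT[, lapply(.SD, sum), by = g, .SDcols = "x"]
```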
Again, these are just suggestions. Your hard work (and Matt's) is greatly appreciated.
On Sat, Jan 17, 2015 at 6:22 PM, Mark Danese notifications@github.com wrote:
This is really great. Wish it existed a year ago when I started using data.table. A couple of small things below for your consideration:
Thank you.
You might want to mention that you can sort (via order) in i in your summary at the end. You could also want to mention this at the beginning.
Oh snap! Great point. I should add "order(..)" at the very beginning, and will add to summary as well.
You could also mention that there are more sophisticated joins that can be done in i involving keys that are not covered. That allows you to mention the main functions of i so the reader can look for more advanced functions if they need them.
Right, will do.
And you can hyperlink to them later.
That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..
In the .SD section you write "that group" but it might be more clear to say "that group defined using by". This is also done a little later as well.
I thought I edited it out to "the current group", but apparently not.. "the current group, defined using `by`" - how does that sound?

I might have missed it, but it would be good to be a little more clear that .SD with by essentially limits the data to the .SD columns and then creates a set of data.tables for each unique combination of the variables in the by. It then processes these data.tables in the order of the by variables using the function(s) from j.
I think you missed it. It is right underneath the block quote where .SD is explained (in section 2e). And it explains exactly what you mention here...
You could even mention that there are special symbols that allow users to access some of the indexes generated as part of that processing, but that these are beyond the scope of the introduction vignette.
Right.. that's the reason for not introducing other special symbols.
Again, these are just suggestions. Your hard work (and Matt's) is greatly appreciated.
Great suggestions. I'll write back once I've uploaded the other vignettes.
— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/944#issuecomment-70375167 .
That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..
AFAIK when you push a package to CRAN which includes `Rmd` files in the `vignettes` directory, they will be automatically built to check that the vignette build succeeds, but the source package on CRAN will contain the vignettes (html) already built by you, not the ones from the CRAN build/check.
CRAN is a good place for vignettes: for many users it is the first place to look for docs/tutorials, so I think it is worth having them on CRAN.
And you can hyperlink to them later.
That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..
Don't links within a single folder work on CRAN? I haven't actually put anything up there, but this vignette links to multiple others in the same folder using relative links, and it works fine from R (obviously the link here is not from R, but if you install the package and run the vignettes the links work).
Updated with Reference Semantics vignette.
Thanks again for doing all of this.
Just one other suggestion on something to cover in a vignette -- using data.table inside your own function. Not writing a package, but just trying to automate some common tasks. There are some tricks that I have not quite figured out. Also, if there is a post somewhere on this topic, a link would be appreciated.
Finally, a vignette listing "useful" Stack Overflow posts might be helpful for topics you don't want to include in a vignette.
Just some random thoughts.
Two thoughts:
Thanks.
@arunsrinivasan Nope, I hadn't seen that, great! Another bookmark
Updated with Keys and fast binary search based subset vignette.
Very nice. I love these vignettes. Just some quick comments for consideration.
What is the purpose of taking over row names if they are not used? Or are they used by the special operators in j (like .N, .I, etc.)? I think they are used by data.table, but just not as indices. I have always been confused by the purpose of forcing the numbered row names.
Why use `unique` in the first key when accessing only the second? If you don't, you get a lot of repeated rows in the output, right? Maybe obvious, but it might be helpful to say/show what happens if you don't.
Do all keys need to be quoted? Even numeric (integer) ones? Can you use a numeric as a key? Any things to watch out for?
What if your key column has NA in it? Can you search for those and replace them (as you did in your example where you replaced 24 with 0)?
It might help to explain that `keyby` applies to the output data.table (`ans` in your example) and not the input data.table (`flights` in your example).
Can you pass a vector to the key? In other words, can you create `airport <- c("LGA", "JFK", "EWR")` and use `airport` directly in `i` in your example near the bottom? This might help set up the idea of passing a different data.table in for a merge.
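For what it's worth, this already works with keyed subsetting (a sketch assuming a toy `flights` table keyed on `origin`, as in the vignette):

```r
library(data.table)

flights <- data.table(origin = c("LGA", "JFK", "EWR", "SEA"), n = 1:4)
setkey(flights, origin)

airport <- c("LGA", "JFK", "EWR")
flights[.(airport)]   # binary-search subset using the vector in i
```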
Typo on "corresponding" ("correspondong"). One of the back ticks is missing in the vector scan section where you write "The row indices corresponding to origin == "LGA" anddest == “TPA”` are obtained using key based subset."
@markdanese regarding the
Why use unique in the first key when accessing only the second?
flights[.(unique(origin), "MIA")]
Not sure if you were asking for a better explanation, or are not aware of the more complex usage of a multiple-column key.
You cannot simply use binary search on `dest` when your key is `c(origin, dest)`; you would need `c(dest, origin)` to use binary search on `dest`. Using `.(unique(origin), "MIA")` still uses binary search, by providing all available values for the first column in the key and then selective values for the second column.
I've made an extension to use only selective columns from the key; looking at the simple example may also help you understand. My extension is not ready to be PRed to data.table master, as the memory usage does not scale as well as it could if it were developed using internal data.table functions / combined with data.table's secondary key.
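To make this concrete, a self-contained sketch (toy `flights` table, not the vignette's dataset):

```r
library(data.table)

flights <- data.table(origin = c("JFK", "JFK", "LGA"),
                      dest   = c("MIA", "BOS", "MIA"))
setkey(flights, origin, dest)

# Supply all values of the first key column, then the selective value
# for the second, so binary search can still be used on dest:
flights[.(unique(origin), "MIA")]
```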
Can you use a numeric as a key?
You can use numeric as a key; it is mentioned in the *Keys and their properties* section.
Any things to watch out for?
Not sure, but `setNumericRounding` affects numeric keys; that might be worth mentioning in the vignette.
What if your key column has NA in it? Can you search for those and replace
Yes, `is.na()` is optimized to use binary search. Try `data.table(a=c(1,NA_real_),b=c("a","b"),key="a")[.(NA_real_), .SD, verbose=TRUE]`
Also to @arunsrinivasan, the typo in:
find the matching vlaues in
Thanks Jan -- that is really helpful. I offered those questions as things that could briefly be mentioned in the vignette to help new users understand what is going on. They were things that came to mind (as a fairly new user) while reading the documentation. I can't really contribute to the code, so I am hoping to contribute by helping with the documentation.
On Fri, Jan 23, 2015 at 8:48 PM, Mark Danese notifications@github.com wrote:
Very nice. I love these vignettes. Just some quick comments for consideration.
What is the purpose of taking over row names if they are not used? Or are they used by the special operators in j (like .N, .I, etc.)? I think they are used by data.table, but just not as indices. I have always been confused by the purpose of forcing the numbered row names.
Section 1a, just above *Keys and their properties*, has the answer to this: data.tables inherit from data.frames.
Why use unique in the first key when accessing only the second? If you don't, you get a lot of repeated rows in the output, right? Maybe obvious, but it might be helpful to say/show what happens if you don't.
Again, this is explained exactly underneath in "what's happening here?". I even refer to the previous section where I lay the groundwork for explaining this one.
Do all keys need to be quoted? Even numeric (integer) ones? Can you use a numeric as a key? Any things to watch out for?
There's an example with integer columns on 2d. I thought that was sufficient?
What if your key column has NA in it? Can you search for those and replace them (as you did in your example where your replaced 24 with 0?
Good point. That's a difference with vector scan. Will try to add this.
It might help to explain that keyby applies to the output data.table ( ans in your example) and not the input data.table (flights in your example).
'keyby' was already discussed in the first vignette. But I'll see if this can be added.
Can you pass a vector to the key? In other words, can you create `airport <- c("LGA", "JFK", "EWR")` and use `airport` directly in `i` in your example near the bottom? This might help set up the idea of passing a different data.table in for a merge.
Content for next section. That is how we transition into joins.
Typo on "corresponding" ("correspondong"). One of the back ticks is missing in the vector scan section where you write "The row indices corresponding to origin == "LGA" anddest == “TPA”` are obtained using key based subset."
Thanks.
— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/944#issuecomment-71253738 .
Great work on these vignettes! My comments may be late or already covered:
`set`. Also, it would be nice to see an explanation of why the following gives an error (see here):

```r
for (j in valCols)
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))
```
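A self-contained variant of the snippet above (the table, column names and fill values here are made up for illustration), replacing NAs column by column via `set()`:

```r
library(data.table)

dt_ <- data.table(x = c(1, NA, 3), y = c(NA, 2, 3))
fills <- list(x = 0, y = -1)   # hypothetical replacement values
valCols <- names(fills)

# set() updates by reference, avoiding the overhead of [.data.table
# inside the loop
for (j in valCols)
  set(dt_, i = which(is.na(dt_[[j]])), j = j,
      value = as.numeric(fills[[j]]))
```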
Added Reshape vignette to Wiki.
Excellent functionality and vignette! Thanks Arun
On Tue, Jun 23, 2015, 21:02 Arun notifications@github.com wrote:
Added Reshape vignette https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reshape.html to Wiki https://github.com/Rdatatable/data.table/wiki/Getting-started.
— Reply to this email directly or view it on GitHub https://github.com/Rdatatable/data.table/issues/944#issuecomment-114678716 .
A man page for `patterns` would be good. Great vignette!
Isn't reshape2 required to be loaded to use these commands? If so, then that should be mentioned. I really like the focus on "wide to long" and "long to wide". I absolutely hate the syntax of reshape2 (for example, I think "make_wide" is much more clear than "dcast"). For this reason, I would not write the section headers as "melting data.tables" and "casting data.tables". That only makes sense for people who are familiar with the reshape2 package. I might begin with headers that are more universal as above ("long to wide").
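For reference, the round trip looks like this (a toy sketch, assuming a data.table version with native `melt`/`dcast` methods):

```r
library(data.table)

DT_wide <- data.table(id = 1:2, y2014 = c(10, 20), y2015 = c(11, 21))

# wide to long
DT_long <- melt(DT_wide, id.vars = "id",
                variable.name = "year", value.name = "val")

# long back to wide
dcast(DT_long, id ~ year, value.var = "val")
```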
For what it is worth, I can't get the first line of the vignette to run using a fresh R session with just data.table loaded. I have no idea why (maybe mode should be "w" and not "wb"), but
DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_default.csv")
returns
Error in download.file(input, tt, mode = "wb") : unsupported URL scheme
As always, thanks for doing this. It is really useful.
@markdanese thanks for the excellent feedback.
`reshape2` won't be required from data.table `v1.9.6`. Updated this in the vignette as well.
`https` functionality in `fread` is implemented in the devel version. So you won't be able to run that code yet with `v1.9.4`. Either update, or wait a bit :-).
Thanks for your encouragement.
@jangorecki `patterns()` won't be exported. The usage will be expanded for `[.data.table` to be used for selecting columns, `:=`, `.SDcols` etc..
@arunsrinivasan still the manual for `patterns` may help, the same way there is one for `:=`. Just because many people (I think) use `?fun` to understand the code they read.
In the join vignette it may be worth adding the corresponding SQL for each data.table join, so it can be easier to pick up for db folks. Examples of corresponding SQL statements can be found, for example, in the SO post How to join (merge) data frames (inner, outer, left, right)?.
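Something along these lines (a sketch with made-up keyed tables; the SQL in the comments is the rough equivalent):

```r
library(data.table)

X <- data.table(id = c(1, 2, 3), x = c("a", "b", "c"), key = "id")
Y <- data.table(id = c(2, 3, 4), y = c("B", "C", "D"), key = "id")

X[Y]                     # all rows of Y:  SELECT ... FROM X RIGHT JOIN Y USING (id)
Y[X]                     # all rows of X:  SELECT ... FROM X LEFT JOIN Y USING (id)
X[Y, nomatch = 0L]       # inner join:     SELECT ... FROM X INNER JOIN Y USING (id)
merge(X, Y, all = TRUE)  # full outer join
```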
Would also be cool to have some "Refugees" vignettes --
- `data.table` for Stata users
- `data.table` for SQL users
- `data.table` for Matlab users
- `data.table` for Python/`pandas` users
- `data.table` for `dplyr` users
- etc.

Like a quick-start guide, but oriented towards émigrés.
Added *Secondary indices and auto indexing* vignette. This should allow a smooth transition from subsets to joins for the next vignette I'll work on.
@arunsrinivasan isn't it more appropriate not to use "secondary" in relation to indices? It was used for keys, where it was important. Now it seems redundant once we switch to the "index" naming.
@jangorecki I think "secondary" is useful for its relation to keys (primary). Perhaps "secondary sorting" is a better description?
But the word "index" has already been used, and it looks nicer than "secondary sorting" :)
So you would just name it "auto indexing"? IMO "secondary sorting and auto indexing" feels more informative
"auto" can be somewhat misleading, as indexing should work both for auto-created indexes and for use of manually created indexes - #1422 addresses the current limitation in that matter.
I see. I'm still missing your preferred alternative -- just "Indices"?
Not perfect, but preferred over "secondary indices".
I like this latest vignette a lot. My only thought was that it might be helpful to mention what types of operations cause the index to be dropped. From my testing, it seems pretty much anything that changes the number of rows, or any operation involving the indexed column.
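The dropping behaviour described above can be sketched like this (toy table; assumes a data.table version where `setindex()`/`indices()` are available):

```r
library(data.table)

DT <- data.table(x = 1:5, y = 5:1)
setindex(DT, x)
indices(DT)        # the index on x is set

DT2 <- DT[x > 2]   # a row subset returns a table without the index
indices(DT2)       # no index carried over
```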
I thought the examples of "on" were really helpful.
@markdanese good point, will add.
Thank you for the updated vignettes with the release of v1.9.8.
The "Reference semantics" vignette refers to the `copy()` function and its new capabilities to make shallow copies (especially inside functions, something that I am really interested in):
"However we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch up on this again in the data.table design vignette."
But the design vignette is missing and the link points to an old issue. The reference manual does not provide more information on `copy()` than the one provided in the vignette. The rest of the vignettes do not provide any information on `copy`.
Will this vignette become available soon?
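For anyone else confused about `copy()` inside functions, the current deep-copy behaviour can be sketched as (a toy example, not from the vignette):

```r
library(data.table)

DT <- data.table(x = 1:3)

f_ref <- function(d) d[, y := 1L]   # := modifies the caller's table by reference

f_cpy <- function(d) {
  d <- copy(d)                      # deep copy: the caller's table stays untouched
  d[, y := 1L]
  d
}

res <- f_cpy(DT)
"y" %in% names(DT)                  # FALSE: DT was not modified
```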
+1 for internals vignette. I (and I guess a few others) am quite interested in contributing a bit on the C side of things, but am a bit intimidated by the (as it stands) 35k lines of C code... quite the learning curve to 'go it alone' -- an intro to internals could do wonders!
For Joins vignette:
Wanted to chime in and ask if contributions to the vignette are accepted from non-code contributors (like me). I am particularly interested in contributing to the joins vignette as I had quite a bit of trouble with it initially and was guided to solutions from Arun's answers on Stackoverflow, and I'd like some guidance on how to do so, if allowed.
@arunsrinivasan I see that you have a point "IDateTime vignette". Perhaps it could be included in the more general vignette suggested by @jangorecki: vignettes: timeseries - ordered observations?
In addition, I am preparing a first draft on some of the topics suggested by jan. Perhaps parts of it may be relevant for a join vignette as well? I'm happy to share if anyone may find it useful.
@zeomal such a contribution would be highly valuable and much appreciated!
@MichaelChirico, thank you. @Henrik-P, will your brief on normal joins be comprehensive - i.e. will your focus be more on timeseries? If not, I can start work on it - I haven't used rolling joins yet, so no knowledge there. :)
HTML vignette series:

Planned for v1.9.8:
- `i.col` usage as filed in #1038. d) Also cover about performance/advantages from #1232.
- [ ] ~~Cover `get()` and `mget()`. E.g., http://stackoverflow.com/q/33785747/559784~~ covered in #4304

Future releases:
- Reading large data (`fread` + `rbindlist`), ordering, ranking and set operations
- differences between `data.table()` and `data.frame()` somewhere - relevant issues: #968, #877. Perhaps slightly more in detail in the FAQ.
- `data.table` usage:
- `fread`+`fwrite` vignette, include also Convenience features of fread wiki, also https://github.com/Rdatatable/data.table/issues/2855

Finished:
- `i`, select / do in `j` and aggregations using `by`.
- (`i` and `by` in the same way as before)
- `by=.EACHI` until the vignette is done.

Minor:
- `integer64`, and promoting it for large integers.

Notes (to update current vignettes based on feedback): Please let me know if I missed anything..
Introduction to data.table:
- `order` in `i`.
- `j` while selecting/computing.
- `.SDcols` and cols in `with=FALSE` being able to select columns as `colA:colB`.

Reference semantics:
- `set*` functions here.. (`setnames`, `setcolorder` etc..)
- `set`.
- 1b) the `:=` operator is just defining ways to use it - the example there doesn't work as it just shows two different ways of using it -- Following this comment.

Keys and fast binary search based subsets:

FAQ (most appropriate here, I think):
- `readRDS()`. Update this SO post.
- `alloc.col()`, and when to use it (when you need to create multiple columns), and why. Update this SO post.