Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Vignettes #944

Open arunsrinivasan opened 9 years ago

arunsrinivasan commented 9 years ago

HTML vignette series:

Planned for v1.9.8


Future releases


Finished:


Minor:


Notes (to update current vignettes based on feedback): Please let me know if I missed anything.

Introduction to data.table:

jangorecki commented 9 years ago

fread is at least worth mentioning. The points above relate mainly to data transformation, whereas fread is more about data extraction, so it could be skipped in such a vignette; still, IMO such data.table capabilities are worth mentioning.

edit: which one you are going to use: Rnw or Rmd?

arunsrinivasan commented 9 years ago

Agreed, and updated.

matthieugomez commented 9 years ago

I'm curious about what makes a cold by faster than, say, tapply. One part of the answer is GForce, but what about user-written functions? I could not find anything about this. There's a nice post about pandas (http://wesmckinney.com/blog/?p=489). One could even compare it with sapply. For instance, suppose I start from a list of vectors. Is it ever worth appending all the vectors into one column of a data.table and using by instead of sapply?
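
A minimal sketch of the comparison asked about here, assuming a small made-up list of vectors (`vecs` and the column names `id` and `value` are hypothetical). It only shows that the three routes give the same answer; which one is faster depends on data size and on the function used, and would need to be timed, for example with `system.time()`.

```r
library(data.table)

# Toy input: a named list of numeric vectors (made up for illustration)
set.seed(1)
vecs <- list(a = rnorm(5), b = rnorm(3), c = rnorm(4))

# Route 1: sapply over the list directly
res_sapply <- sapply(vecs, mean)

# Route 2: stack all vectors into one long data.table and group with by
DT <- data.table(
  id    = rep(names(vecs), lengths(vecs)),
  value = unlist(vecs, use.names = FALSE)
)
res_by <- DT[, .(mean = mean(value)), by = id]

# Route 3: tapply on the stacked columns
res_tapply <- tapply(DT$value, DT$id, mean)
```

Stacking costs one `unlist()` up front, so it is mostly of interest when the same long table is then reused for several grouped computations.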

arunsrinivasan commented 9 years ago

@matthieugomez interesting question! Would be nice to cover this as well. Keep 'em coming :-).

gsee commented 9 years ago

I would be interested to learn about IDateTime and some of the use cases for it.

arunsrinivasan commented 9 years ago

@gsee updated.

markdanese commented 9 years ago

Being new to R and data.table (since March), I would say that there needs to be a basic outcome-oriented introduction as opposed to the current function-oriented one. In other words, it is one thing to read what each parameter in data.table does, but they often make little sense without having a use-case in mind. While there are examples of output, many people need to go the other direction. That is, they know what output they need, but they don't know what function/parameter/setting is most appropriate to use. It would be helpful to have a simple recipe approach to get them started.

  1. How do I create subsets of my data?
  2. How do I do an operation on subsets of my data to create a new or updated data set?
  3. How do I add a new column?
  4. How do I delete a column?
  5. How do I create a single variable?
  6. How do I create multiple variables?
  7. How do I do different operations on different subsets of my data? (.BY)
  8. How do I use data.table in a function and pass in data.table names and columns on which to operate?
  9. How do I do multiple sequential operations on the same data.table?
  10. Can I select a subset of data and do an operation on it at the same time?
  11. When do I need to be careful about creating/updating variables by reference?
  12. How do I select one observation per group (first, last)?
  13. How do I set a key and how is it different from setting an index?
  14. Under what conditions does my key get deleted when I do an operation on my data.table?
  15. Can I just use the regular "merge" syntax or do I need to use data.table syntax (Y[X])?
  16. How do I collapse a list of lists into one big data.table?
  17. What if the columns are in different order?
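
To make the recipe idea concrete, here is a rough sketch of how a few of these questions could be answered. The table `DT` and its columns are invented for illustration, and this is only one possible way to do each task.

```r
library(data.table)

# Made-up example table
DT <- data.table(
  grp = c("a", "a", "b", "b", "b"),
  x   = 1:5,
  y   = c(10, 20, 30, 40, 50)
)

# Create a subset: filter rows in i
sub <- DT[grp == "b"]

# Operate on subsets: j is evaluated once per group defined in by
agg <- DT[, .(total = sum(y)), by = grp]

# Add a new column by reference (DT itself is modified)
DT[, z := x * y]

# Delete a column by reference
DT[, z := NULL]

# Select the first and last observation per group
first_last <- DT[, .SD[c(1L, .N)], by = grp]

# Multiple sequential operations: chain [ ] calls
chained <- DT[x > 1][, .(n = .N), by = grp]
```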

There are probably a ton of other items all on SO that could be edited into a simple compilation of questions and answers.

arunsrinivasan commented 9 years ago

@markdanese Thanks for your suggestions. These are all great to have, but probably as a separate wiki, as they're very particular to certain tasks. The objective of the vignette is to introduce data.table syntax, illustrating how flexible and powerful it can be, so that you are able to do these tasks yourself.

I'm writing the vignettes now (as fast as I can), and the format is more or less in this fashion (Q&A), explaining each answer with an example. Once I've polished the first vignette, I intend to post it here to get some feedback. It'd be great to know what you think as well.

Thanks again.

vlulla commented 9 years ago

Further extension of the idea of the wiki page: The FAQ and Code Fragments (Advanced) links listed on http://www.ats.ucla.edu/stat/r/ might be a useful resource for contrasting traditional tasks in R with data.table way. I did something like this in a blog post (http://vijaylulla.com/wp/2014/11/12/grouping-in-r-using-data-table/) to show it to my colleague. Sorry for the shameless self promotion.

arunsrinivasan commented 9 years ago

I've finished the Introduction to data.table vignette (see link on top). It'd be great to know what you think.

Thanks to @jangorecki and @brodieG for great feedback, and of course @mattdowle :-).

markdanese commented 9 years ago

This is really great. Wish it existed a year ago when I started using data.table. A couple of small things below for your consideration: You might want to mention that you can sort (via order) in i in your summary at the end. You might also want to mention this at the beginning. You could also mention that there are more sophisticated joins that can be done in i involving keys that are not covered. That allows you to mention the main functions of i so the reader can look for more advanced functions if they need them. And you can hyperlink to them later.

In the .SD section you write "that group" but it might be more clear to say "that group defined using by". This is also done a little later as well.

I might have missed it, but it would be good to be a little more clear that .SD with by essentially limits the data to the .SD columns and then creates a set of data.tables for each unique combination of the variables in the by. It then processes these data.tables in the order of the by variables using the function(s) from j. You could even mention that there are special symbols that allow users to access some of the indexes generated as part of that processing, but that these are beyond the scope of the introduction vignette.
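
A small sketch of the behaviour described above, using an invented table (the column names `grp`, `u`, `v` and `w` are hypothetical). `.SDcols` limits `.SD` to the chosen columns, and `by` makes `j` run once per group, with `.SD` being a data.table holding that group's rows.

```r
library(data.table)

# Made-up example table
DT <- data.table(
  grp = c("a", "a", "b", "b"),
  u   = 1:4,
  v   = c(2, 4, 6, 8),
  w   = letters[1:4]
)

# .SD contains only u and v; lapply() is applied to it once per group
means <- DT[, lapply(.SD, mean), by = grp, .SDcols = c("u", "v")]

# Inspect what .SD looks like inside each group
info <- DT[, .(is_dt = is.data.table(.SD),
               rows  = nrow(.SD),
               cols  = paste(names(.SD), collapse = ",")),
           by = grp, .SDcols = c("u", "v")]
```

With `by`, groups appear in the result in order of first appearance; `keyby` would sort them instead.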

Again, these are just suggestions. Your hard work (and Matt's) is greatly appreciated.

arunsrinivasan commented 9 years ago

On Sat, Jan 17, 2015 at 6:22 PM, Mark Danese notifications@github.com wrote:

This is really great. Wish it existed a year ago when I started using data.table. A couple of small things below for your consideration:

Thank you.

You might want to mention that you can sort (via order) in i in your summary at the end. You might also want to mention this at the beginning.

Oh snap! Great point. I should add "order(..)" at the very beginning, and will add to summary as well.

You could also mention that there are more sophisticated joins that can be done in i involving keys that are not covered. That allows you to mention the main functions of i so the reader can look for more advanced functions if they need them.

Right, will do.

And you can hyperlink to them later.

That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..

In the .SD section you write "that group" but it might be more clear to say "that group defined using by". This is also done a little later as well.

I thought I had changed it to "the current group", but apparently not. "by the current group, defined using by" - how does that sound?

I might have missed it, but it would be good to be a little more clear that .SD with by essentially limits the data to the .SD columns and then creates a set of data.tables for each unique combination of the variables in the by. It then processes these data.tables in the order of the by variables using the function(s) from j.

I think you missed it. It is right underneath the block quote where .SD is explained (in section 2e). And it explains exactly what you mention here...

You could even mention that there are special symbols that allow users to access some of the indexes generated as part of that processing, but that these are beyond the scope of the introduction vignette.

Right.. that's the reason for not introducing other special symbols.

Again, these are just suggestions. Your hard work (and Matt's) is greatly appreciated.

Great suggestions. I'll write back once I've the other vignettes uploaded.


jangorecki commented 9 years ago

That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..

AFAIK when you push a package to CRAN that includes Rmd files in the vignettes directory, they are automatically built to check that the vignette build succeeds, but the source package on CRAN will contain the vignettes (HTML) already built by you, not the ones from the CRAN build/check. CRAN is a good place for vignettes, since for many users it is the first place to look for docs/tutorials, so I think it is worth having them on CRAN.

brodieG commented 9 years ago

And you can hyperlink to them later.

That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..

Don't links within a single folder work on CRAN? I haven't actually put anything up there, but this vignette links to multiple others in the same folder using relative links and works fine from R (obviously the link is not from R, but if you install the package and run the vignettes the links work).

arunsrinivasan commented 9 years ago

Updated with Reference Semantics vignette.

markdanese commented 9 years ago

thanks again for doing all of this.

just one other suggestion on something to cover on a vignette -- using data.table inside your own function. not writing a package, but just trying to automate some common tasks. there are some tricks that I have not quite figured out. also if there is a post somewhere on this topic, a link would be appreciated.
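
One possible pattern for this, as a rough sketch rather than the recommended way: pass column names as character strings and refer to them with `get()`, `(new_col) :=` and a character `by`. The helper `add_group_mean()` and its arguments are hypothetical.

```r
library(data.table)

# Hypothetical helper: all column names arrive as character strings
add_group_mean <- function(dt, value_col, group_col, new_col = "grp_mean") {
  out <- copy(dt)  # deep copy, so := below does not touch the caller's table
  # (new_col) on the left of := means "use the string stored in new_col"
  out[, (new_col) := mean(get(value_col)), by = c(group_col)]
  out[]
}

DT  <- data.table(g = c("a", "a", "b"), val = c(1, 3, 10))
res <- add_group_mean(DT, "val", "g")
```

Dropping the `copy()` would make the function update `DT` in place, which is faster but changes the caller's data.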

finally, a vignette listing "useful" stack overflow posts might be helpful for topics you don't want to include in a vignette.

just some random thoughts.

juancentro commented 9 years ago

Two thoughts :

arunsrinivasan commented 9 years ago

Thanks.

  1. Have you seen this?
  2. That'll most likely be covered in a separate vignette. But no plans yet.

juancentro commented 9 years ago

@arunsrinivasan Nope, I hadn't seen that, great! Another bookmark

arunsrinivasan commented 9 years ago

Updated with Keys and fast binary search based subset vignette.

markdanese commented 9 years ago

Very nice. I love these vignettes. Just some quick comments for consideration.

What is the purpose of taking over row names if they are not used? Or are they used by the special operators in j (like .N, .I, etc.)? I think they are used by data.table, but just not as indices. I have always been confused by the purpose of forcing the numbered row names.

Why use unique in the first key when accessing only the second? If you don't, you get a lot of repeated rows in the output, right? Maybe obvious, but it might be helpful to say/show what happens if you don't.

Do all keys need to be quoted? Even numeric (integer) ones? Can you use a numeric as a key? Any things to watch out for?

What if your key column has NA in it? Can you search for those and replace them (as you did in your example where you replaced 24 with 0)?

It might help to explain that keyby applies to the output data.table (ans in your example) and not the input data.table (flights in your example).

Can you pass a vector to the key? In other words, can you create airport <- c("LGA", "JFK", "EWR") and use airport directly in i in your example near the bottom? This might help set up the idea of passing a different data.table in for a merge.

Typo on "corresponding" ("correspondong"). One of the back ticks is missing in the vector scan section where you write "The row indices corresponding to origin == "LGA" and dest == "TPA" are obtained using key based subset."

jangorecki commented 9 years ago

@markdanese regarding the

Why use unique in the first key when accessing only the second?

flights[.(unique(origin), "MIA")]

Not sure whether you were asking for a better explanation or are not aware of the more complex usage of a multiple-column key. You cannot simply use binary search on dest when your key is c(origin, dest); you would need c(dest, origin) to use binary search on dest. Using .(unique(origin), "MIA") uses binary search by providing all available values for the first key column and then selected values for the second. I've made an extension to use only selected columns from the key; looking at the simple example may also help you understand. My extension is not ready to be a PR to data.table master, as its memory usage does not scale as well as it could if it were developed using internal data.table functions or combined with data.table's secondary key.
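
A toy illustration of this point, with a made-up `flights` table (the real vignette data is not reproduced here). The key is `c(origin, dest)`, so a lookup on `dest` alone has to supply every value of `origin` as well.

```r
library(data.table)

# Made-up stand-in for the flights data
flights <- data.table(
  origin = c("JFK", "JFK", "LGA", "EWR", "LGA"),
  dest   = c("MIA", "LAX", "MIA", "MIA", "TPA"),
  delay  = c(5, 10, -3, 7, 0)
)
setkey(flights, origin, dest)  # sorts rows by origin, then dest

# Key-based subset: every origin for the first key column, "MIA"
# (recycled) for the second
by_key <- flights[.(unique(origin), "MIA")]

# The same rows via an ordinary logical subset on dest
by_scan <- flights[dest == "MIA"]
```

In this toy data every origin has a flight to MIA. If one did not, the key-based form would return a row of NA for it unless `nomatch` is set to drop non-matches.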

Can you use a numeric as a key?

You can use numeric as key, it is mentioned in Keys and their properties section.

Any things to watch out for?

Not sure, but setNumericRounding affects the numeric key; it might be worth mentioning in the vignette.

What if your key column has NA in it? Can you search for those and replace

Yes, the is.na() is optimized to use binary search. Try data.table(a=c(1,NA_real_),b=c("a","b"),key="a")[.(NA_real_), .SD ,verbose=TRUE]
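
Spelled out a little further, as a sketch on an invented three-row table: the NA can be looked up through the key and then replaced by reference, much like the replacement of 24 with 0 mentioned above.

```r
library(data.table)

# Made-up table keyed on a numeric column that contains an NA
DT <- data.table(a = c(1, NA_real_, 3), b = c("x", "y", "z"), key = "a")

# Key-based lookup of the NA row
na_row <- DT[.(NA_real_)]

# Replace the NA by reference; this updates the key column itself,
# so the key is expected to be dropped afterwards
DT[.(NA_real_), a := 0]
```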

Also to @arunsrinivasan, the typo in:

find the matching vlaues in

markdanese commented 9 years ago

Thanks Jan -- that is really helpful. I offered those questions as things that could briefly be mentioned in the vignette to help new users understand what is going on. They were things that came to mind (as a fairly new user) while reading the documentation. I can't really contribute to the code, so I am hoping to contribute by helping with the documentation.

arunsrinivasan commented 9 years ago

On Fri, Jan 23, 2015 at 8:48 PM, Mark Danese notifications@github.com wrote:

Very nice. I love these vignettes. Just some quick comments for consideration.

What is the purpose of taking over row names if they are not used? Or are they used by the special operators in j (like .N, .I, etc.)? I think they are used by data.table, but just not as indices. I have always been confused by the purpose of forcing the numbered row names.

Section 1a, just above Keys and their properties has answer to this. Data.tables inherit from data.frames.

Why use unique in the first key when accessing only the second? If you don't, you get a lot of repeated rows in the output, right? Maybe obvious, but it might be helpful to say/show what happens if you don't.

Again, this is explained exactly underneath in "what's happening here?". I even refer to the previous section where I lay the groundwork for explaining this one.

Do all keys need to be quoted? Even numeric (integer) ones? Can you use a numeric as a key? Any things to watch out for?

There's an example with integer columns on 2d. I thought that was sufficient?

What if your key column has NA in it? Can you search for those and replace them (as you did in your example where you replaced 24 with 0)?

Good point. That's a difference with vector scan. Will try to add this.

It might help to explain that keyby applies to the output data.table ( ans in your example) and not the input data.table (flights in your example).

'keyby' was already discussed in the first vignette. But I'll see if this can be added.

Can you pass a vector to the key? In other words, can you create airport <- c("LGA", "JFK", "EWR") and use airport directly in i in your example near the bottom? This might help set up the idea of passing a different data.table in for a merge.

Content for next section. That is how we transition into joins.

Typo on "corresponding" ("correspondong"). One of the back ticks is missing in the vector scan section where you write "The row indices corresponding to origin == "LGA" and dest == "TPA" are obtained using key based subset."

Thanks.


smartinsightsfromdata commented 9 years ago

Great work on these vignettes! My comments may be late or already covered:

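# valCols, dt_ and originTable come from the commenter's own code (not shown)
for (j in valCols) {
  # fill the rows where column j is NA, assigning by reference with set()
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))
}
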
arunsrinivasan commented 9 years ago

Added Reshape vignette to Wiki.

juancentro commented 9 years ago

Excellent functionality and vignette! Thanks Arun

On Tue, Jun 23, 2015, 21:02 Arun notifications@github.com wrote:

Added Reshape vignette https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reshape.html to Wiki https://github.com/Rdatatable/data.table/wiki/Getting-started.


jangorecki commented 9 years ago

A man page for patterns would be good. Great vignette.

markdanese commented 9 years ago

Isn't reshape2 required to be loaded to use these commands? If so, then that should be mentioned. I really like the focus on "wide to long" and "long to wide". I absolutely hate the syntax of reshape2 (for example, I think "make_wide" is much clearer than "dcast"). For this reason, I would not write the section headers as "melting data.tables" and "casting data.tables". That only makes sense for people who are familiar with the reshape2 package. I might begin with headers that are more universal as above ("long to wide").

For what it is worth, I can't get the first line of the vignette to run using a fresh R session with just data.table loaded. I have no idea why (maybe mode should be "w" and not "wb"), but DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_default.csv") returns Error in download.file(input, tt, mode = "wb") : unsupported URL scheme

As always, thanks for doing this. It is really useful.

arunsrinivasan commented 9 years ago

@markdanese thanks for the excellent feedback.

  1. reshape2 won't be required from data.table v1.9.6. Updated this in the vignette as well.
  2. Added 'wide to long' and 'long to wide' to titles, and other places to avoid confusion to people who are new to this topic.
  3. https functionality in fread is implemented in the devel version. So you won't be able to run that code yet with v1.9.4. Either update, or wait a bit :-).
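
For readers new to the terminology, a minimal "wide to long" and "long to wide" round trip might look like this, assuming a data.table version in which `melt()` and `dcast()` work without reshape2 loaded. The table and column names are invented.

```r
library(data.table)

# Made-up wide table: one row per id, one column per test
wide <- data.table(id = 1:2, score_a = c(10, 20), score_b = c(30, 40))

# Wide to long: one row per (id, test) pair
long <- melt(wide,
             id.vars       = "id",
             measure.vars  = c("score_a", "score_b"),
             variable.name = "test",
             value.name    = "score")

# Long to wide: back to one column per test
back <- dcast(long, id ~ test, value.var = "score")
```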

Thanks for your encouragement.

@jangorecki patterns() won't be exported. The usage will be expanded for [.data.table to be used for selecting columns, :=, .SDcols etc.

jangorecki commented 9 years ago

@arunsrinivasan still the manual for patterns may help, the same way there is one for :=. Just because many people (I think) use ?fun to understand the code they read.

jangorecki commented 8 years ago

In the join vignette it may be worth adding corresponding SQL examples of data.table joins so it can be easier to pick up for db guys. Examples of corresponding SQL statements can be found, for example, in the SO question How to join (merge) data frames (inner, outer, left, right)?.
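
As a rough sketch of what such a side-by-side could look like, with two invented tables `X` and `Y` joined on `id`. The SQL in the comments is only the approximate equivalent, and column order in the result differs between the forms.

```r
library(data.table)

# Made-up tables sharing an id column
X <- data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c"))
Y <- data.table(id = c(2L, 3L, 4L), y = c("B", "C", "D"))

# SELECT * FROM X INNER JOIN Y ON X.id = Y.id
inner <- X[Y, on = "id", nomatch = 0L]

# SELECT * FROM X LEFT JOIN Y ON X.id = Y.id   (all rows of X kept)
left <- Y[X, on = "id"]

# SELECT * FROM X RIGHT JOIN Y ON X.id = Y.id  (all rows of Y kept)
right <- X[Y, on = "id"]

# SELECT * FROM X FULL OUTER JOIN Y ON X.id = Y.id
full <- merge(X, Y, by = "id", all = TRUE)
```

Note that in `A[B, on = ...]` it is the rows of `B` that are all kept, which is why the left join on `X` is written `Y[X]`.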

MichaelChirico commented 8 years ago

Would also be cool to have some "Refugees" vignettes --

etc. Like a quick-start guide, but oriented towards émigrés.

arunsrinivasan commented 8 years ago

Added Secondary indices and auto indexing vignette. This should allow smooth transition from subsets to joins for the next vignette I'll work on.
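
For anyone skimming the thread, a minimal sketch of the idea on an invented table: `setindex()` records an ordering for a column without physically reordering the rows, and `on=` allows a key-style subset on that column.

```r
library(data.table)

# Made-up table
DT <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 delay  = c(3, 8, -1, 4))

# Build a secondary index on origin; row order is left unchanged
setindex(DT, origin)
idx <- indices(DT)

# Subset with binary search via on=, no setkey() required
res <- DT[.("JFK"), on = "origin"]
```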

jangorecki commented 8 years ago

@arunsrinivasan isn't it more appropriate not to use "secondary" in relation to indices? It was used for keys, where it was important. Now it seems redundant once we switch to index naming.

MichaelChirico commented 8 years ago

@jangorecki I think "secondary" is useful for its relation to keys (primary), perhaps:

Secondary sorting

Is a better description?

jangorecki commented 8 years ago

but already the index word has been used, it looks nicer than secondary sorting :)

MichaelChirico commented 8 years ago

So you would just name it "auto indexing"? IMO "secondary sorting and auto indexing" feels more informative

jangorecki commented 8 years ago

auto can be somewhat misleading, as indexes should work both for automatically created indexes and for manually created ones - #1422 addresses the current limitation in that matter.

MichaelChirico commented 8 years ago

I see. I'm still missing your preferred alternative -- just "Indices"?

jangorecki commented 8 years ago

not perfect but preferred over secondary indices

markdanese commented 8 years ago

I like this latest vignette a lot. My only thought was that it might be helpful to mention what types of operations cause the index to be dropped. From my testing, it seems pretty much anything that changes the number of rows, or any operation involving the indexed column.

I thought the examples of "on" were really helpful.

arunsrinivasan commented 8 years ago

@markdanese good point, will add.

pakom commented 7 years ago

Thank you for the updated vignettes with the release of v1.9.8. The "Reference semantics" refers to the copy() function and its new capabilities to make shallow copies (especially inside functions, something that I am really interested in):

"However we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch up on this again in the data.table design vignette."

But the design vignette is missing and the link points to an old issue. The reference manual does not provide more information on copy() than the one provided in the vignette. The rest of the vignettes do not provide any information on copy.

Will this vignette become available soon?
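
For context, the behaviour that `copy()` currently guards against inside functions can be sketched as follows. Both helper functions are hypothetical, and this shows only the existing deep-copy approach, not the shallow-copy feature quoted above.

```r
library(data.table)

# Hypothetical helper that updates its argument by reference
add_flag_inplace <- function(dt) {
  dt[, flag := TRUE]
  invisible(dt)
}

# Hypothetical helper that works on a deep copy instead
add_flag_copy <- function(dt) {
  out <- copy(dt)
  out[, flag := TRUE]
  out
}

DT <- data.table(x = 1:3)

res <- add_flag_copy(DT)
unchanged <- !("flag" %in% names(DT))   # caller's table not modified

add_flag_inplace(DT)
changed <- "flag" %in% names(DT)        # caller's table now has the column
```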

MichaelChirico commented 7 years ago

+1 for internals vignette. I (and I guess a few others) am quite interested in contributing a bit on the C side of things, but am a bit intimidated by the (as it stands) 35k lines of C code... quite the learning curve to 'go it alone' -- an intro to internals could do wonders!

MichaelChirico commented 5 years ago

For Joins vignette:

https://github.com/Rdatatable/data.table/issues/2396

zeomal commented 4 years ago

Wanted to chime in and ask if contributions to the vignette are accepted from non-code contributors (like me). I am particularly interested in contributing to the joins vignette as I had quite a bit of trouble with it initially and was guided to solutions from Arun's answers on Stackoverflow, and I'd like some guidance on how to do so, if allowed.

Henrik-P commented 4 years ago

@arunsrinivasan I see that you have a point for an IDateTime vignette. Perhaps it could be included in the more general vignette suggested by @jangorecki: vignettes: timeseries - ordered observations?

In addition, I am preparing a first draft on some of the topics suggested by jan. Perhaps parts of it may be relevant for a join vignette as well? I'm happy to share if anyone may find it useful.

MichaelChirico commented 4 years ago

@zeomal such a contribution would be highly valuable and much appreciated!

zeomal commented 4 years ago

@MichaelChirico, thank you. @Henrik-P, will your brief on normal joins be comprehensive - i.e. will your focus be more on timeseries? If not, I can start work on it - I haven't used rolling joins yet, so no knowledge there. :)