Vignettes - Githubissues

arunsrinivasan commented 9 years ago

HTML vignette series:

Planned for v1.9.8

[ ] Quick tour of data.table
[x] Keys and fast binary search based subset
[x] Secondary indices and auto indexing
[ ] Joins vignette. a) Joins vs subsets: extending binary search based subsets to joins, plus conditional / non-equi joins, rolling and interval joins. b) by=.EACHI and the join + update feature. c) Document i.col usage as filed in #1038. d) Also cover the performance advantages discussed in #1232.
~~[ ] Cover get() and mget(). E.g., http://stackoverflow.com/q/33785747/559784~~ covered in #4304
[ ] Add the rationale for the on= argument to the FAQ (#1623).
[ ] FAQ 5.3 needs to mention that it's a shallow copy that's done in order to restore over-allocation. Thanks to Jan for linking it in #1729.
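
As a rough illustration of the ground the joins vignette is meant to cover, here is a minimal sketch of a subset-like join, by=.EACHI, and a rolling join. The tables `sales`, `lookup` and `prices` and their columns are invented for this example and do not come from any existing vignette.

```r
library(data.table)

# Hypothetical data, invented for illustration only.
sales  = data.table(id = c("a", "a", "b"), amount = c(1L, 2L, 3L))
lookup = data.table(id = c("a", "b", "c"))

# Join as an extension of subsetting: for each row of `lookup`,
# return the matching rows of `sales` (unmatched ids give NA).
joined = sales[lookup, on = "id"]

# by = .EACHI evaluates j once for each row of i (here, once per id in `lookup`).
totals = sales[lookup, on = "id", .(total = sum(amount)), by = .EACHI]

# Rolling join: with roll = TRUE the last observation at or before
# the requested time is carried forward.
prices = data.table(time = c(1, 5, 10), price = c(100, 105, 110))
rolled = prices[.(time = 7), on = "time", roll = TRUE]
```

Here `joined` has one row per match plus one NA row for the unmatched id "c", and `rolled` picks up the price recorded at time 5.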

Future releases

[ ] data.table internals, performance aspects and expressiveness
[ ] Reading multiple files (fread + rbindlist), ordering, ranking and set operations
[ ] IDateTime vignette
[ ] Document the difference between data.table() and data.frame() somewhere - relevant issues: #968, #877. Perhaps slightly more in detail in the FAQ.
[ ] coursera FAQ
[ ] Advanced data.table usage:
- [ ] NSE
- [ ] ...
[ ] Timings vignette (moving #520 here to get everything in one place, but not sure if we need it as a vignette since we've the Wiki with benchmarks/timings).
[ ] fread+fwrite vignette, include also Convenience features of fread wiki, also https://github.com/Rdatatable/data.table/issues/2855
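
The "reading multiple files" item could be illustrated along the following lines. This is only a sketch: the temporary directory, file names and columns are made up so that the example is self-contained.

```r
library(data.table)

# Write two small, hypothetical CSV files to a temporary directory.
tmp_dir = tempfile("dt_files_")
dir.create(tmp_dir)
fwrite(data.table(id = 1:2, value = c(0.5, 1.5)), file.path(tmp_dir, "part1.csv"))
fwrite(data.table(id = 3:4, value = c(2.5, 3.5)), file.path(tmp_dir, "part2.csv"))

# Read every file with fread() and stack the results with rbindlist().
# Naming the vector of paths lets idcol record which file each row came from.
files = list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
names(files) = basename(files)
combined = rbindlist(lapply(files, fread), idcol = "source")

# Order by reference, largest value first.
setorder(combined, -value)
```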

Finished:

[x] Introduction to data.table - data.table syntax, general form, subset rows in i, select / do in j and aggregations using by.
[x] Reference Semantics (add/update/delete columns by reference, and see that we can combine with i and by in the same way as before)
[x] Efficient reshaping using data.tables
[x] Link to this answer on SO on by=.EACHI until the vignette is done.

Minor:

[ ] Operations using integer64, and promoting it for large integers.

Notes (to update current vignettes based on feedback): Please let me know if I missed anything.

Introduction to data.table:

[x] order in i.
[x] Explain how to name columns in j while selecting/computing.
[x] Emphasise that keyby is applied after obtaining the result on the computed result, not on the original data.table.
[x] Mention new updates to .SDcols and cols in with=FALSE being able to select columns as colA:colB.
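
The four points above can be sketched in a few lines. The table `DT` and its columns are hypothetical and chosen only to keep the example short.

```r
library(data.table)

# Hypothetical data, invented for illustration only.
DT = data.table(g = c("b", "a", "b", "a"), x = 1:4, y = 5:8, z = 9:12)

# order() in i: sort by g ascending, then x descending.
sorted = DT[order(g, -x)]

# Naming columns in j while computing.
named = DT[, .(total_x = sum(x), mean_y = mean(y))]

# keyby: the grouped result (not the original DT) is sorted and keyed by g.
by_g = DT[, .(total_x = sum(x)), keyby = g]

# Column ranges written as colA:colB in .SDcols and with with = FALSE.
sums   = DT[, lapply(.SD, sum), .SDcols = x:z]
picked = DT[, x:y, with = FALSE]
```

Note that `DT` itself stays unkeyed; only `by_g` carries the key.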
Reference semantics:
[ ] Also explain all other relevant set* functions here.. (setnames, setcolorder etc..)
[ ] Mainly set.
[x] Explain that section 1b) on the := operator only defines the ways to use it; the example there isn't meant to run, as it just shows the two different forms of usage. (Following this comment.)
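
For the set* items, a short sketch of what such a section might show. The table and column names are invented; all three functions modify `DT` in place without copying it.

```r
library(data.table)

# Hypothetical data, invented for illustration only.
DT = data.table(a = 1:3, b = 4:6)

setnames(DT, "a", "x")                       # rename a column by reference
setcolorder(DT, c("b", "x"))                 # reorder columns by reference
set(DT, i = 2L, j = "b", value = 0L)         # assign to row 2 of column b by reference
```

set() is the function form of := and is mainly useful inside loops, where the overhead of repeated `[.data.table` calls would add up.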
Keys and fast binary search based subsets:
[ ] Add an example of subset using integer/double keys.
[ ] Difference in "nomatch" default in binary search based subsets.
[ ] Is replacing NAs with binary search based subsets possible?
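
A minimal sketch of the first two items, using an integer key and a made-up table. It contrasts the default nomatch behaviour of a keyed subset with an ordinary logical subset.

```r
library(data.table)

# Hypothetical data, invented for illustration only.
DT = data.table(id = c(3L, 1L, 2L), score = c(0.3, 0.1, 0.2))
setkey(DT, id)                               # sorts DT by id and marks it as keyed

hit = DT[.(2L)]                              # binary search on the integer key

# Key value with no match: the default (nomatch = NA) returns a row of NAs,
# while nomatch = NULL drops it.
miss_default = DT[.(9L)]
miss_dropped = DT[.(9L), nomatch = NULL]

# A logical subset never returns a row for a non-matching value.
logical_subset = DT[id == 9L]
```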
FAQ (most appropriate here, I think).
[x] Update FAQ with issue on external pointer being NULL when reading an R object from file, for example, using readRDS(). Update this SO post.
[ ] Explain, with an example, over-allocating a data.table using alloc.col(): when to use it (when you need to create multiple columns) and why. Update this SO post.
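
A possible shape for the over-allocation example is sketched below. It assumes alloc.col() is still available in the installed version (newer releases also provide setalloccol() for the same purpose), and the column names are invented.

```r
library(data.table)

# Hypothetical data, invented for illustration only.
DT = data.table(a = 1:3)

# length() is the number of columns in use; truelength() is the number of
# column pointer slots allocated, so the difference is the spare capacity.
slots_before = truelength(DT)

# Reserve room for many more columns up front.
invisible(alloc.col(DT, 2000L))

# Add several columns by reference; each one uses an already allocated slot.
for (nm in paste0("new", 1:5)) set(DT, j = nm, value = 0L)
```

The default over-allocation is controlled by getOption("datatable.alloccol"), so an explicit call should only matter when more columns than that are going to be added.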

Henrik-P commented 4 years ago

@zeomal Hopefully I will be able to upload the first draft soon, so you can have a look at it. In my draft, I provide a simple example of a "normal" join on a single variable, time, where there are non-matching rows. I use nomatch = NA. (maaaybe also a quick example with nomatch = NULL)

My idea was that this simple join could provide a context and a feeling for the problem, which I then treat more thoroughly in the following sections on rolling and non-equi joins et al.

Thanks a lot for your willingness to contribute!

zeomal commented 4 years ago

I have a question on joining by reference that came up while preparing the vignettes. X[Y, new_col := old_col] performs something similar to a traditional left join on X. However, if several rows of Y match the same row of X, only the last (or first?) matching value is retained. Is this explicitly documented somewhere? I tried searching for this back when I first encountered it, but had to fall back on my understanding of updating by reference to explain it. For a reproducible example,

> X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
> Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))
> X[Y, new_col := i.n, on = "a == b"]
   a m new_col
1: 1 a       y
2: 2 b    <NA>
3: 3 c    <NA>

# an ideal left join - expected behaviour per a new user, given below
# not possible because updating row by reference isn't implemented
   a m new_col
1: 1 a       x
2: 1 a       y
3: 2 b    <NA>
4: 3 c    <NA>

This is expected behaviour, but isn't exactly straightforward for a new user. mult does not impact the output either. Any suggestions on how I document this? Add merge as a workaround for a proper left join?
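
One way the difference could be shown in a vignette is sketched here, reusing the X and Y from the example. It is only a suggestion: the left join is written both as Y[X] and with merge(), and mult = "first" is one option for picking the first match explicitly before assigning.

```r
library(data.table)

X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))

# Left join on X that keeps every match: a new table is returned, so a row
# of X can appear more than once.
left       = Y[X, on = .(b = a)]
left_merge = merge(X, Y, by.x = "a", by.y = "b", all.x = TRUE)

# Taking the first match explicitly, as a plain vector aligned with X.
first_match = Y[X, on = .(b = a), mult = "first", n]

# Update join: X keeps its three rows, so when several rows of Y match the
# same row of X, later assignments overwrite earlier ones.
X[Y, new_col := i.n, on = .(a = b)]
```

An update by reference cannot add rows to X, which is why only one value per row survives.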

jangorecki commented 4 years ago

@zeomal please post future questions about the join vignette in issue #2181 instead. It seems to be a better place. It is documented in set.

Henrik-P commented 4 years ago

@zeomal If you wish to check how brief my treatment of normal (equi) joins is, I just want to let you know that I posted a PR on a timeseries vignette.

kjytay commented 3 years ago

Minor typo in vignettes/datatable-reshape.Rmd lines 113 and 129: DT.m should be replaced with DT.m1.

MichaelChirico commented 3 years ago

@kjytay could you please file a PR fixing that? You should be able to do so in the GitHub UI, so it should be pretty quick.

kjytay commented 3 years ago

Ok done!

Rdatatable / data.table

Vignettes #944
