Open iago-pssjd opened 4 months ago
The inconsistency between using table(x) with by and without by is strange, I agree. But in data.table I think the better way of doing contingency tables is via:
> DT[, .(count=.N), by=.(x,w)]
x w count
<char> <num> <int>
1: b 1 3
2: a 1 2
3: a 2 1
4: c 2 3
> dcast(DT, x + w ~ ., length)
Key: <x, w>
x w .
<char> <num> <int>
1: a 1 2
2: a 2 1
3: b 1 3
4: c 2 3
> dcast(DT, w ~ x, length)
Key: <w>
w a b c
<num> <int> <int> <int>
1: 1 2 3 0
2: 2 1 0 3
I find the example a bit confusing: what is DT[, x, by=w] meant to be? Why wouldn't we just do table(DT$x, DT$w) or DT[, table(w, x)]?
Perhaps you just want to coerce the table output to data.table?
DT[, as.data.table(table(x)), by=w]
# w x N
# <num> <char> <int>
# 1: 1 a 2
# 2: 1 b 3
# 3: 2 a 1
# 4: 2 c 3
To do this "automatically", we'd have to do something like this: for each entry in the output of j, check if an as.data.table() or as.data.frame() method exists. My instinct is to let the user decide whether to do such coercion; regardless, it will almost surely be a breaking change to do this on the user's behalf starting now.
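A minimal sketch of what such a per-entry check might look like, assuming the method lookup is done with utils::getS3method(). The helper name has_dt_method and the class name no_such_class_xyz are hypothetical; this only illustrates the idea and is not something data.table does today.

```r
library(data.table)

# Hypothetical helper: TRUE if any class of `obj` has a registered
# class-specific as.data.table() S3 method.
has_dt_method <- function(obj) {
  any(vapply(
    class(obj),
    function(cl) !is.null(getS3method("as.data.table", cl, optional = TRUE)),
    logical(1)
  ))
}

# A "table" object has a method, as.data.table.table().
tab_ok <- has_dt_method(table(c("a", "b", "a")))

# An object of a made-up class has no class-specific method.
other_ok <- has_dt_method(structure(list(), class = "no_such_class_xyz"))
```

Even with such a check in place, deciding when to apply the coercion would still change the result of existing code, which is the breaking-change concern raised above.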
As Toby points out, dcast() can also help fill out the "missing" entries here.
On second thought I agree with Michael, the consistency issue is actually not that surprising.
My mental model is that when there is no by, we just run the code in j and return the result, but when there is by, we do something different: run the code in j for each value of the by columns, store each result in a list, then call rbindlist to obtain the final result.
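This mental model can be emulated by hand. The sketch below is only an illustration of the model, not a description of data.table internals, and the contents of DT are an assumption chosen to be consistent with the counts printed earlier in this thread.

```r
library(data.table)

# Hypothetical data, consistent with the counts shown in this thread.
DT <- data.table(
  x = c("b", "b", "b", "a", "a", "a", "c", "c", "c"),
  w = c(1, 1, 1, 1, 1, 2, 2, 2, 2)
)

# Emulate by = w by hand: split by group, evaluate j on each piece,
# coerce each result to a data.table, then stack the pieces.
groups    <- split(DT, by = "w")
per_group <- lapply(groups, function(g) as.data.table(g[, table(x)]))
manual    <- rbindlist(per_group, idcol = "w")  # idcol comes back as character

# The built-in grouped version, with the coercion done explicitly in j.
builtin <- DT[, as.data.table(table(x)), by = w]
```

In this emulation the x column survives only because each per-group table is coerced with as.data.table() before stacking.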
OK! Such objections are fine, and we can close this issue, but on the last comment by @tdhock:
"but when there is by then we do something different: run the code in j for each value of by columns, store each result in a list, then call rbindlist to obtain final result."
If that is the case, with the code in j run for each value, then consider that without by data.table returns
> DT[, table(x)]
x
a b c
3 3 3
# not a data.table, but
> DT[, .(table(x))]
x N
<char> <int>
1: a 3
2: b 3
3: c 3
> as.data.table(DT[, table(x)])
x N
<char> <int>
1: a 3
2: b 3
3: c 3
so calling rbindlist should preserve column x, shouldn't it?
DT[, .(table(x))] is running as.data.table.list(), which in turn invokes as.data.table.table().
In DT[, table(x)], j is not a list, so as.data.table.list() does not get invoked. So I would add a bit of nuance to Toby's mental model:
"when there is no by and j is not a list, then we just run the code in j and return the result, but when there is by or j is a list, then we do something different: run the code in j for each value of by columns, store each result in a list (if not already done), apply as.data.table to the list, then call rbindlist to obtain final result."
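The no-by half of this refined model can be checked directly. This is a small sketch, assuming a DT whose contents are hypothetical but consistent with the counts printed earlier in the thread.

```r
library(data.table)

# Hypothetical data, consistent with the counts shown in this thread.
DT <- data.table(
  x = c("b", "b", "b", "a", "a", "a", "c", "c", "c"),
  w = c(1, 1, 1, 1, 1, 2, 2, 2, 2)
)

# j is not a list and there is no by: the result of j is returned as-is.
plain <- DT[, table(x)]

# j is a list (via .()): the result is coerced to a data.table.
wrapped <- DT[, .(table(x))]

class(plain)            # a base R "table"
is.data.table(wrapped)
```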
See next MRE:
table with by
table without by = ...
Same (except for .(...)) with summary:
Ideally
(well, maybe rows 3 and 5 would only make sense if x is a factor) and similar for DT[, .(summary(v)), by = w] with Min, ...