Closed MichaelChirico closed 6 months ago
My idea for testing:
use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.
is to run:
t1 = system.time({
DT = data.table(a = 1:10)
for (jj in paste0(seq_len(100))) DT[ , (jj) := 10:1]
})
t2 = system.time({
DT = data.table(a = 1:10)
for (jj in paste0(seq_len(100))) set(DT, , jj, 10:1)
})
ratio = t1['elapsed']/t2['elapsed']
Then error if ratio
is outside (1, R]
. On my machine right now, the ratio is about 8; I guess setting R
to 10 would be too tight/potentially machine-dependent. How about choosing R = 50
? 25
?
Could also make the loop more expensive (absolute time of t1
is currently about .025 on my machine -- very fast, but not sure what the right run time is for the test suite)
@mattdowle the following doesn't appear to be true anymore? unless I missed something about what's intended...
test that ordered subsets when
i
is unkeyed now retainx
's key (usingis.sorted(f__)
)
DT1 = data.table(ID = 1:4, v = 4:1, key = 'ID')
DT2 = data.table(ID = 3:6, w = c(3, 4, 1, 2))
key(DT1[DT2])
# NULL
key(DT1[DT2, on = 'ID'])
# NULL
DT1[DT2]
# ID v w
# 1: 3 2 3
# 2: 4 1 4
# 3: 5 NA 1
# 4: 6 NA 2
FWIW, I see that the line is.sorted(f__)
has been commented out.
Also, I wasn't able to track down any question on SO on any March 14 mentioning eval
or with you as a user (user:403310
):
add FAQ that
eval()
iseval
ed in calling frame so don't need a, then update SO question of 14 March. See the test using variable name same as column name. Actually, is that true? Need "..J".
@st-pasha you're more familiar with the fread
test suite, is this covered by now?
Test mid read bump of logical T/F to character, collapse back to T and F.
I don't remember if there is such a test specifically, but it's easy to add:
DT = data.table(A=rep("True", 2200), B="FALSE")
DT[111, A:="here be"]
DT[111, B:="dragons"]
fwrite(DT, f<-tempfile())
data.table:::test(9999, fread(f, verbose=TRUE), DT, output="Column 2 (\"B\") bumped from 'bool8' to 'string' due to <<dragons>> on row 110")
@st-pasha thanks, I wasn't sure if there's an exact row number to pick to ensure re-read.
Tried beefing it up, but fread
doesn't take "T"
"F"
to be bool
?
DT = data.table(A=rep("True", 2200), B="FALSE", C="T", D='0')
DT[111, LETTERS[1:4] := .("here", "be", "dra", "gons")]
fwrite(DT, f<-tempfile())
fread(f, verbose=TRUE)
@mattdowle this one appears to be obsolete as well?
check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May
grep "named list" R/*
comes up blank.
re:
tests on double in add hoc by
I am making a test along the lines of:
DT = data.table(a = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), v = 1:12,
jig = 10^(289:300))
DT[ , sum(v), by = .(a_jiggle = a + jig * .Machine$double.xmin)]
# a_jiggle V1
# 1: 1 6
# 2: 1 4
# 3: 1 5
# 4: 1 6
# 5: 2 7
# 6: 2 8
# 7: 2 9
# 8: 2 10
# 9: 2 11
# 10: 2 12
I seem to recall it being documented somewhere what exactly the internal limits of accuracy are/how to query this for data.table
; on my machine, it appears that roughly disturbances less than 1e-17
don't make a difference... not sure the robust/machine-independent way to design this test, however.
- use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.
@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?
test and highlight in docs that negatives are fine and fast in forderv (ref R wish #15644)
?forder
already mentions the use of -
in both setorder
and setorderv
. If anything the documentation can be cleaned up -- hardly relevant to be pointing out differences of 1.9.2 vs 1.9.4 (i.e. 4.5-5 years ago)... that set ?setorder
could go for a tidying
check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May
grep "named list" R/*
returns nothing; I think this is obsolete
tests on double in add hoc by
IIUC we're not really supposed to be testing things like numerical accuracy. Limitations on double sorting are spelled out in the documentation IIRC... and anyway we haven't seen any issues raised on this in a long time. So I think this is covered.
- use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.
@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?
There is benchmark.Rraw, or if there is no, there is a plan to have one. This will be proper place for all tests that needs to call system.time()
. It should be now easier to add it as summary footer of Rraw script has been moved to test.data.table()
.
FWIW, did the following benchmark of set
vs. :=
:
library(microbenchmark)
library(data.table)
set.seed(5419681)
BB = 100L
times = CJ(rows = 10^(3:6), cols = c(10, 50, 100, 1000),
type = c(':=', 'set'), repetition = 1:BB)
for (ii in seq_len(nrow(times))) {
gc()
times[ii, duration := switch(type, ':=' = {
DT = data.table(ID = seq_len(rows))
t0 = get_nanotime()
DT[ , paste0('V', 1:cols) := lapply(integer(cols), function(...) 1L)]
get_nanotime() - t0
}, 'set' = {
DT = data.table(ID = seq_len(rows))
t0 = get_nanotime()
for (jj in 1:cols) set(DT, , paste0('V', jj), 1L)
get_nanotime() - t0
})]
}
times[ , .(ratio = duration[type == ':=']/duration[type == 'set'],
avg = mean(duration)), keyby = .(rows, cols, repetition)
][ , as.list(c(absolute_seconds = mean(avg)/1e9, summary(ratio))),
keyby = .(rows, cols)]
# rows cols absolute_seconds Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1: 1e+03 10 0.0003131304 3.0605254 3.8471289 4.0469033 4.1023044 4.3067497 6.915743
# 2: 1e+03 50 0.0008448426 1.0162382 1.3855990 1.4398864 1.5730537 1.5222325 11.719836
# 3: 1e+03 100 0.0014121146 0.3493482 1.1065934 1.1422921 1.1502817 1.2021676 1.431746
# 4: 1e+03 1000 0.0182234995 0.2997860 0.4154354 0.4314394 0.4375887 0.4445711 0.769559
# 5: 1e+04 10 0.0006716579 1.5684996 1.8859146 1.9747949 1.9763925 2.0480862 2.625070
# 6: 1e+04 50 0.0024221906 0.7199252 1.0999812 1.1368900 1.1405756 1.1763198 2.045857
# 7: 1e+04 100 0.0049328084 0.4010431 1.0242058 1.0517822 1.0688827 1.0850502 2.420760
# 8: 1e+04 1000 0.0540806265 0.5498318 0.7804711 0.8012923 0.8062797 0.8221343 1.134574
# 9: 1e+05 10 0.0038287573 0.8844346 1.1154718 1.1282488 1.1298522 1.1366939 2.021127
# 10: 1e+05 50 0.0280899392 0.7681287 0.9787483 0.9973578 0.9989867 1.0211156 1.274340
# 11: 1e+05 100 0.0600740518 0.5545713 0.9471064 0.9744283 0.9680503 0.9994554 1.086229
# 12: 1e+05 1000 0.4934547127 0.8932713 0.9409418 0.9509392 0.9609168 0.9675374 1.573699
# 13: 1e+06 10 0.0358802545 0.7430620 0.9720593 0.9927718 0.9938076 1.0076720 1.534889
# 14: 1e+06 50 0.2703836096 0.7398572 0.9533737 0.9785185 0.9942344 1.0259876 1.501244
# 15: 1e+06 100 0.4599221823 0.9430767 0.9883373 1.0063509 1.0110553 1.0279786 1.265928
# 16: 1e+06 1000 3.9212144387 0.8491473 0.9790677 0.9865773 0.9899165 0.9928261 1.220000
cat('\nTotal Time:', times[ , sum(duration)/1e9/60], 'minutes\n')
# Total Time: 17.8525 minutes
Add to warning about a previous copy that class<-, levels<- can also copy whole vector. Any fun<- form basically.
Not sure where it is the intention to put this. Axing barring that.
why assign rnorm
? it costs
good to call gc
before each timing start.
why assign
rnorm
? it costs
best to just assign TRUE
?
will do re:gc
updated set/:= benchmark, 20x as fast when not using rnorm
Results presented above are difficult to interpret. I would find that useful to melt results into something like:
rows, cols, first_set, first_:=, mean_set, mean_:=, median_set, median_:=, relative_first, relative_mean, relative_median
!make sure explicitly that unnamed lists are being executed by dogroups!
honestly no idea what this is about. IINM dogroups
only entryway in R is when computing groups and !GForce
here.
Pretty trivial to see that there's plenty of coverage of this in tests already (e.g. some places in #3406 where optimization is turned off, among probably hundreds of others).
Alternative interpretation would be that .ok
should be FALSE
if j
is a named list? Though this is obviously not the case (maybe it was at some point?)
Gonna mark this one done
tests of freading classes like Date and the verbose messages there.
Marking this complete within this issue since it's covered in a separate PR/issue already
confirm with Matt the behavior of
rbindlist
w.r.t. naming of input in edge cases -- rbindlist should look for the first non-empty data.table - New changes (from Arun). Explanation below: Even if data.table is empty, as long as there are column names, they should be considered. Ex: What if all data.tables are empty? What'll be the column name then? If there are no names, then the first non-empty set of names will be allocated. I think this is the way to do it...
@mattdowle any feedback here?
- "Ragged" files that should have
fill = TRUE
generate warnings, not errors; as such, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough.
@mattdowle ditto here... could you split off into a new issue if this is still a concern?
# TO DO: TODO: think of and add more tests for rbindlist
This is quite vague and 5 years old... closing as duplicate of the 40+ rbindlist
issues outstnding
allow.cartesian
is ignored ifjsub[1L]
has:=
; could still warn ifi
has duplicates
AFAICT this is subsumed by #3077; closing here
Add
comment=="#"
. Ensure only after last field is observed.
Presumably this is just #856; closing here
Add
integer64
support fortranspose
This is a full issue; filed as #3408 and closing here
Add a test when
fill=FALSE
, butblank.lines.skip=TRUE
, when the same effect should happen
Unclear what "same effect" is referring to; I see tests 1584.x that seem to cover this case? @arunsrinivasan this is under you per git blame
(admittedly 3+ years ago), any idea?
Fix and add test for cases like this:
a,b,c\n1,2,3\n4,5,6\n7,8,9,6\n1,2,3\n...
-- extra column in line 4, but we've only detected 3
This seems to be handled already:
s = "a,b,c
1,2,3
4,5,6
7,8,9,6
1,2,3"
fread(s)
# a b c
# <int> <int> <int>
# 1: 1 2 3
# 2: 4 5 6
Warning message: In fread(s) : Stopped early on line 4. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<7,8,9,6>>
fread(s, fill=TRUE)
# a b c V4
# <int> <int> <int> <int>
# 1: 1 2 3 NA
# 2: 4 5 6 NA
# 3: 7 8 9 6
# 4: 1 2 3 NA
Add tests for
nomatch=NA
in non-equi-join suite (tests 1640-ish)... tested, but takes quite some time... so commented for now
@jangorecki this set of tests could be added to the GitLab suite?
Is the TODO
covered by the commented tests?
# TODO: add tests for nomatch=NA..
# tested, but takes quite some time.. so commenting for now
# nqjoin_test(x, y, 3L,1643.0)
# nqjoin_test(dt1,dt2,3L,1652.0)
# nqjoin_test( x,dt2,1L,1644.0) # without NA only in x
# nqjoin_test( x,dt2,2L,1645.0)
# nqjoin_test( x,dt2,3L,1646.0)
# nqjoin_test(dt1, y,1L,1647.0) # without NA only in i
# nqjoin_test(dt1, y,2L,1648.0)
# nqjoin_test(dt1, y,3L,1649.0)
0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64
As stated, this is just #2398
Add
strict.numeric
(defaultFALSE
) toall.equal.data.table()
?
@jangorecki can revisit this and file as a separate issue if still viable? But I guess no because of the difficulty in testing identical
for numeric
across platforms? Or could include it but discourage it "use at your own risk"?
Begin to deprecate
logicalAsInt
argument
This was 1.11.0 (May 2018); can begin the deprecation process with next release?
#1416 TODO (trailing tabs on most but not at the beginning and a "-" intended to mean missing but taken as text column name)
As stated, this is just #1416
Begin deprecation of
..x
as valid column names
This is also 1.11.0 (May 2018). But see #3321, making things more complicated...
Same as #678 -- better to handle this with a linter
Items cut out from the end of
tests.Rraw
when this issue was first created:lapply(f,eval,.SD)
?TODO
items sprinkled throughout the rest of the file:rbindlist
w.r.t. naming of input in edge cases -- rbindlist should look for the first non-empty data.table - New changes (from Arun). Explanation below: Even if data.table is empty, as long as there are column names, they should be considered. Ex: What if all data.tables are empty? What'll be the column name then? If there are no names, then the first non-empty set of names will be allocated. I think this is the way to do it...fill = TRUE
generate warnings, not errors; as such, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough.NaN
behavior inforder
(testing internal behavior since can't compare tobase::order
)rbindlist
allow.cartesian
is ignored ifjsub[1L]
has:=
; could still warn ifi
has duplicatescomment=="#"
. Ensure only after last field is observed.integer64
support fortranspose
fill=FALSE
, butblank.lines.skip=TRUE
, when the same effect as in tests 1585 should happena,b,c\n1,2,3\n4,5,6\n7,8,9,6\n1,2,3\n...
-- extra column in line 4, but we've only detected 3nomatch=NA
in non-equi-join suite (tests 1640-ish)... tested, but takes quite some time... so commented for now [split off as #3411]strict.numeric
(defaultFALSE
) toall.equal.data.table()
? [split off as #3410]logicalAsInt
argument..x
as valid column namesTO DO
items sprinkled throughout the rest of the file: