Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

TODO items from tests.Rraw #2572

Closed: MichaelChirico closed 6 months ago

MichaelChirico commented 6 years ago

Items cut out from the end of tests.Rraw when this issue was first created:


TODO items sprinkled throughout the rest of the file:


TO DO items sprinkled throughout the rest of the file:

MichaelChirico commented 6 years ago

My idea for testing:

use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

is to run:

t1 = system.time({
    DT = data.table(a = 1:10)
    # := on an existing data.table dispatches through [.data.table, so includes its overhead
    for (jj in paste0(seq_len(100))) DT[ , (jj) := 10:1]
})

t2 = system.time({
    DT = data.table(a = 1:10)
    # set() bypasses [.data.table, so this is the low-overhead baseline
    for (jj in paste0(seq_len(100))) set(DT, , jj, 10:1)
})

ratio = t1['elapsed']/t2['elapsed']

Then error if ratio is outside (1, R]. On my machine right now, the ratio is about 8; I guess setting R to 10 would be too tight/potentially machine-dependent. How about choosing R = 50? 25?

Could also make the loop more expensive (the absolute time of t1 is currently about 0.025s on my machine -- very fast, but I'm not sure what the right run time is for the test suite).
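
A minimal sketch of the assertion itself, continuing from the snippet above (R_upper = 50 is just the placeholder value floated above, not a recommendation):

R_upper = 50L  # placeholder upper bound, per the discussion above
if (!(ratio > 1 && ratio <= R_upper))
  stop(sprintf(":=/set overhead ratio %.1f is outside (1, %d]", ratio, R_upper))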

MichaelChirico commented 6 years ago

@mattdowle the following doesn't appear to be true anymore? unless I missed something about what's intended...

test that ordered subsets when i is unkeyed now retain x's key (using is.sorted(f__))

DT1 = data.table(ID = 1:4, v = 4:1, key = 'ID')
DT2 = data.table(ID = 3:6, w = c(3, 4, 1, 2))
key(DT1[DT2])
# NULL
key(DT1[DT2, on = 'ID'])
# NULL
DT1[DT2]
#    ID  v w
# 1:  3  2 3
# 2:  4  1 4
# 3:  5 NA 1
# 4:  6 NA 2

FWIW, I see that the line is.sorted(f__) has been commented out.


Also, I wasn't able to track down any SO question from any March 14 mentioning eval or with that involves you as a user (user:403310):

add FAQ that eval() is evaled in calling frame so don't need a, then update SO question of 14 March. See the test using variable name same as column name. Actually, is that true? Need "..J".

MichaelChirico commented 6 years ago

@st-pasha you're more familiar with the fread test suite -- is this covered by now?

Test mid read bump of logical T/F to character, collapse back to T and F.

st-pasha commented 6 years ago

I don't remember if there is such a test specifically, but it's easy to add:

DT = data.table(A=rep("True", 2200), B="FALSE")
DT[111, A:="here be"]
DT[111, B:="dragons"]
fwrite(DT, f<-tempfile())
data.table:::test(9999, fread(f, verbose=TRUE), DT,
                  output="Column 2 (\"B\") bumped from 'bool8' to 'string' due to <<dragons>> on row 110")

MichaelChirico commented 6 years ago

@st-pasha thanks, I wasn't sure if there's an exact row number to pick to ensure re-read.

Tried beefing it up, but fread doesn't take "T"/"F" to be bool?

DT = data.table(A=rep("True", 2200), B="FALSE", C="T", D='0')
DT[111, LETTERS[1:4] := .("here", "be", "dra", "gons")]
fwrite(DT, f<-tempfile())
fread(f, verbose=TRUE)
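
For anyone reproducing this, a quick way (an added aside, not from the original comment) to check which classes fread ended up assigning is to inspect the result directly:

sapply(fread(f), class)
# columns bumped mid-read (or never detected as logical) show up as "character"
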
MichaelChirico commented 6 years ago

@mattdowle this one appears to be obsolete as well?

check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May

grep "named list" R/* comes up blank.

MichaelChirico commented 6 years ago

re:

tests on double in add hoc by

I am making a test along the lines of:

DT = data.table(a = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), v = 1:12,
                jig = 10^(289:300))
DT[ , sum(v), by = .(a_jiggle = a + jig * .Machine$double.xmin)]
#     a_jiggle V1
#  1:        1  6
#  2:        1  4
#  3:        1  5
#  4:        1  6
#  5:        2  7
#  6:        2  8
#  7:        2  9
#  8:        2 10
#  9:        2 11
# 10:        2 12

I seem to recall it being documented somewhere exactly what the internal limits of accuracy are and how to query this for data.table. On my machine, it appears that disturbances smaller than roughly 1e-17 don't make a difference; I'm not sure of a robust/machine-independent way to design this test, however.
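
Presumably the documentation being recalled is ?setNumericRounding (getNumericRounding() queries the current setting). A rough, admittedly machine-dependent probe of where a perturbation stops splitting the group could look like this (the exponent grid is arbitrary):

library(data.table)
getNumericRounding()  # 0 by default in recent versions, i.e. no rounding applied before grouping
for (p in 12:18) {
  eps = 10^-p
  n = nrow(data.table(a = rep(1, 2))[ , .N, by = .(a_jig = a + c(0, eps))])
  cat("eps = 1e-", p, ": ", n, " group(s)\n", sep = "")
}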

MichaelChirico commented 5 years ago
  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?

MichaelChirico commented 5 years ago

test and highlight in docs that negatives are fine and fast in forderv (ref R wish #15644)

?forder already mentions the use of - in both setorder and setorderv. If anything, the documentation could be cleaned up -- it's hardly relevant to keep pointing out differences between 1.9.2 and 1.9.4 (i.e. 4.5-5 years ago), so ?setorder could do with some tidying.
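
For the record, the use of - in question (a minimal illustration, not taken from the docs verbatim):

DT = data.table(a = c(2L, 1L, 3L), b = c("x", "y", "z"))
setorder(DT, -a)                  # sort decreasing by a, by reference
setorderv(DT, "a", order = -1L)   # equivalent programmatic form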

MichaelChirico commented 5 years ago

check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May

grep "named list" R/* returns nothing; I think this is obsolete

MichaelChirico commented 5 years ago

tests on double in add hoc by

IIUC we're not really supposed to be testing things like numerical accuracy. Limitations on double sorting are spelled out in the documentation IIRC... and anyway we haven't seen any issues raised on this in a long time. So I think this is covered.

jangorecki commented 5 years ago
  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?

There is benchmark.Rraw or, if not, there is a plan to have one. That will be the proper place for all tests that need to call system.time(). It should now be easier to add, since the summary footer of the Rraw script has been moved to test.data.table().

MichaelChirico commented 5 years ago

FWIW, did the following benchmark of set vs. :=:

library(microbenchmark)  # for get_nanotime()
library(data.table)

set.seed(5419681)

BB = 100L
# full grid of cases: table length x number of new columns x assignment method x repetition
times = CJ(rows = 10^(3:6), cols = c(10, 50, 100, 1000), 
           type = c(':=', 'set'), repetition = 1:BB)

for (ii in seq_len(nrow(times))) {
  gc()  # collect garbage up front so GC pauses don't pollute the timing
  times[ii, duration := switch(type, ':=' = {
    DT = data.table(ID = seq_len(rows))
    t0 = get_nanotime()
    DT[ , paste0('V', 1:cols) := lapply(integer(cols), function(...) 1L)]
    get_nanotime() - t0
  }, 'set' = {
    DT = data.table(ID = seq_len(rows))
    t0 = get_nanotime()
    for (jj in 1:cols) set(DT, , paste0('V', jj), 1L)
    get_nanotime() - t0
  })]
}

times[ , .(ratio = duration[type == ':=']/duration[type == 'set'],
           avg = mean(duration)), keyby = .(rows, cols, repetition)
       ][ , as.list(c(absolute_seconds = mean(avg)/1e9, summary(ratio))), 
          keyby = .(rows, cols)]
#      rows cols absolute_seconds      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#  1: 1e+03   10     0.0003131304 3.0605254 3.8471289 4.0469033 4.1023044 4.3067497  6.915743
#  2: 1e+03   50     0.0008448426 1.0162382 1.3855990 1.4398864 1.5730537 1.5222325 11.719836
#  3: 1e+03  100     0.0014121146 0.3493482 1.1065934 1.1422921 1.1502817 1.2021676  1.431746
#  4: 1e+03 1000     0.0182234995 0.2997860 0.4154354 0.4314394 0.4375887 0.4445711  0.769559
#  5: 1e+04   10     0.0006716579 1.5684996 1.8859146 1.9747949 1.9763925 2.0480862  2.625070
#  6: 1e+04   50     0.0024221906 0.7199252 1.0999812 1.1368900 1.1405756 1.1763198  2.045857
#  7: 1e+04  100     0.0049328084 0.4010431 1.0242058 1.0517822 1.0688827 1.0850502  2.420760
#  8: 1e+04 1000     0.0540806265 0.5498318 0.7804711 0.8012923 0.8062797 0.8221343  1.134574
#  9: 1e+05   10     0.0038287573 0.8844346 1.1154718 1.1282488 1.1298522 1.1366939  2.021127
# 10: 1e+05   50     0.0280899392 0.7681287 0.9787483 0.9973578 0.9989867 1.0211156  1.274340
# 11: 1e+05  100     0.0600740518 0.5545713 0.9471064 0.9744283 0.9680503 0.9994554  1.086229
# 12: 1e+05 1000     0.4934547127 0.8932713 0.9409418 0.9509392 0.9609168 0.9675374  1.573699
# 13: 1e+06   10     0.0358802545 0.7430620 0.9720593 0.9927718 0.9938076 1.0076720  1.534889
# 14: 1e+06   50     0.2703836096 0.7398572 0.9533737 0.9785185 0.9942344 1.0259876  1.501244
# 15: 1e+06  100     0.4599221823 0.9430767 0.9883373 1.0063509 1.0110553 1.0279786  1.265928
# 16: 1e+06 1000     3.9212144387 0.8491473 0.9790677 0.9865773 0.9899165 0.9928261  1.220000
cat('\nTotal Time:', times[ , sum(duration)/1e9/60], 'minutes\n')
# Total Time: 17.8525 minutes

MichaelChirico commented 5 years ago

Add to warning about a previous copy that class<-, levels<- can also copy whole vector. Any fun<- form basically.

Not sure where this is intended to go; axing it barring that.
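
For reference, the claim in the quoted TODO can be observed with tracemem(); this is only an illustration of how one would check it, not a statement about where the warning should live:

x = runif(1e6)
tracemem(x)
class(x) <- "myclass"  # if the fun<- form copies the whole vector (as the TODO says), tracemem() prints a trace here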

jangorecki commented 5 years ago

why assign rnorm? it costs. Good to call gc before each timing starts.

MichaelChirico commented 5 years ago

why assign rnorm? it costs

best to just assign TRUE?

will do re:gc

MichaelChirico commented 5 years ago

Updated the set/:= benchmark above; it's 20x as fast when not using rnorm

jangorecki commented 5 years ago

The results presented above are difficult to interpret. I would find it useful to melt the results into something like:

rows, cols, first_set, first_:=, mean_set, mean_:=, median_set, median_:=, relative_first, relative_mean, relative_median
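
One possible way to get roughly that shape out of the times table above (a sketch only; the summary and relative column names are illustrative):

summ = times[ , .(first = duration[1L], mean = mean(duration), median = median(duration)),
              by = .(rows, cols, type)]
wide = dcast(summ, rows + cols ~ type, value.var = c("first", "mean", "median"))
wide[ , `:=`(relative_first  = `first_:=`  / first_set,
             relative_mean   = `mean_:=`   / mean_set,
             relative_median = `median_:=` / median_set)]
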
MichaelChirico commented 5 years ago

!make sure explicitly that unnamed lists are being executed by dogroups!

honestly no idea what this is about. IINM the only entryway to dogroups from R is when computing by group without GForce.

Pretty trivial to see that there's plenty of coverage of this in tests already (e.g. some places in #3406 where optimization is turned off, among probably hundreds of others).

An alternative interpretation would be that .ok should be FALSE if j is a named list? Though this is obviously not the case (maybe it was at some point?)

Gonna mark this one done

MichaelChirico commented 5 years ago

tests of freading classes like Date and the verbose messages there.

Marking this complete within this issue since it's covered in a separate PR/issue already

MichaelChirico commented 5 years ago

confirm with Matt the behavior of rbindlist w.r.t. naming of input in edge cases -- rbindlist should look for the first non-empty data.table - New changes (from Arun). Explanation below: Even if data.table is empty, as long as there are column names, they should be considered. Ex: What if all data.tables are empty? What'll be the column name then? If there are no names, then the first non-empty set of names will be allocated. I think this is the way to do it...

@mattdowle any feedback here?

MichaelChirico commented 5 years ago
  • "Ragged" files that should have fill = TRUE generate warnings, not errors; as such, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough.

@mattdowle ditto here... could you split off into a new issue if this is still a concern?

MichaelChirico commented 5 years ago

# TO DO: TODO: think of and add more tests for rbindlist

This is quite vague and 5 years old... closing as a duplicate of the 40+ outstanding rbindlist issues

MichaelChirico commented 5 years ago

allow.cartesian is ignored if jsub[1L] has :=; could still warn if i has duplicates

AFAICT this is subsumed by #3077; closing here

MichaelChirico commented 5 years ago

Add comment=="#". Ensure only after last field is observed.

Presumably this is just #856; closing here

MichaelChirico commented 5 years ago

Add integer64 support for transpose

This is a full issue; filed as #3408 and closing here

MichaelChirico commented 5 years ago

Add a test when fill=FALSE, but blank.lines.skip=TRUE, when the same effect should happen

Unclear what "same effect" refers to; I see tests 1584.x that seem to cover this case? @arunsrinivasan git blame points to you here (admittedly from 3+ years ago) -- any idea?

MichaelChirico commented 5 years ago

Fix and add test for cases like this: a,b,c\n1,2,3\n4,5,6\n7,8,9,6\n1,2,3\n... -- extra column in line 4, but we've only detected 3

This seems to be handled already:

s = "a,b,c
1,2,3
4,5,6
7,8,9,6
1,2,3"

fread(s)
#        a     b     c
#    <int> <int> <int>
# 1:     1     2     3
# 2:     4     5     6

Warning message: In fread(s) : Stopped early on line 4. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<7,8,9,6>>

fread(s, fill=TRUE)
#        a     b     c    V4
#    <int> <int> <int> <int>
# 1:     1     2     3    NA
# 2:     4     5     6    NA
# 3:     7     8     9     6
# 4:     1     2     3    NA

MichaelChirico commented 5 years ago

Add tests for nomatch=NA in non-equi-join suite (tests 1640-ish)... tested, but takes quite some time... so commented for now

@jangorecki this set of tests could be added to the GitLab suite?

Is the TODO covered by the commented tests?

# TODO: add tests for nomatch=NA..
# tested, but takes quite some time.. so commenting for now
# nqjoin_test(x, y, 3L,1643.0)
# nqjoin_test(dt1,dt2,3L,1652.0)

# nqjoin_test(  x,dt2,1L,1644.0) # without NA only in x
# nqjoin_test(  x,dt2,2L,1645.0)
# nqjoin_test(  x,dt2,3L,1646.0)
# nqjoin_test(dt1,  y,1L,1647.0) # without NA only in i
# nqjoin_test(dt1,  y,2L,1648.0)
# nqjoin_test(dt1,  y,3L,1649.0)
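
For context (nqjoin_test is a helper presumably defined in tests.Rraw, so not shown), a minimal non-equi join illustrating what nomatch=NA does, with toy data unrelated to the objects above:

A = data.table(id = 1:3, lo = c(1L, 5L, 10L), flag = c("a", "b", "c"))
B = data.table(id = 1:3, val = c(2L, 6L, 4L))
A[B, on = .(id, lo <= val)]                  # nomatch=NA is the default: the unmatched row (id 3) is kept with flag = NA
A[B, on = .(id, lo <= val), nomatch = NULL]  # unmatched rows are dropped instead
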
MichaelChirico commented 5 years ago

0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64

As stated, this is just #2398

MichaelChirico commented 5 years ago

Add strict.numeric (default FALSE) to all.equal.data.table()?

@jangorecki can you revisit this and file it as a separate issue if it's still viable? But I guess not, because of the difficulty of testing identical() for numerics across platforms? Or we could include it but discourage it ("use at your own risk")?
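
For illustration of the tolerance concern (base R, nothing data.table-specific): strict comparison trips over last-bit differences that all.equal() tolerates.

x = 0.1 + 0.2
identical(x, 0.3)          # FALSE: x differs from 0.3 in the last bits
isTRUE(all.equal(x, 0.3))  # TRUE: within the default tolerance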

MichaelChirico commented 5 years ago

Begin to deprecate logicalAsInt argument

This dates to 1.11.0 (May 2018); can we begin the deprecation process with the next release?

MichaelChirico commented 5 years ago

#1416 TODO (trailing tabs on most but not at the beginning and a "-" intended to mean missing but taken as text column name)

As stated, this is just #1416

MichaelChirico commented 5 years ago

Begin deprecation of ..x as valid column names

This is also from 1.11.0 (May 2018). But see #3321, which makes things more complicated...

MichaelChirico commented 6 months ago

Same as #678 -- better to handle this with a linter