Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

TODO items from tests.Rraw #2572

Closed: MichaelChirico closed 6 months ago

MichaelChirico commented 6 years ago

Items cut out from the end of tests.Rraw when this issue was first created:


TODO items sprinkled throughout the rest of the file:


TO DO items sprinkled throughout the rest of the file:

MichaelChirico commented 6 years ago

My idea for testing:

use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

is to run:

t1 = system.time({
    DT = data.table(a = 1:10)
    # := on an existing data.table dispatches through [.data.table, so includes its overhead
    for (jj in paste0(seq_len(100))) DT[ , (jj) := 10:1]
})

t2 = system.time({
    DT = data.table(a = 1:10)
    # set() bypasses [.data.table, so this is the low-overhead baseline
    for (jj in paste0(seq_len(100))) set(DT, , jj, 10:1)
})

ratio = t1['elapsed']/t2['elapsed']

Then error if ratio is outside (1, R]. On my machine right now, the ratio is about 8; I guess setting R to 10 would be too tight/potentially machine-dependent. How about choosing R = 50? 25?

Could also make the loop more expensive (the absolute time of t1 is currently about 0.025s on my machine -- very fast, but I'm not sure what the right run time is for the test suite).
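
A minimal sketch of the assertion itself, continuing from the snippet above (R_upper = 50 is just the placeholder value floated above, not a recommendation):

R_upper = 50L  # placeholder upper bound, per the discussion above
if (!(ratio > 1 && ratio <= R_upper))
  stop(sprintf(":=/set overhead ratio %.1f is outside (1, %d]", ratio, R_upper))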

MichaelChirico commented 6 years ago

@mattdowle the following doesn't appear to be true anymore? unless I missed something about what's intended...

test that ordered subsets when i is unkeyed now retain x's key (using is.sorted(f__))

DT1 = data.table(ID = 1:4, v = 4:1, key = 'ID')
DT2 = data.table(ID = 3:6, w = c(3, 4, 1, 2))
key(DT1[DT2])
# NULL
key(DT1[DT2, on = 'ID'])
# NULL
DT1[DT2]
#    ID  v w
# 1:  3  2 3
# 2:  4  1 4
# 3:  5 NA 1
# 4:  6 NA 2

FWIW, I see that the line is.sorted(f__) has been commented out.


Also, I wasn't able to track down any SO question from any March 14 mentioning eval or with that involves you as a user (user:403310):

add FAQ that eval() is evaled in calling frame so don't need a, then update SO question of 14 March. See the test using variable name same as column name. Actually, is that true? Need "..J".

MichaelChirico commented 6 years ago

@st-pasha you're more familiar with the fread test suite -- is this covered by now?

Test mid read bump of logical T/F to character, collapse back to T and F.

st-pasha commented 6 years ago

I don't remember if there is such a test specifically, but it's easy to add:

DT = data.table(A=rep("True", 2200), B="FALSE")
DT[111, A:="here be"]
DT[111, B:="dragons"]
fwrite(DT, f<-tempfile())
data.table:::test(9999, fread(f, verbose=TRUE), DT,
                  output="Column 2 (\"B\") bumped from 'bool8' to 'string' due to <<dragons>> on row 110")

MichaelChirico commented 6 years ago

@st-pasha thanks, I wasn't sure if there's an exact row number to pick to ensure re-read.

Tried beefing it up, but fread doesn't take "T"/"F" to be bool?

DT = data.table(A=rep("True", 2200), B="FALSE", C="T", D='0')
DT[111, LETTERS[1:4] := .("here", "be", "dra", "gons")]
fwrite(DT, f<-tempfile())
fread(f, verbose=TRUE)
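
For anyone reproducing this, a quick way (an added aside, not from the original comment) to check which classes fread ended up assigning is to inspect the result directly:

sapply(fread(f), class)
# columns bumped mid-read (or never detected as logical) show up as "character"
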
MichaelChirico commented 6 years ago

@mattdowle this one appears to be obsolete as well?

check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May

grep "named list" R/* comes up blank.

MichaelChirico commented 6 years ago

re:

tests on double in add hoc by

I am making a test along the lines of:

DT = data.table(a = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), v = 1:12,
                jig = 10^(289:300))
DT[ , sum(v), by = .(a_jiggle = a + jig * .Machine$double.xmin)]
#     a_jiggle V1
#  1:        1  6
#  2:        1  4
#  3:        1  5
#  4:        1  6
#  5:        2  7
#  6:        2  8
#  7:        2  9
#  8:        2 10
#  9:        2 11
# 10:        2 12

I seem to recall it being documented somewhere exactly what the internal limits of accuracy are and how to query this for data.table. On my machine, it appears that disturbances smaller than roughly 1e-17 don't make a difference; I'm not sure of a robust/machine-independent way to design this test, however.
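
Presumably the documentation being recalled is ?setNumericRounding (getNumericRounding() queries the current setting). A rough, admittedly machine-dependent probe of where a perturbation stops splitting the group could look like this (the exponent grid is arbitrary):

library(data.table)
getNumericRounding()  # 0 by default in recent versions, i.e. no rounding applied before grouping
for (p in 12:18) {
  eps = 10^-p
  n = nrow(data.table(a = rep(1, 2))[ , .N, by = .(a_jig = a + c(0, eps))])
  cat("eps = 1e-", p, ": ", n, " group(s)\n", sep = "")
}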

MichaelChirico commented 5 years ago
  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?

MichaelChirico commented 5 years ago

test and highlight in docs that negatives are fine and fast in forderv (ref R wish #15644)

?forder already mentions the use of - in both setorder and setorderv. If anything, the documentation could be cleaned up -- it's hardly relevant to keep pointing out differences between 1.9.2 and 1.9.4 (i.e. 4.5-5 years ago), so ?setorder could do with some tidying.
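
For the record, the use of - in question (a minimal illustration, not taken from the docs verbatim):

DT = data.table(a = c(2L, 1L, 3L), b = c("x", "y", "z"))
setorder(DT, -a)                  # sort decreasing by a, by reference
setorderv(DT, "a", order = -1L)   # equivalent programmatic form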

MichaelChirico commented 5 years ago

check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May

grep "named list" R/* returns nothing; I think this is obsolete

MichaelChirico commented 5 years ago

tests on double in add hoc by

IIUC we're not really supposed to be testing things like numerical accuracy. Limitations on double sorting are spelled out in the documentation IIRC... and anyway we haven't seen any issues raised on this in a long time. So I think this is covered.

jangorecki commented 5 years ago
  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?

There is benchmark.Rraw or, if not, there is a plan to have one. That will be the proper place for all tests that need to call system.time(). It should now be easier to add, since the summary footer of the Rraw script has been moved to test.data.table().

MichaelChirico commented 5 years ago

FWIW, did the following benchmark of set vs. :=:

library(microbenchmark)  # for get_nanotime()
library(data.table)

set.seed(5419681)

BB = 100L
# full grid of cases: table length x number of new columns x assignment method x repetition
times = CJ(rows = 10^(3:6), cols = c(10, 50, 100, 1000), 
           type = c(':=', 'set'), repetition = 1:BB)

for (ii in seq_len(nrow(times))) {
  gc()  # collect garbage up front so GC pauses don't pollute the timing
  times[ii, duration := switch(type, ':=' = {
    DT = data.table(ID = seq_len(rows))
    t0 = get_nanotime()
    DT[ , paste0('V', 1:cols) := lapply(integer(cols), function(...) 1L)]
    get_nanotime() - t0
  }, 'set' = {
    DT = data.table(ID = seq_len(rows))
    t0 = get_nanotime()
    for (jj in 1:cols) set(DT, , paste0('V', jj), 1L)
    get_nanotime() - t0
  })]
}

times[ , .(ratio = duration[type == ':=']/duration[type == 'set'],
           avg = mean(duration)), keyby = .(rows, cols, repetition)
       ][ , as.list(c(absolute_seconds = mean(avg)/1e9, summary(ratio))), 
          keyby = .(rows, cols)]
#      rows cols absolute_seconds      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#  1: 1e+03   10     0.0003131304 3.0605254 3.8471289 4.0469033 4.1023044 4.3067497  6.915743
#  2: 1e+03   50     0.0008448426 1.0162382 1.3855990 1.4398864 1.5730537 1.5222325 11.719836
#  3: 1e+03  100     0.0014121146 0.3493482 1.1065934 1.1422921 1.1502817 1.2021676  1.431746
#  4: 1e+03 1000     0.0182234995 0.2997860 0.4154354 0.4314394 0.4375887 0.4445711  0.769559
#  5: 1e+04   10     0.0006716579 1.5684996 1.8859146 1.9747949 1.9763925 2.0480862  2.625070
#  6: 1e+04   50     0.0024221906 0.7199252 1.0999812 1.1368900 1.1405756 1.1763198  2.045857
#  7: 1e+04  100     0.0049328084 0.4010431 1.0242058 1.0517822 1.0688827 1.0850502  2.420760
#  8: 1e+04 1000     0.0540806265 0.5498318 0.7804711 0.8012923 0.8062797 0.8221343  1.134574
#  9: 1e+05   10     0.0038287573 0.8844346 1.1154718 1.1282488 1.1298522 1.1366939  2.021127
# 10: 1e+05   50     0.0280899392 0.7681287 0.9787483 0.9973578 0.9989867 1.0211156  1.274340
# 11: 1e+05  100     0.0600740518 0.5545713 0.9471064 0.9744283 0.9680503 0.9994554  1.086229
# 12: 1e+05 1000     0.4934547127 0.8932713 0.9409418 0.9509392 0.9609168 0.9675374  1.573699
# 13: 1e+06   10     0.0358802545 0.7430620 0.9720593 0.9927718 0.9938076 1.0076720  1.534889
# 14: 1e+06   50     0.2703836096 0.7398572 0.9533737 0.9785185 0.9942344 1.0259876  1.501244
# 15: 1e+06  100     0.4599221823 0.9430767 0.9883373 1.0063509 1.0110553 1.0279786  1.265928
# 16: 1e+06 1000     3.9212144387 0.8491473 0.9790677 0.9865773 0.9899165 0.9928261  1.220000
cat('\nTotal Time:', times[ , sum(duration)/1e9/60], 'minutes\n')
# Total Time: 17.8525 minutes

MichaelChirico commented 5 years ago

Add to warning about a previous copy that class<-, levels<- can also copy whole vector. Any fun<- form basically.

Not sure where this is intended to go; axing it barring that.
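
For reference, the claim in the quoted TODO can be observed with tracemem(); this is only an illustration of how one would check it, not a statement about where the warning should live:

x = runif(1e6)
tracemem(x)
class(x) <- "myclass"  # if the fun<- form copies the whole vector (as the TODO says), tracemem() prints a trace here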

jangorecki commented 5 years ago

why assign rnorm? it costs. Good to call gc before each timing starts.

MichaelChirico commented 5 years ago

why assign rnorm? it costs

best to just assign TRUE?

will do re:gc

MichaelChirico commented 5 years ago

Updated the set/:= benchmark above; it's 20x as fast when not using rnorm

jangorecki commented 5 years ago

The results presented above are difficult to interpret. I would find it useful to melt the results into something like:

rows, cols, first_set, first_:=, mean_set, mean_:=, median_set, median_:=, relative_first, relative_mean, relative_median
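
One possible way to get roughly that shape out of the times table above (a sketch only; the summary and relative column names are illustrative):

summ = times[ , .(first = duration[1L], mean = mean(duration), median = median(duration)),
              by = .(rows, cols, type)]
wide = dcast(summ, rows + cols ~ type, value.var = c("first", "mean", "median"))
wide[ , `:=`(relative_first  = `first_:=`  / first_set,
             relative_mean   = `mean_:=`   / mean_set,
             relative_median = `median_:=` / median_set)]
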
MichaelChirico commented 5 years ago

!make sure explicitly that unnamed lists are being executed by dogroups!

honestly no idea what this is about. IINM the only entryway to dogroups from R is when computing by group without GForce.

Pretty trivial to see that there's plenty of coverage of this in tests already (e.g. some places in #3406 where optimization is turned off, among probably hundreds of others).

An alternative interpretation would be that .ok should be FALSE if j is a named list? Though this is obviously not the case (maybe it was at some point?)

Gonna mark this one done

MichaelChirico commented 5 years ago

tests of freading classes like Date and the verbose messages there.

Marking this complete within this issue since it's covered in a separate PR/issue already

MichaelChirico commented 5 years ago

confirm with Matt the behavior of rbindlist w.r.t. naming of input in edge cases -- rbindlist should look for the first non-empty data.table - New changes (from Arun). Explanation below: Even if data.table is empty, as long as there are column names, they should be considered. Ex: What if all data.tables are empty? What'll be the column name then? If there are no names, then the first non-empty set of names will be allocated. I think this is the way to do it...

@mattdowle any feedback here?

MichaelChirico commented 5 years ago
  • "Ragged" files that should have fill = TRUE generate warnings, not errors; as such, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough.

@mattdowle ditto here... could you split off into a new issue if this is still a concern?

MichaelChirico commented 5 years ago

# TO DO: TODO: think of and add more tests for rbindlist

This is quite vague and 5 years old... closing as a duplicate of the 40+ outstanding rbindlist issues

MichaelChirico commented 5 years ago

allow.cartesian is ignored if jsub[1L] has :=; could still warn if i has duplicates

AFAICT this is subsumed by #3077; closing here

MichaelChirico commented 5 years ago

Add comment=="#". Ensure only after last field is observed.

Presumably this is just #856; closing here

MichaelChirico commented 5 years ago

Add integer64 support for transpose

This is a full issue; filed as #3408 and closing here

MichaelChirico commented 5 years ago

Add a test when fill=FALSE, but blank.lines.skip=TRUE, when the same effect should happen

Unclear what "same effect" refers to; I see tests 1584.x that seem to cover this case? @arunsrinivasan git blame points to you here (admittedly from 3+ years ago) -- any idea?

MichaelChirico commented 5 years ago

Fix and add test for cases like this: a,b,c\n1,2,3\n4,5,6\n7,8,9,6\n1,2,3\n... -- extra column in line 4, but we've only detected 3

This seems to be handled already:

s = "a,b,c
1,2,3
4,5,6
7,8,9,6
1,2,3"

fread(s)
#        a     b     c
#    <int> <int> <int>
# 1:     1     2     3
# 2:     4     5     6

Warning message: In fread(s) : Stopped early on line 4. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<7,8,9,6>>

fread(s, fill=TRUE)
#        a     b     c    V4
#    <int> <int> <int> <int>
# 1:     1     2     3    NA
# 2:     4     5     6    NA
# 3:     7     8     9     6
# 4:     1     2     3    NA

MichaelChirico commented 5 years ago

Add tests for nomatch=NA in non-equi-join suite (tests 1640-ish)... tested, but takes quite some time... so commented for now

@jangorecki this set of tests could be added to the GitLab suite?

Is the TODO covered by the commented tests?

# TODO: add tests for nomatch=NA..
# tested, but takes quite some time.. so commenting for now
# nqjoin_test(x, y, 3L,1643.0)
# nqjoin_test(dt1,dt2,3L,1652.0)

# nqjoin_test(  x,dt2,1L,1644.0) # without NA only in x
# nqjoin_test(  x,dt2,2L,1645.0)
# nqjoin_test(  x,dt2,3L,1646.0)
# nqjoin_test(dt1,  y,1L,1647.0) # without NA only in i
# nqjoin_test(dt1,  y,2L,1648.0)
# nqjoin_test(dt1,  y,3L,1649.0)
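
For context (nqjoin_test is a helper presumably defined in tests.Rraw, so not shown), a minimal non-equi join illustrating what nomatch=NA does, with toy data unrelated to the objects above:

A = data.table(id = 1:3, lo = c(1L, 5L, 10L), flag = c("a", "b", "c"))
B = data.table(id = 1:3, val = c(2L, 6L, 4L))
A[B, on = .(id, lo <= val)]                  # nomatch=NA is the default: the unmatched row (id 3) is kept with flag = NA
A[B, on = .(id, lo <= val), nomatch = NULL]  # unmatched rows are dropped instead
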
MichaelChirico commented 5 years ago

0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64

As stated, this is just #2398

MichaelChirico commented 5 years ago

Add strict.numeric (default FALSE) to all.equal.data.table()?

@jangorecki can you revisit this and file it as a separate issue if it's still viable? But I guess not, because of the difficulty of testing identical() for numerics across platforms? Or we could include it but discourage it ("use at your own risk")?
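
For illustration of the tolerance concern (base R, nothing data.table-specific): strict comparison trips over last-bit differences that all.equal() tolerates.

x = 0.1 + 0.2
identical(x, 0.3)          # FALSE: x differs from 0.3 in the last bits
isTRUE(all.equal(x, 0.3))  # TRUE: within the default tolerance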

MichaelChirico commented 5 years ago

Begin to deprecate logicalAsInt argument

This dates to 1.11.0 (May 2018); can we begin the deprecation process with the next release?

MichaelChirico commented 5 years ago

#1416 TODO (trailing tabs on most but not at the beginning and a "-" intended to mean missing but taken as text column name)

As stated, this is just #1416

MichaelChirico commented 5 years ago

Begin deprecation of ..x as valid column names

This is also from 1.11.0 (May 2018). But see #3321, which makes things more complicated...

MichaelChirico commented 6 months ago

Same as #678 -- better to handle this with a linter