Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 985 forks source link

Don't "double-print" in output= tests #6188

Open MichaelChirico opened 5 months ago

MichaelChirico commented 5 months ago

Whenever output= is used in test(), we force print(x):

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/R/test.data.table.R#L425

We can consider not doing so if x already contains a call to print(), e.g. these 27 cases:

library(lintr)
l <- make_linter_from_xpath(
  "//SYMBOL_FUNCTION_CALL[text()='test']
    /parent::expr
    /following-sibling::expr[expr/SYMBOL_FUNCTION_CALL[text()='print']]
    /following-sibling::SYMBOL_SUB[text()='output']",
  "xxx"
)
lint_dir("inst/tests", l(), pattern="Rraw") |>
  as.data.frame() |>
  _[, "line"] |>
  writeLines()
test(3, print(DT), output=c("   x             y", "1: 1 <ex_class[3]>",     "2: 2 <ex_class[3]>"))
test(779, print(DT[,sum(a),by=b]), output="   b V1\n1: a  3\n2: b  3$")
test(1006, print(as.data.table(M), nrows=10), output="gear NA.*1: 21.0")
    test(1037.103, print(ans[c(1,1326)]), output="state.*income.*retailprice.*percent_15_19.*beercons.*variable.*value.*1.*9.6.*89.34.*smoking88.*smoking00.*41.6")
  test(1037.301, print(melt(DT, id.vars="month", verbose=TRUE)), output="'measure.vars' is missing.*Assigned 'measure.vars' are [[]Record high, Average high, Daily mean, Average low, ...[]].*1:.*Jan.*Record high.*12.8.*108:.*Sep.*sunshine hours.*174.1")
test(1056, print(DT, digits=2), output="   x       y logy\n1: a      18  2.9\n2: a      79  4.4\n3: a     337  5.8")
test(1057, print(DT, digits=2, big.mark=","), output="   x         y logy\n1: a        18  2.9.*6: b    26,556 10.2\n7: c   113,811 11.6")
test(1636.2, print(dt), output="1: 1 <function[1]>")
test(1636.4, print(dt), output="2: 2 <function[1]>")
test(1636.6, print(dt), output="2: 2 <function[1]>")
test(1765.1, print(IDateTime(x)), output=".*idate.*itime.*1: 2017-03-1[67]",
test(1893.1, print(DT), output=txt<-c("   A    B", "1: 1  FOO", "2: 2     ", "3: 3 <NA>", "4: 4   NA"))
test(1893.2, print(DF), output=txt)
test(1893.3, print(fread(txt,na.strings="")), output="A    B\n1: 109   MT\n2:   7    N\n3:  11   NA\n4:  41   NB\n5:  60   ND\n6:   1     \n7:   2 <NA>\n8:   3   NA\n9:   4   NA")
test(2015.4, print(DT), output="V1.*1:[ ]+<NA>.*2:[ ]+1.*101:[ ]+100.*102:[ ]+<NA>")
test(2015.5, print(DT), output="V1.*1:[ ]+<NA>.*2:[ ]+1.*4:[ ]+3.*5:[ ]+<NA>")
test(2089.3, print(DF), output="<multi-column>")
test(2114.4, print(res), output="date.*-0.830")
test(2130.01, print(DT), output=c("   x                 y", "1: 1 <data.table[1x2]>", "2: 2 <data.table[1x2]>"))
test(2130.02, print(DT), output=c("   x         y", "1: 1 <list[2]>",         "2: 2 <list[2]>"))
test(2130.101, print(DT, timezone=TRUE), output='UTC')
test(2130.11, print(data.table(e = expression(1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13))), output = '1 + 2 + 3')
test(2130.13, print(DT), output="myclass2130:foo.*myclass2130:bar")
test(2130.14, print(DT), output="myclass2130:foo.*myclass2130:bar") # because length 1 from format but needs to be length(x)
test(2130.15, print(DT), output="All hail foo")  # e.g. sf:::format.sfc rather than sf:::format.sfg on each item
test(2130.16, print(DT), output="myclass2130:foo.*myclass2130:bar")
test(2222, print(DT), output="A.*3")

Note that cases like print(DT), output=... could be linted into just DT, output=..., but cases like print(DT, ...) with non-default options need explicit print() invocation.

OTOH, maybe it's overkill / won't make the code any easier to understand to pursue this.

Nj221102 commented 5 months ago

Should we remove Print() from the already existing tests or this advice is just for the future use?

MichaelChirico commented 4 months ago

Honestly not sure. I think leaving it in existing tests is fine. This is about if we can avoid running print() "under the hood" in test() when output= is provided and x= is a call to print(). Not sure it will actually improve the implementation at all, just a thought that it's odd to actually wind up running print() twice.

joshhwuu commented 3 months ago

IMO, it would be healthy to only print once -- I vaguely recall having trouble finding small differences when testing with print and output with large tables, as the test failure would "expect" a double print, which certainly looks worse than it really is... A small example I was able to make here:

test(2279.1, print(data.table(a = 1)), output="   a\n1: 2")

Test 2279.1 did not produce correct output:
  Expected: <<   a\n1: 2>>
  Observed: <<   a\n1: 1\n   a\n1: 1>>

It makes things a little harder to read as content gets bigger, requiring the reader to scroll horizontally a bit more.

Anyway, I believe separating not calling print twice when x already contains a call to print should be pretty easy, I'd be happy to write up a quick PR. (Also there may be a slight speed improvement for R CMD Check by calling print less? Although this is a constant and probably wouldn't matter in the big picture, might do some benchmarking)

MichaelChirico commented 3 months ago

Great motivating example. For now, I have #6266 in mind -- even if we fix the existing test scripts, it will be easy for them to backslide unless we have simple static checks in place. I also have #5830 and #5845 in mind -- while making this change does have the nice benefits you cite, it's not particularly urgent & will generate some git conflict churn in the meantime. So I would still prioritize burning down the PR queue first.

joshhwuu commented 3 months ago

while making this change does have the nice benefits you cite, it's not particularly urgent & will generate some git conflict churn in the meantime. So I would still prioritize burning down the PR queue first.

Right, this makes sense to me. If someone (probably future me) ever decides to pursue this, here's what I did to evaluate x without printing if it already contains a call to print:

# test.data.table.R
if (any(grepl("^print\\b", deparse(xsub)))) {
      out = capture.output(x <- suppressMessages(withCallingHandlers(tryCatch(x, error=eHandler), warning=wHandler, message=mHandler)))
    } else {
      out = capture.output(print(x <- suppressMessages(withCallingHandlers(tryCatch(x, error=eHandler), warning=wHandler, message=mHandler))))
    }