Open nikosbosse opened 1 month ago
@Bisaloo do you think you might have a minute to look at this, or are you swamped at the moment? In that case I can try and take over. Thank you!
For context: we have to update scoringutils on CRAN until October 31, 2024. The reason for that is that the package author fields are currently malformed in the CRAN version (they are fixed on gh) and we got a few messages asking us to fix this by October 31 latest. Ideally, it would be nice to have this fixed before we release anything to CRAN.
see #920
@seabbs tagging you for context on the deadline
For context: we have to update scoringutils on CRAN until October 31, 2024. The reason for that is that the package author fields are currently malformed in the CRAN version (they are fixed on gh) and we got a few messages asking us to fix this by October 31 latest. Ideally, it would be nice to have this fixed before we release anything to CRAN.
Ah I see. So we could just update the CRAN version from the release that we made it with (i.e update that branch and release from there). That takes the pressure off the very breaking dev version but also deadlines are nice
Exactly. But I think we could actually make the October 31 deadline. Only 3 more issues to got! https://github.com/epiforecasts/scoringutils/issues?q=is:issue+is:open+-milestone:scoringutils-2.x
Edit: SHOULD!
I can have a look after Wednesday. I don't have any intuition of what may be causing this yet :thinking:
Merci!
From what I understand, data.table has a complex internal state to control whether or not an object should be printed. The problem is that this state is stored in the .global
internal variable (https://github.com/Rdatatable/data.table/commit/342d6ba2cd87e4673ac6c3a2ae65a8b731a45cd1) which we cannot touch or even inspect from scoringutils :thinking:
scoringutils could potentially add its own internal variable to determine if printing should proceed (i.e., forwarded to print.data.table()
) or cancelled but the logic in data.table at least is quite convoluted.
I see. Hm... @MichaelChirico might you have an idea for this?
Sorry, I'm just glancing & not able to glean the full context. Do you have something that's printing, that shouldn't be, or something that's not printing, that should be?
Things are printing that shouldn't be printing. This works fine:
a <- data.table(x = 1:30, y = 2:31)
a[, model := "a"]
This prints although it shouldn't:
ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]
Same happens with scores
objects
ex <- score(example_quantile)
ex[, test := 3]
This is probably related to the fact that we have specific [.forecast
and [.scores
methods that revalidate objects when the user manipulates them.
In the data.table
code you linked @Bisaloo I see this comment in the code
This lets me think that the issue may be related to the fact that we're hitting [.forecast
twice. If I run this code:
library(scoringutils)
debug(scoringutils:::`[.forecast`)
ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]
I'm landing at [.forecast
twice. This does not seem to be due to the content of assert_forecast()
. Replacing
validation <- try(
assert_forecast(forecast = out, verbose = FALSE),,
silent = TRUE
)
with
validation <- try(
TRUE,
silent = TRUE
)
in [.forecast
doesn't change the fact that we're hitting [.forecast
twice...
Hm. Planning ahead: If we're not able to solve this before October 31, I see two and a half options:
[.forecast
and [.scores
functions until this is fixed. print.forecast
might also be a temporary solutionI'm leaning towards 2) and would like to avoid 3). Ideal solution would of course be fixing the issue until then. I'm a bit lost what's exactly causing this...
ok it seems like an alternative temporary solution would be to get rid of the custom print
method.
I created a reprex to simplify what's going on.
library(data.table)
`[.foo` <- function(x, ...) {
out <- NextMethod()
validate_x_again <- "ok"
return(out)
}
print.foo <- function(x, ...) {
print("foo")
NextMethod()
return(invisible(x))
}
ex <- data.table(x = 1:10)
setattr(ex, "class", c("foo", "data.table", "data.frame"))
ex[, y := 3]
ok digging deeper it seems like actually the problem is with the print
method. (I don't know why we had the same issue with scores
objects as well, but that doesn't seem to be the case anymore).
There is this issue from 2018 mentioning the exact same problem: https://github.com/Rdatatable/data.table/issues/3029.
I'm slightly confused we didn't notice this earlier (as far as I can tell we had the print
method a lot longer than Hugo's [.forecast
function.
In complete ignorance of what's actually happening I made some progress by including the following part from print.data.table
in the print function:
shouldPrint <- getAnywhere("shouldPrint")$objs[[1]]
# the following lines were taken from data.tables print.data.table
# https://github.com/Rdatatable/data.table/blob/058dd4d51ef4a70aef3b913e556b4ab1a150d1c9/R/print.data.table.R#L24C3-L41C4
# ----------------------------------------------------------------------------
# := in [.data.table sets .global$print=address(x) to suppress the next print i.e., like <- does. See FAQ 2.22 and README item in v1.9.5
# The issue is distinguishing "> DT" (after a previous := in a function) from "> DT[,foo:=1]". To print.data.table(), there
# is no difference. Now from R 3.2.0 a side effect of the very welcome and requested change to avoid silent deep copy is that
# there is now no longer a difference between > DT and > print(DT). So decided that DT[] is now needed to guarantee print; simpler.
# This applies just at the prompt. Inside functions, print(DT) will of course print.
# Other options investigated (could revisit): Cstack_info(), .Last.value gets set first before autoprint, history(), sys.status(),
# topenv(), inspecting next statement in caller, using clock() at C level to timeout suppression after some number of cycles
SYS <- sys.calls()
if (!shouldPrint(x)) {
SYS = sys.calls()
if (length(SYS) <= 2L || # "> DT" auto-print or "> print(DT)" explicit print (cannot distinguish from R 3.2.0 but that's ok)
( length(SYS) >= 3L && is.symbol(thisSYS <- SYS[[length(SYS)-2L]][[1L]]) &&
as.character(thisSYS) == 'source') || # suppress printing from source(echo = TRUE) calls, #2369
( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
as.character(thisSYS) %chin% mimicsAutoPrint ) ) {
return(invisible(x))
# is.symbol() temp fix for #1758.
}
}
At leat in the simple example above this seems to work, i.e. the following behaves correctly:
library(data.table)
print.foo <- function(x, ...) {
SYS <- sys.calls()
shouldPrint <- getAnywhere("shouldPrint")$objs[[1]]
if (!shouldPrint(x)) {
SYS = sys.calls()
if (length(SYS) <= 2L ||
( length(SYS) >= 3L && is.symbol(thisSYS <- SYS[[length(SYS)-2L]][[1L]]) &&
as.character(thisSYS) == 'source') ||
( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
as.character(thisSYS) %chin% mimicsAutoPrint ) ) {
return(invisible(x))
}
}
print("Additional info: foo")
NextMethod()
return(invisible(x))
}
ex <- data.table(x = 1:10)
setattr(ex, "class", c("foo", "data.table", "data.frame"))
ex[, y := 3]
ex
In the package itself it does not work again which again may have to do with the fact that [.forecast
is called twice and shouldPrint()
doesn't recognise that or something something...
Wow lots here.
Firstly I agree with "alternative temporary solution would be to get rid of the custom print method." as the option to go with to avoid us getting blocked if we cant resolve.
complete ignorance
I also have no idea what is happening here so guess its good it helps a bit!
I don't currently have a lot of useful insights for this unfortunately.
Digging deeper again it seems like the problem is not only with the print method. Even if we remove the print method, the problem still persists.
For example, this line in get_duplicate_forecasts()
also triggers a print:
data[, scoringutils_InternalDuplicateCheck := .N, by = c(forecast_unit, type)]
But this is performed on a data.table
object, not a forecast
object.
I will try and write up what I did exactly a bit later. But the tldr is: I wasn't able to find an easy fix.
I think our short-term options:
Both seem not ideal. I'm leaning slightly towards option 2.
It has to be 2. of those but yeah very much not ideal.
Urgh!!11!
I discovered a new issue:
debugonce(assert_forecast)
example_nominal
If assert_forecast
is triggered by - whatever is triggering it, I don't even know which I think is part of the problem - then only 10 rows are passed on to assert_forecast()
. For most forecast types that's fine, but for nominal forecasts this means that the validation fails immediately....
ok I think [.forecast
is triggered from within print.data.table()
where the dt is subset and only the first and last five rows are returned.
I think we need to (temporarily, I hope) remove some of the [.forecast
functionality until just printing example_nominal
doesn't fail anymore
I think that we can't fix the printing issue. Digging into the data.table
code (and as @nikosbosse states above) this relies on a global variable in the package environment being set to the object address to let data.table
know not to print the next data table that comes along.
This, we could perhaps address by cleverly manipulating the data.table
package environment. However, there is some additional logic to avoid suppressing printing in certain instances. In particular, it expects the call stack to be of length <=2 (or containing specific strings) in
In our case when called from print.forecast()
we get a call stack of length 4
[[1]]
(function (x, ...)
UseMethod("print"))(x)
[[2]]
print.forecast(x)
[[3]]
NextMethod()
[[4]]
print.data.table(x)
So the only option I can see would be to re-create the logic in data.table
with our own global variable, checking the call stack etc. Perhaps better to live with printing upon calling :=
?
Yeah it seems quite difficult... Maybe we should actually just use print.data.frame
or maybe even a tibble version for printing...
I think we'll have to keep thinking about this after the CRAN release. For now I temporarily deleted the check that got triggered by subsetting a forecast
Maybe we should actually just use
print.data.frame
or maybe even a tibble version for printing...
We could but of course we’d still get prints with :=
.
Argh. True... (empirically, I just tested it). But I'm not sure it always has to be the case. Hm....
My understanding is that because using [
with :=
returns a data.table
it'll always call print
unless assigned to something. The special thing in print.data.table
is that it has a mechanism for suppressing printing under certain conditions - which we can't meet.
Consider this example:
Upon changing the model column, the first data.table prints nothing. For
ex
, however, the whole data.table is printed again.EDIT: This is no longer true Same happens with
scores
objects@Bisaloo do you have any thoughts on this?