epiforecasts / scoringutils

Utilities for Scoring and Assessing Predictions
https://epiforecasts.io/scoringutils/
Other
48 stars 21 forks source link

`forecast` and `scores` objects print output every time a column is changed #935

Open nikosbosse opened 1 month ago

nikosbosse commented 1 month ago

Consider this example:

a <- data.table(x = 1:30, y = 2:31)
a[, model := "a"]

ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]

Upon changing the model column, the first data.table prints nothing. For ex, however, the whole data.table is printed again.


EDIT: This is no longer true Same happens with scores objects

ex <- score(example_quantile)
ex[, test := 3]

@Bisaloo do you have any thoughts on this?

nikosbosse commented 1 month ago

@Bisaloo do you think you might have a minute to look at this, or are you swamped at the moment? In that case I can try and take over. Thank you!

For context: we have to update scoringutils on CRAN until October 31, 2024. The reason for that is that the package author fields are currently malformed in the CRAN version (they are fixed on gh) and we got a few messages asking us to fix this by October 31 latest. Ideally, it would be nice to have this fixed before we release anything to CRAN.

see #920

nikosbosse commented 1 month ago

@seabbs tagging you for context on the deadline

seabbs commented 1 month ago

For context: we have to update scoringutils on CRAN until October 31, 2024. The reason for that is that the package author fields are currently malformed in the CRAN version (they are fixed on gh) and we got a few messages asking us to fix this by October 31 latest. Ideally, it would be nice to have this fixed before we release anything to CRAN.

Ah I see. So we could just update the CRAN version from the release that we made it with (i.e update that branch and release from there). That takes the pressure off the very breaking dev version but also deadlines are nice

nikosbosse commented 1 month ago

Exactly. But I think we could actually make the October 31 deadline. Only 3 more issues to got! https://github.com/epiforecasts/scoringutils/issues?q=is:issue+is:open+-milestone:scoringutils-2.x

Edit: SHOULD!

Bisaloo commented 1 month ago

I can have a look after Wednesday. I don't have any intuition of what may be causing this yet :thinking:

nikosbosse commented 1 month ago

Merci!

Bisaloo commented 1 month ago

From what I understand, data.table has a complex internal state to control whether or not an object should be printed. The problem is that this state is stored in the .global internal variable (https://github.com/Rdatatable/data.table/commit/342d6ba2cd87e4673ac6c3a2ae65a8b731a45cd1) which we cannot touch or even inspect from scoringutils :thinking:

scoringutils could potentially add its own internal variable to determine if printing should proceed (i.e., forwarded to print.data.table()) or cancelled but the logic in data.table at least is quite convoluted.

nikosbosse commented 1 month ago

I see. Hm... @MichaelChirico might you have an idea for this?

MichaelChirico commented 1 month ago

Sorry, I'm just glancing & not able to glean the full context. Do you have something that's printing, that shouldn't be, or something that's not printing, that should be?

nikosbosse commented 1 month ago

Things are printing that shouldn't be printing. This works fine:

a <- data.table(x = 1:30, y = 2:31)
a[, model := "a"]

This prints although it shouldn't:

ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]

Same happens with scores objects

ex <- score(example_quantile)
ex[, test := 3]

This is probably related to the fact that we have specific [.forecast and [.scores methods that revalidate objects when the user manipulates them.

nikosbosse commented 1 month ago

In the data.table code you linked @Bisaloo I see this comment in the code

image

This lets me think that the issue may be related to the fact that we're hitting [.forecast twice. If I run this code:

library(scoringutils)
debug(scoringutils:::`[.forecast`)
ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]

I'm landing at [.forecast twice. This does not seem to be due to the content of assert_forecast(). Replacing

validation <- try(
      assert_forecast(forecast = out, verbose = FALSE),,
      silent = TRUE
    )

with

validation <- try(
      TRUE,
      silent = TRUE
    )

in [.forecast doesn't change the fact that we're hitting [.forecast twice...

nikosbosse commented 1 month ago

Hm. Planning ahead: If we're not able to solve this before October 31, I see two and a half options:

I'm leaning towards 2) and would like to avoid 3). Ideal solution would of course be fixing the issue until then. I'm a bit lost what's exactly causing this...

nikosbosse commented 1 month ago

ok it seems like an alternative temporary solution would be to get rid of the custom print method.

I created a reprex to simplify what's going on.

library(data.table)

`[.foo` <- function(x, ...) {
  out <- NextMethod()
  validate_x_again <- "ok"
  return(out)
}

print.foo <- function(x, ...) {
  print("foo")
  NextMethod()
  return(invisible(x))
}

ex <- data.table(x = 1:10)
setattr(ex, "class", c("foo", "data.table", "data.frame"))
ex[, y := 3]
nikosbosse commented 1 month ago

ok digging deeper it seems like actually the problem is with the print method. (I don't know why we had the same issue with scores objects as well, but that doesn't seem to be the case anymore).

There is this issue from 2018 mentioning the exact same problem: https://github.com/Rdatatable/data.table/issues/3029.

I'm slightly confused we didn't notice this earlier (as far as I can tell we had the print method a lot longer than Hugo's [.forecast function.

nikosbosse commented 1 month ago

In complete ignorance of what's actually happening I made some progress by including the following part from print.data.table in the print function:

shouldPrint <- getAnywhere("shouldPrint")$objs[[1]]
  # the following lines were taken from data.tables print.data.table
  # https://github.com/Rdatatable/data.table/blob/058dd4d51ef4a70aef3b913e556b4ab1a150d1c9/R/print.data.table.R#L24C3-L41C4
  # ----------------------------------------------------------------------------
  #  := in [.data.table sets .global$print=address(x) to suppress the next print i.e., like <- does. See FAQ 2.22 and README item in v1.9.5
  # The issue is distinguishing "> DT" (after a previous := in a function) from "> DT[,foo:=1]". To print.data.table(), there
  # is no difference. Now from R 3.2.0 a side effect of the very welcome and requested change to avoid silent deep copy is that
  # there is now no longer a difference between > DT and > print(DT). So decided that DT[] is now needed to guarantee print; simpler.
  # This applies just at the prompt. Inside functions, print(DT) will of course print.
  # Other options investigated (could revisit): Cstack_info(), .Last.value gets set first before autoprint, history(), sys.status(),
  #   topenv(), inspecting next statement in caller, using clock() at C level to timeout suppression after some number of cycles
  SYS <- sys.calls()
  if (!shouldPrint(x)) {
    SYS = sys.calls()
    if (length(SYS) <= 2L ||  # "> DT" auto-print or "> print(DT)" explicit print (cannot distinguish from R 3.2.0 but that's ok)
        ( length(SYS) >= 3L && is.symbol(thisSYS <- SYS[[length(SYS)-2L]][[1L]]) &&
          as.character(thisSYS) == 'source') || # suppress printing from source(echo = TRUE) calls, #2369
        ( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
          as.character(thisSYS) %chin% mimicsAutoPrint ) )  {
      return(invisible(x))
      # is.symbol() temp fix for #1758.
    }
  }

At leat in the simple example above this seems to work, i.e. the following behaves correctly:

library(data.table)
print.foo <- function(x, ...) {
  SYS <- sys.calls()
  shouldPrint <- getAnywhere("shouldPrint")$objs[[1]]
  if (!shouldPrint(x)) {
    SYS = sys.calls()
    if (length(SYS) <= 2L ||
        ( length(SYS) >= 3L && is.symbol(thisSYS <- SYS[[length(SYS)-2L]][[1L]]) &&
          as.character(thisSYS) == 'source') ||
        ( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
          as.character(thisSYS) %chin% mimicsAutoPrint ) )  {
      return(invisible(x))
    }
  }

  print("Additional info: foo")
  NextMethod()
  return(invisible(x))
}

ex <- data.table(x = 1:10)
setattr(ex, "class", c("foo", "data.table", "data.frame"))
ex[, y := 3]
ex

In the package itself it does not work again which again may have to do with the fact that [.forecast is called twice and shouldPrint() doesn't recognise that or something something...

seabbs commented 1 month ago

Wow lots here.

Firstly I agree with "alternative temporary solution would be to get rid of the custom print method." as the option to go with to avoid us getting blocked if we cant resolve.

complete ignorance

I also have no idea what is happening here so guess its good it helps a bit!

I don't currently have a lot of useful insights for this unfortunately.

nikosbosse commented 1 month ago

Digging deeper again it seems like the problem is not only with the print method. Even if we remove the print method, the problem still persists.

For example, this line in get_duplicate_forecasts() also triggers a print:

data[, scoringutils_InternalDuplicateCheck := .N, by = c(forecast_unit, type)]

But this is performed on a data.table object, not a forecast object.

I will try and write up what I did exactly a bit later. But the tldr is: I wasn't able to find an easy fix.

I think our short-term options:

Both seem not ideal. I'm leaning slightly towards option 2.

seabbs commented 1 month ago

It has to be 2. of those but yeah very much not ideal.

nikosbosse commented 4 weeks ago

Urgh!!11!

I discovered a new issue:

debugonce(assert_forecast)
example_nominal

If assert_forecast is triggered by - whatever is triggering it, I don't even know which I think is part of the problem - then only 10 rows are passed on to assert_forecast(). For most forecast types that's fine, but for nominal forecasts this means that the validation fails immediately....

nikosbosse commented 4 weeks ago

ok I think [.forecast is triggered from within print.data.table() where the dt is subset and only the first and last five rows are returned.

nikosbosse commented 3 weeks ago

I think we need to (temporarily, I hope) remove some of the [.forecast functionality until just printing example_nominal doesn't fail anymore

sbfnk commented 3 weeks ago

I think that we can't fix the printing issue. Digging into the data.table code (and as @nikosbosse states above) this relies on a global variable in the package environment being set to the object address to let data.table know not to print the next data table that comes along.

This, we could perhaps address by cleverly manipulating the data.table package environment. However, there is some additional logic to avoid suppressing printing in certain instances. In particular, it expects the call stack to be of length <=2 (or containing specific strings) in

https://github.com/Rdatatable/data.table/blob/afa87e1eeb61922cdef512fefd86e3ba5847eef9/R/print.data.table.R#L33

In our case when called from print.forecast() we get a call stack of length 4

[[1]]
(function (x, ...) 
UseMethod("print"))(x)

[[2]]
print.forecast(x)

[[3]]
NextMethod()

[[4]]
print.data.table(x)

So the only option I can see would be to re-create the logic in data.table with our own global variable, checking the call stack etc. Perhaps better to live with printing upon calling :=?

nikosbosse commented 3 weeks ago

Yeah it seems quite difficult... Maybe we should actually just use print.data.frame or maybe even a tibble version for printing... I think we'll have to keep thinking about this after the CRAN release. For now I temporarily deleted the check that got triggered by subsetting a forecast

sbfnk commented 3 weeks ago

Maybe we should actually just use print.data.frame or maybe even a tibble version for printing...

We could but of course we’d still get prints with :=.

nikosbosse commented 3 weeks ago

Argh. True... (empirically, I just tested it). But I'm not sure it always has to be the case. Hm....

sbfnk commented 3 weeks ago

My understanding is that because using [ with := returns a data.table it'll always call print unless assigned to something. The special thing in print.data.table is that it has a mechanism for suppressing printing under certain conditions - which we can't meet.