Closed Bisaloo closed 2 years ago
Ah, I was so slow to post this issue that there is already a fix: https://github.com/epiforecasts/scoringutils/pull/139.
Good to have the problem laid out in detail, though. I chose to make everything NaN, but it sounds like we're better off with NA.
For what it's worth, the problem in our case was:
> dt <- data.table(id = c(rep(1, 3), rep(2, 3), rep(3, 3)), a = c(as.vector(replicate(3, c(NA_real_, rep(rnorm(1), 2))))))
> dt[, a := unique(na.omit(a)), by = id]
> unique(dt)
   id          a
1:  1 0.05785241
2:  2 0.05029081
3:  3 0.57708069
> dt <- data.table(id = c(rep(1, 3), rep(2, 3), rep(3, 3)), a = c(as.vector(replicate(2, c(NA_real_, rep(rnorm(1), 2)))), c(NA_real_, rep(NaN, 2))))
> dt[, a := unique(na.omit(a)), by = id]
> unique(dt)
   id           a
1:  1  0.09825976
2:  2 -1.06038560
3:  3          NA
4:  3         NaN
When saved to CSV, the second data table ends up with a duplicated row for id 3.
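A minimal sketch of the same effect and one possible workaround, using a base R data frame instead of data.table to stay dependency-free. The values are made up, and nan_to_na() is a hypothetical helper, not something provided by scoringutils:

```r
# Toy data standing in for the table above; the values are made up.
scores <- data.frame(
  id = c(1, 1, 2, 2),
  a  = c(0.5, 0.5, NA_real_, NaN)
)

# unique() treats NA and NaN as different values,
# so id 2 keeps one row with NA and one row with NaN.
rows_before <- nrow(unique(scores))

# Hypothetical helper: replace every NaN in a numeric vector with NA.
nan_to_na <- function(x) {
  x[is.nan(x)] <- NA_real_
  x
}

# After normalizing, the two rows for id 2 collapse into one.
scores$a <- nan_to_na(scores$a)
deduplicated <- unique(scores)
rows_after <- nrow(deduplicated)
```

Normalizing just before deduplication or export is only a stopgap; returning a single missing-value marker at the source avoids the need for it.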
Hm, wouldn't it be best to assign NA_real_ to theta in #139?
It was changed afterwards (but directly on master): https://github.com/epiforecasts/scoringutils/commit/24b1efc083ea0d4096855f1b27dba9c1d7a98052
ah. perfect!
Reprex
Created on 2021-11-10 by the reprex package (v2.0.1.9000)
As you can see here, we have two almost identical lines for EuroCOVIDhub-ensemble, except that one has NA and the other has NaN. The fact that a model can have more than one row causes issues in downstream analyses in our case.

Description of the problem
It looks like pairwise_comparison() sometimes returns NA and sometimes NaN when it cannot compute the value. This leads to confusion because both NA and NaN indicate almost the same thing but they have strange incompatibilities:

Created on 2021-11-10 by the reprex package (v2.0.1.9000)
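The kind of mismatch meant here can be sketched with base R alone. This is an illustrative example rather than the original reprex, and the variable names are arbitrary:

```r
missing_values <- c(NA_real_, NaN)

# is.na() flags both values as missing ...
flagged_by_is_na <- is.na(missing_values)

# ... but is.nan() only flags NaN, never NA.
flagged_by_is_nan <- is.nan(missing_values)

# The two values are also not identical to each other.
same_value <- identical(NA_real_, NaN)
```

So code that filters with is.na() sees a single kind of missing value, while code that compares or deduplicates values sees two.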
Proposed solution
NaN is rarely used and confusing (IMO). It often appears because a function doesn't control its output for errors or impossible computations, and it can cause serious issues in downstream analyses (such as in our case). Tomas Kalibera gives a good overview of the hell that is NA vs NaN:

So we should stick to only one of these. As far as I know, NA always propagates as NA, while f(NaN) can return NA or NaN depending on f(), so a conscious choice to always output NA would be much better / clearer IMO.
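A few base R calls illustrate how the result of f(NaN) depends on f(). This is a small sketch on a typical platform, not an exhaustive list, and the variable names are arbitrary:

```r
# Arithmetic on NaN keeps NaN.
nan_plus_one <- NaN + 1

# A comparison with NaN yields a logical NA instead.
nan_compared <- NaN > 1

# Coercing NaN to integer also yields NA, since integers have no NaN.
nan_as_integer <- as.integer(NaN)

# Arithmetic on NA stays missing.
na_plus_one <- NA_real_ + 1

results <- list(
  nan_plus_one   = is.nan(nan_plus_one),
  nan_compared   = is.na(nan_compared) && !is.nan(nan_compared),
  nan_as_integer = is.na(nan_as_integer) && !is.nan(nan_as_integer),
  na_plus_one    = is.na(na_plus_one)
)
```

Each entry of results is TRUE, which shows NaN turning into NA in some operations and surviving in others.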