evalclass / precrec

An R library for accurate and fast calculations of Precision-Recall and ROC curves
https://evalclass.github.io/precrec
GNU General Public License v3.0

fortify gives wrong order of dsid_modnames #18

Closed csangara closed 2 years ago

csangara commented 2 years ago

I noticed that when using fortify on an mmcurves object with raw_curves=TRUE, the dsid_modname column is not a concatenation of the dsid and modname columns. Test case shown below.

library(precrec)
library(ggplot2)

# Sample with 5 test sets and 3 models with 20 data points per test set
test <- create_sim_samples(5, 10, 10, c("random", "poor_er", "good_er"))

# Evaluate models
mmcurves <- evalmod(scores = test$scores, labels = test$labels,
                    modnames = test$modnames, dsids = test$dsids,
                    raw_curves=TRUE)

# Convert to dataframe
mmcurves_df <- subset(fortify(mmcurves, raw_curves = TRUE), curvetype=="PRC") 

# Check order of modname, dsid, and dsid_modname
> unique(paste0(mmcurves_df$modname, ":", mmcurves_df$dsid))
 [1] "random:1"  "poor_er:1" "good_er:1" "random:2"  "poor_er:2" "good_er:2" "random:3"  "poor_er:3"
 [9] "good_er:3" "random:4"  "poor_er:4" "good_er:4" "random:5"  "poor_er:5" "good_er:5"
> unique(mmcurves_df$dsid_modname)
 [1] random:1  random:2  random:3  random:4  random:5  poor_er:1 poor_er:2 poor_er:3 poor_er:4
[10] poor_er:5 good_er:1 good_er:2 good_er:3 good_er:4 good_er:5

takayasaito commented 2 years ago

Thank you for notifying us of this bug. We fixed the part where dsid_modnames was incorrectly generated in etc_utils_dataframe.R as follows.

  # Make dsid-modname pairs
  dsid_modnames <- paste(rep(uniq_modnames, length(uniq_dsids)),
                         rep(uniq_dsids, each=length(uniq_modnames)), sep=":")

We are going to submit a new version that includes this bug fix to CRAN soon.

csangara commented 2 years ago

Thank you for the quick reply! However, I think that fix still causes the same issue if the modnames are grouped by model (random*5, then poor_er*5, then good_er*5) instead of alternating between the three.

# Create samples, but now change the order of models
test <- create_sim_samples(5, 10, 10, c("random", "poor_er", "good_er"))
test$modnames <- rep(c("random", "poor_er", "good_er"), each=5)
test$dsids <- rep(1:5, 3)

# Evaluate models
mmcurves <- evalmod(scores = test$scores, labels = test$labels,
                    modnames = test$modnames, dsids = test$dsids,
                    raw_curves=TRUE)

# Convert to dataframe
mmcurves_df <- subset(fortify(mmcurves, raw_curves = TRUE), curvetype=="PRC") 

# dsid_modname is the wrong order in this case
> unique(paste0(mmcurves_df$modname, ":", mmcurves_df$dsid))
 [1] "random:1"  "random:2"  "random:3"  "random:4"  "random:5"  "poor_er:1"
 [7] "poor_er:2" "poor_er:3" "poor_er:4" "poor_er:5" "good_er:1" "good_er:2"
[13] "good_er:3" "good_er:4" "good_er:5"
> unique(mmcurves_df$dsid_modname)
 [1] random:1  poor_er:1 good_er:1 random:2  poor_er:2 good_er:2 random:3 
 [8] poor_er:3 good_er:3 random:4  poor_er:4 good_er:4 random:5  poor_er:5
[15] good_er:5

takayasaito commented 2 years ago

It seems like I should have preserved the order of the original data. It's too late to include a new bug fix in the next release (v0.12.8), but I will include the following fix in the one after the next release (v0.12.9).

  # Make dsid-modname pairs
  dsid_modnames <- paste(attr(obj, "data_info")$modnames,
                         attr(obj, "data_info")$dsids, sep = ":")

I still need to create unit tests for it, but I hope I can release v0.12.9 that includes this fix in the middle of February.

Thanks a lot again.
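
For readers following along, the difference between the two fixes can be reproduced in base R without precrec. This is only a sketch: the vectors `modnames` and `dsids` below are hypothetical stand-ins for the per-curve values that the second fix reads from `attr(obj, "data_info")`, and the names `rebuilt` and `preserved` are made up for illustration.

```r
# Hypothetical stand-ins for the per-curve model names and dataset IDs,
# grouped by model as in the second test case above.
modnames <- rep(c("random", "poor_er", "good_er"), each = 5)
dsids <- rep(1:5, 3)

uniq_modnames <- unique(modnames)
uniq_dsids <- unique(dsids)

# First fix: rebuild the pairs from the unique values.
# The result always alternates between models, whatever the input order was.
rebuilt <- paste(rep(uniq_modnames, length(uniq_dsids)),
                 rep(uniq_dsids, each = length(uniq_modnames)), sep = ":")

# Second fix: paste the per-curve vectors element by element,
# so the labels follow the original data order.
preserved <- paste(modnames, dsids, sep = ":")

head(rebuilt, 4)
head(preserved, 4)
identical(rebuilt, preserved)  # same labels, different order
setequal(rebuilt, preserved)
```

Both vectors contain the same 15 labels, but only the element-wise version lines up row by row with grouped input, which is the mismatch reported above.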

csangara commented 2 years ago

Thanks for the fix! And thanks for the great package, it really is the fastest one around. 😄