LudvigOlsen / groupdata2

R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.

Fix implementation of multiple unique fold columns for repeated cross-validation #6

Closed LudvigOlsen closed 5 years ago

LudvigOlsen commented 5 years ago

TODO for the new functionality (quick implementation) in fold(), intended for repeated cross-validation (repeatedCV branch):

  1. When detecting identical fold columns, it repeats column comparisons in later iterations that were already made in earlier ones. This is unnecessary. Also, create tests to see how this scales with bigger datasets.

  2. Each iteration that creates new fold columns creates num_fold_cols of them. This was kind of a lazy implementation. Could perhaps save time by adding only one or a few at a time.

Seems like there's room for improvement.
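To make the first point concrete, here is a minimal sketch of a duplicate check that compares each pair of columns at most once and skips pairs already checked in an earlier iteration. It is not the code in fold(): `find_duplicate_cols` and its `already_checked` argument are hypothetical names, and it assumes two fold columns count as identical only when `identical()` says so (relabelled but equivalent folds would be treated as different).

```r
# Hypothetical helper, not groupdata2's internal implementation.
# Returns the names of columns that are exact copies of an earlier column.
# `already_checked` names columns known to be mutually unique from a
# previous iteration, so pairs among them are not compared again.
find_duplicate_cols <- function(data, cols, already_checked = character(0)) {
  duplicates <- character(0)
  for (i in seq_along(cols)) {
    if (i == 1) next
    col_i <- cols[[i]]
    for (j in seq_len(i - 1)) {
      col_j <- cols[[j]]
      # Skip pairs that were compared in an earlier iteration
      if (col_i %in% already_checked && col_j %in% already_checked) next
      # A duplicate's original sits earlier in `cols` and was already compared
      if (col_j %in% duplicates) next
      if (identical(data[[col_i]], data[[col_j]])) {
        duplicates <- c(duplicates, col_i)
        break
      }
    }
  }
  duplicates
}

# Toy fold columns (made up for illustration)
folds <- data.frame(
  .folds_1 = factor(c(1, 1, 2, 2, 3, 3)),
  .folds_2 = factor(c(1, 2, 1, 2, 3, 3)),
  .folds_3 = factor(c(1, 1, 2, 2, 3, 3)),  # exact copy of .folds_1
  .folds_4 = factor(c(3, 3, 2, 2, 1, 1))
)

dups <- find_duplicate_cols(folds, names(folds))
print(dups)
```

In a later iteration, the columns that survived can be passed as `already_checked`, so only pairs involving newly added columns are compared.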

LudvigOlsen commented 5 years ago

Fixed both of these.

Then found that you can use unique(as.matrix(data), MARGIN=2) to do a similar thing, so I tested it against the current approach:

```r
library(groupdata2)
library(dplyr)  # for %>% and arrange()

set.seed(1)
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(25, 65, 34), 3),
  "diagnosis" = rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3),
  "score" = c(34, 23, 54, 23, 56, 76, 43, 56, 76, 42, 54, 1, 5, 76, 34, 76, 23, 65)
)

df <- df %>% dplyr::arrange(participant, score)

# Time the creation of 100 unique fold columns
system.time({
  df_folded_100reps <- fold(df, 3, num_col = 'score', num_fold_cols = 100, max_iters = 100)
})
```

Current approach (seconds): user 16.939, system 0.266, elapsed 17.310

Using unique (seconds): user 247.794, system 4.186, elapsed 253.402

So I'm sticking with my own approach. One reason for the difference may be that I only compare each pair of columns once, while unique can compare two columns up to 100 times in the example.
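For reference, the `unique(as.matrix(data), MARGIN=2)` alternative mentioned above can be sketched on a toy set of fold columns. The data and column names here are made up for illustration, and this shows only the deduplication step, not how it was wired into fold() for the timing comparison.

```r
# Toy fold columns (made up for illustration)
folds <- data.frame(
  .folds_1 = c(1, 1, 2, 2, 3, 3),
  .folds_2 = c(1, 2, 1, 2, 3, 3),
  .folds_3 = c(1, 1, 2, 2, 3, 3)  # exact copy of .folds_1
)

# With MARGIN = 2, unique() treats each column of the matrix as one element
# and drops later columns that duplicate an earlier one.
unique_folds <- unique(as.matrix(folds), MARGIN = 2)

# Names of the columns that were kept
kept_cols <- colnames(unique_folds)
print(kept_cols)
```

Here `.folds_3` should be dropped as a copy of `.folds_1`. Each call processes the whole matrix again, so calling it once per iteration redoes work on columns that are already known to be unique.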