Closed by LudvigOlsen 5 years ago
Fixed both of these.
Then I found that `unique(as.matrix(data), MARGIN = 2)` does something similar, so I tested it against the current approach:

Code:
```r
library(dplyr)
library(groupdata2)

set.seed(1)
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(25, 65, 34), 3),
  "diagnosis" = rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3),
  "score" = c(34, 23, 54, 23, 56, 76, 43, 56, 76,
              42, 54, 1, 5, 76, 34, 76, 23, 65)
)

df <- df %>% dplyr::arrange(participant, score)

system.time({
  df_folded_100reps <- fold(df, 3, num_col = 'score',
                            num_fold_cols = 100, max_iters = 100)
})
```
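For reference, here is a minimal standalone sketch (not the package code) of what `unique()` with `MARGIN = 2` does: it treats the matrix column-wise and drops duplicate columns.

```r
# Three columns; columns 1 and 2 are identical.
m <- matrix(c(1, 2, 3,
              1, 2, 3,
              4, 5, 6), nrow = 3)

# MARGIN = 2 makes unique() compare columns instead of rows,
# so the duplicate column is removed.
u <- unique(m, MARGIN = 2)
ncol(u)  # 2
```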
Current approach:

```
   user  system elapsed
 16.939   0.266  17.310
```

Using `unique()`:

```
   user  system elapsed
247.794   4.186 253.402
```
So I'm sticking with my own approach. One reason for the difference may be that I only compare each pair of columns once, while `unique()` can compare two columns up to 100 times in this example.
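The "compare each pair once" idea can be sketched as follows. This is a hypothetical simplification, not the actual `fold()` internals: each candidate column is checked against the already-kept columns exactly once, and only non-duplicates are kept.

```r
# Returns TRUE if `candidate` is identical to any column already kept.
is_duplicate <- function(candidate, kept_cols) {
  any(vapply(kept_cols, function(col) identical(col, candidate), logical(1)))
}

kept <- list(c(1, 1, 2, 2))
candidate_a <- c(1, 1, 2, 2)  # duplicate of kept[[1]]
candidate_b <- c(2, 2, 1, 1)  # a new grouping

if (!is_duplicate(candidate_a, kept)) kept <- c(kept, list(candidate_a))
if (!is_duplicate(candidate_b, kept)) kept <- c(kept, list(candidate_b))
length(kept)  # 2
```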
TODO for the new functionality (quick implementation) in `fold()`, intended for repeated cross-validation (repeatedCV branch):

- When detecting identical fold columns, it repeats column comparisons in secondary iterations, which is unnecessary. Also, create tests to see how this scales with bigger datasets.
- For each iteration of creating new fold columns, it creates `num_fold_cols` columns at once. This was a somewhat lazy implementation; it could perhaps save time by adding one or a few columns at a time.
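The second TODO could look roughly like the sketch below: generate small batches of candidate columns and keep only the non-duplicates, stopping as soon as enough unique fold columns have been collected. `make_fold_col` is a hypothetical stand-in for the real fold-column generator inside `fold()`.

```r
# Collect `n_wanted` unique fold columns, generating them in small
# batches instead of `num_fold_cols` at a time.
collect_unique_cols <- function(make_fold_col, n_wanted,
                                batch_size = 5, max_iters = 100) {
  kept <- list()
  for (i in seq_len(max_iters)) {
    for (j in seq_len(batch_size)) {
      candidate <- make_fold_col()
      # Compare the candidate against each kept column exactly once.
      dup <- any(vapply(kept, identical, logical(1), y = candidate))
      if (!dup) kept <- c(kept, list(candidate))
      if (length(kept) >= n_wanted) return(kept)
    }
  }
  kept  # may be fewer than n_wanted if max_iters is exhausted
}

set.seed(1)
# Toy generator: random assignment of 6 rows to 3 folds of size 2.
gen <- function() sample(rep(1:3, 2))
cols <- collect_unique_cols(gen, n_wanted = 10)
```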
Seems like there's room for improvement.