Closed: aoifecahill closed this issue 6 years ago.
This is something we should definitely fix in the next release. Thanks for the detailed report!
As a first check, I factored out the part of `learner.cross_validate()` that figures out the details of the folds into a separate function and then ran some timing experiments using the titanic data. As the screenshot below shows, although using a folds file is a little slower than using an integer (as expected), it's nowhere near the levels of slowness indicated by the timings reported above by @aoifecahill.
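A rough sketch of what such a factored-out helper might look like (the names and logic here are illustrative, not the actual SKLL internals; the real code builds scikit-learn CV iterators):

```python
# Hypothetical sketch of the factored-out fold-setup helper described above.
# It just shows the two input shapes the function has to handle.
def setup_cv_folds(cv_folds, example_ids):
    """Return a fold id for each example.

    cv_folds is either an int (the number of folds) or a dict mapping
    example id -> fold id, as parsed from a cv_folds_file.
    """
    if isinstance(cv_folds, int):
        # round-robin assignment stands in for (Stratified)KFold
        return [i % cv_folds for i in range(len(example_ids))]
    # folds file: look up the pre-specified fold for each example
    return [cv_folds[ex] for ex in example_ids]

ids = ["EX%d" % i for i in range(6)]
print(setup_cv_folds(3, ids))                        # [0, 1, 2, 0, 1, 2]
print(setup_cv_folds({ex: "A" for ex in ids}, ids))  # ['A', 'A', 'A', 'A', 'A', 'A']
```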
So, the next step is to check whether the type of learner matters, which is what the timing numbers also indicate. FYI, `examples` was instantiated using `_load_featureset()` and `cv_folds` was instantiated using `_load_cv_folds()`, just like they would have been when using `run_experiment`.
I was able to replicate the issue by calling `%timeit` on two different `run_configuration()` calls: one using the regular xval config file and one using an xval config file with `cv_folds_file` specified. Here are the results using a `DecisionTreeClassifier` and an `SVC`. These numbers are a bit more realistic since `%timeit` runs the functions multiple times and computes the average.
```
decision tree, 10 folds
1.76 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

decision tree, folds file
1.05 s ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

SVC, 10 folds
11.9 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

SVC, folds file
1min 50s ± 2.72 s per loop (mean ± std. dev. of 7 runs, 5 loops each)
```
I think next I will try to profile `run_configuration()` to see what's going on in either case.
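For reference, a generic way to profile a call like this with the standard library; `run_configuration()` itself is stubbed out here, so this is just the profiling scaffolding, not SKLL code:

```python
import cProfile
import pstats

def run_configuration_stub():
    # stand-in for run_configuration('cross_validate.cfg', local=True, quiet=True)
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
run_configuration_stub()
profiler.disable()

# print the five most expensive calls by cumulative time
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(5)
```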
Okay, I have fixed a bug in how the number of grid search jobs was being calculated when a CV folds file was specified. Essentially, the number of grid search jobs was being forced down to 1 instead of the default of 3 due to this bug. With this bugfix, the stats are now as follows:
```
SVC, 10 folds
11.7 s ± 72.1 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

SVC, folds file (buggy)
1min 47s ± 471 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

SVC, folds file (fixed)
43.1 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
So, there's a clear ~2.5-3x improvement with this bugfix. However, things are still slower than the non-folds-file case. While some slowdown is to be expected, it still looks like things are about 4x as slow as when not specifying the folds file. I will continue to investigate.
@aoifecahill I think I figured out why things are slower when you specify a CV folds file, above and beyond the bug I mentioned fixing in my previous comment.
It's best explained in the context of an example where we are doing, say, 10-fold cross-validation. Scenario 1 is when we just specify 10 as the number of folds to use for cross-validation (which is also the default for SKLL) and Scenario 2 is when we use a CV folds file that actually contains the same 10 folds but specified as a CSV file. Let's look at what happens in each of these scenarios in detail.
In Scenario 1:

- In the `cross_validate()` method, we create ten 90% train / 10% test combinations using a (Stratified) K-Fold CV iterator.
- We call the `train()` method on each of the training sets.
- Inside `train()`, we use 3-fold cross-validation to do grid search. 3 is the default number of grid search folds.

In Scenario 2:

- In the `cross_validate()` method, we create ten 90% train / 10% test combinations using a Leave-One-Fold-Out CV iterator.
- We call the `train()` method on each of the training sets.
- Inside `train()`, we again use a Leave-One-Fold-Out CV iterator for grid search. This means that you are actually doing a 9-fold grid search in this particular case, as opposed to the 3-fold grid search in Scenario 1.

I confirmed the above empirically: you can see it if you examine the `grid_searcher` object that's computed in `train()` here.
Here's the object as generated in Scenario 1:
```
# there are 30 total combinations of gamma (6 values) and C (5 values) in the grid
ipdb> len(grid_searcher.cv_results_['params'])
30

# there are 3 train/test splits for a 3-fold xval in grid search, as expected
ipdb> [k for k in grid_searcher.cv_results_ if k.startswith('split')]
['split0_test_score', 'split1_test_score', 'split2_test_score',
 'split0_train_score', 'split1_train_score', 'split2_train_score']
```
And here's the same object as generated in Scenario 2:
```
# still 30 total combinations of gamma (6 values) and C (5 values) in the grid
ipdb> len(grid_searcher.cv_results_['params'])
30

# now there are 9 train/test splits because we are using leave-one-out over 9 groups
ipdb> [k for k in grid_searcher.cv_results_ if k.startswith('split')]
['split0_test_score', 'split1_test_score', 'split2_test_score',
 'split3_test_score', 'split4_test_score', 'split5_test_score',
 'split6_test_score', 'split7_test_score', 'split8_test_score',
 'split0_train_score', 'split1_train_score', 'split2_train_score',
 'split3_train_score', 'split4_train_score', 'split5_train_score',
 'split6_train_score', 'split7_train_score', 'split8_train_score']
```
So, in short, things are slower when you specify a CV folds file because the inner grid-search cross-validation loop is doing much more work.
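To make the extra work concrete, here is a minimal sketch (not SKLL code) of how leave-one-fold-out splitting behaves: given a fold mapping like the one a CV folds file encodes, it produces one inner train/test split per distinct fold id, so a 9-fold file means 9 grid-search splits instead of the default 3.

```python
def leave_one_fold_out(fold_map):
    """Yield one (train_ids, test_ids) split per distinct fold id."""
    for held_out in sorted(set(fold_map.values())):
        test = [ex for ex, fold in fold_map.items() if fold == held_out]
        train = [ex for ex, fold in fold_map.items() if fold != held_out]
        yield train, test

# 18 examples spread over 9 folds, as in the scenario above
fold_map = {"ex%02d" % i: i % 9 for i in range(18)}
splits = list(leave_one_fold_out(fold_map))
print(len(splits))  # 9 inner splits, versus 3 for the default 3-fold grid search
```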
Interesting, thanks for figuring it out!
My hunch is that the extra work when specifying the folds file is unintentional and that we probably want to still be doing roughly the same amount of work as when we specify just the number of folds.
It's a little trickier than that because `train()` is both a direct entry point for `run_experiment()` (via the `train` task) and an indirect one via `cross_validate()`. And it looks like we allow `cv_folds_file` to also be used as a way to define the grid search folds just for regular training, i.e., users can specify `cv_folds_file` with `task=train`, and SKLL will use leave-one-out instead of 3-fold cross-validation for grid search even then.
I see two solutions:
(1) We don't care about externally defining folds for grid search when `task=train`, so we change `train()` to always use 3-fold cross-validation for grid search (or whatever number the user specifies using `grid_search_folds` in the config file; the default is 3).

(2) We think externally defining grid-search folds is useful for `task=train` but not for the internal second-level cross-validation loop in `task=cross_validate`. So, we allow the former but default to 3-fold for the latter.
Personally, I don't think (2) makes sense: either you care about explicitly defining the actual grid-search folds when training the model or you don't. However, maybe I am missing some use case where this would matter. Please chime in if you think you have an example of such a use case or if you have any other thoughts/questions.
To me, the most common reason for specifying folds is to ensure that the folds are independent in terms of one or more variables, so that the model does not learn to differentiate based on a confound: e.g., the data from the same person is always in the same fold. From this point of view, it makes sense to me to keep the same folds for both loops, as is done now, since it makes for a cleaner experiment.
I don't see an issue with longer timings in exchange for this level of control, but I would describe this in the documentation and suggest that users use a lower N for the outer folds if they want to reduce the run time while still controlling the folds.
An alternative could be to ask the user to provide nested folds: the .csv file with folds would also contain a variable indicating the fold for the inner loop. This way the user can control the number and composition of folds in both loops.
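Purely as an illustration, such a nested folds file might look something like this (a hypothetical format with an extra column for the inner-loop fold; the column names are made up here):

```
id,fold,inner_fold
EX1,1,a
EX2,1,b
EX3,2,a
EX4,2,b
```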
@aloukina thanks for the insightful comment. I agree that it definitely makes for a cleaner experiment if you use the same folds file for both the inner and outer loops, especially in cases where you want to ensure that the data from a fold is not in both the train and test partitions at the same time. So, perhaps your suggestion of leaving things as they are but making the documentation more explicit is the right thing to do here.
I don't think we want to add nested folds. That's a little too complicated and I am not really sure anyone would use it, TBH.
@aoifecahill do you have any further thoughts about this? Anyone else?
How easy would it be to allow users to choose between either using the same folds for both loops or specifying N folds for the inner loop but losing control over fold composition?
It's probably doable, but do we really want to allow the "wrong" thing just to save time? That's what I was alluding to earlier: either you are all in or not in at all. What's the use case where you might want to be in the middle?
If you pre-specified folds because of other considerations (e.g., you already wrote up a description for each fold and don't want to re-write it) and don't really care about their composition as such, but do care about timing? I suppose this might be the reason for @aoifecahill's concern in the first place. But I generally agree with @desilinguist.
Yes, that's exactly my situation. My fold ids for the cross validation experiment come from an external process, and there are 10 of them, but I don't necessarily want to do leave-one-out cross validation for the tuning during each fold. I would be happy for that to be fixed at 3 like it is if you simply specify the number of folds. Because of the length of time it takes to run these experiments, running leave-one-out on both inner and outer loops adds days or even weeks to my cross validation experiments.
I see how that could potentially be the "wrong" thing, depending on how the folds were generated in the first place. For me that is not the case though and so I would like to be able to control the inner and outer loops separately.
Hmm, I see @aoifecahill's point as well. Perhaps we need another setting, say `cv_preserve_folds`, which is set to true by default but can be overridden? I am not sure that's the best solution though. Any other suggestions, people?
Actually, given @aloukina's point, I guess it makes sense to continue to allow externally specified folds to be used inside `train()` for grid search even when `task=train`? Is that correct? @aoifecahill @aloukina?
Okay, I thought a bit more about it and here's a possible solution:

1. We rename `cv_folds_file` to `folds_file` and make it applicable to both the training and cross-validation tasks (as it already is, but we document the `train()` usage explicitly).
2. We add another option called `cv_preserve_folds`, which is set to true by default but can be overridden; if it is set to false, the inner-loop grid search does not use the same folds but instead uses the value of `grid_search_folds` (3 by default).
3. If you specify `folds_file` and `task=train`, the `cv_preserve_folds` option is simply ignored.
@aloukina said she's okay with this solution. @aoifecahill what do you think? Others?
This sounds reasonable, thanks!
Okay, I have implemented the solution we discussed above (PR coming in in the next hour or so).
I re-ran the timing experiments to check whether my solution works and it seems like it does.
```
# SVC, 10 folds, no folds file
%timeit -n 2 run_configuration('cross_validate.cfg', local=True, quiet=True)
11.7 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

# SVC, folds file with 10 folds, using folds file for inner grid search
%timeit -n 2 run_configuration('cross_validate_cv_file.cfg', local=True, quiet=True)
42.5 s ± 79.2 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

# SVC, folds file with 10 folds, not using folds file for inner grid search
%timeit -n 2 run_configuration('cross_validate_cv_file_faster.cfg', local=True, quiet=True)
11.8 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
Addressed by #367.
Attached is a version of the titanic `cross_validate.cfg` that specifies the cross-validation folds (as output by the original titanic `cross_validate.cfg` config). Here are the timings with and without specifying the fold ids file:

Some of the experiments take much longer (e.g., SVC and Random Forest), while others are similar. All of these timings are from experiments run in local mode.

cross_validate_specify_folds.cfg.txt