EducationalTestingService / skll

SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
http://skll.readthedocs.org

specifying cross validation folds slows down experiments #363

Closed aoifecahill closed 6 years ago

aoifecahill commented 7 years ago

Attached is a version of the titanic cross_validate.cfg that specifies the cross validation folds (as output by the original titanic cross_validate.cfg config).

Here are the timings with and without specifying the fold ids file:

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier.results:Total Time: 0:01:26.131353
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier.results:Total Time: 0:09:54.971322

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier.results:Total Time: 0:00:02.022798
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier.results:Total Time: 0:00:01.908257

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_SVC.results:Total Time: 0:00:32.489498
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_SVC.results:Total Time: 0:05:17.998272

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_MultinomialNB.results:Total Time: 0:00:01.921427
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_MultinomialNB.results:Total Time: 0:00:02.363040

Some of the experiments take much longer with the folds file (e.g. SVC and Random Forest), while others are similar.

All of these timings are from experiments run in local mode.

cross_validate_specify_folds.cfg.txt

desilinguist commented 7 years ago

This is something we should definitely fix in the next release. Thanks for the detailed report!

desilinguist commented 7 years ago

As a first check, I factored out the part of learner.cross_validate() that figures out the details of the folds into a separate function and then ran some timing experiments using the titanic data. As the screenshot below shows, although using a folds file is a little slower than using an integer (as expected), it's nowhere near the levels of slowness indicated by the timings reported above by @aoifecahill.

[screenshot: timing comparison of integer folds vs. folds file, 2017-10-04]

So, the next step is to check whether the type of learner matters, which is what the timing numbers above also indicate. FYI, examples was instantiated using _load_featureset() and cv_folds was instantiated using _load_cv_folds(), just as they would have been when using run_experiment.
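For context, a CV folds file is just a CSV mapping each example ID to a fold ID. A minimal sketch of reading one (illustrative helper, not SKLL's actual _load_cv_folds) might look like:

```python
import csv
from io import StringIO

def read_folds_file(csv_text):
    """Read a folds CSV (header row, then example_id,fold_id rows)
    into a dict mapping each example ID to its fold ID."""
    reader = csv.reader(StringIO(csv_text))
    next(reader)  # skip the header row
    return {example_id: fold_id for example_id, fold_id in reader}

folds = read_folds_file("id,fold\n1,0\n2,0\n3,1\n4,1\n")
```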

desilinguist commented 7 years ago

I was able to replicate the issue by calling %timeit on two different run_configuration calls - one using the regular xval config file and one using an xval config file with cv_folds_file specified. Here are the results using a DecisionTreeClassifier and an SVC. These numbers are a bit more realistic since %timeit runs the functions multiple times and computes the average.

decision tree, 10 folds
  1.76 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
decision tree, folds file
  1.05 s ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

SVC, 10 folds
  11.9 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
SVC, folds file
  1min 50s ± 2.72 s per loop (mean ± std. dev. of 7 runs, 5 loops each)

I think next I will try to profile run_configuration() to see what's going on in either case.

desilinguist commented 7 years ago

Okay, I have fixed a bug in how the number of grid search jobs was being calculated when a CV folds file was specified. Essentially, the number of grid search jobs was being forced down to 1 instead of the default of 3 due to this bug. With this bugfix, the stats are now as follows:
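For illustration, the shape of the bug was roughly the following (function and variable names are hypothetical, not SKLL's actual code): the job-count calculation only handled integer fold specifications, so a folds mapping silently fell through to a single job.

```python
def num_grid_jobs_buggy(grid_search_folds, cv_folds):
    # hypothetical buggy version: a dict mapping example IDs to fold IDs
    # is not an int, so we fall all the way through to a single job
    if isinstance(cv_folds, int):
        return min(grid_search_folds, cv_folds)
    return 1

def num_grid_jobs_fixed(grid_search_folds, cv_folds):
    # hypothetical fixed version: count the distinct folds the mapping
    # actually defines and cap the job count at that
    if isinstance(cv_folds, int):
        return min(grid_search_folds, cv_folds)
    return min(grid_search_folds, len(set(cv_folds.values())))

folds = {"ex1": "0", "ex2": "1", "ex3": "2", "ex4": "0"}
```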

SVC, 10 folds
11.7 s ± 72.1 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

SVC, folds file (buggy)
1min 47s ± 471 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

SVC, folds file (fixed)
43.1 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

So, there's a clear ~2.5-3x improvement with this bugfix. However, the folds-file case is still about 4x slower than when not specifying a folds file. Some slowdown is to be expected, but not that much, so I will continue to investigate.

desilinguist commented 7 years ago

@aoifecahill I think I figured out why things are slower when you specify a CV folds file, above and beyond the bug fix I mentioned in my previous comments.

It's best explained in the context of an example where we are doing, say, 10-fold cross-validation. Scenario 1 is when we just specify 10 as the number of folds to use for cross-validation (which is also the default for SKLL) and Scenario 2 is when we use a CV folds file that actually contains the same 10 folds but specified as a CSV file. Let's look at what happens in each of these scenarios in detail.

Scenario 1: 10 folds specified as integer

  1. In the cross_validate() method, we create ten 90% train, 10% test combinations using a (Stratified) K-Fold CV iterator.
  2. For each of these combinations, we call the train() method on the training sets.
  3. Inside train(), we use a 3-fold cross-validation to do grid search. 3 is the default number of grid search folds.

Scenario 2: 10 folds specified in a CV folds file

  1. In the cross_validate() method, we create ten 90% train, 10% test combinations using a Leave-One-Fold-Out CV iterator.
  2. For each of these combinations, we call the train() method on the training sets.
  3. HERE'S THE IMPORTANT BIT. Inside train(), we again use a Leave-One-Fold-Out CV iterator for grid search. This means that you are actually doing a 9-fold grid-search in this particular case as opposed to a 3-fold in scenario 1.
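The difference in the number of inner grid-search splits can be checked directly with scikit-learn's CV iterators (a sketch with made-up data, not SKLL code):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

X = np.zeros((90, 2))                  # one 90%-train partition
groups = np.repeat(np.arange(9), 10)   # the 9 remaining outer folds

# Scenario 1: plain 3-fold CV for grid search -> 3 inner splits
n_splits_int = KFold(n_splits=3).get_n_splits(X)

# Scenario 2: leave-one-fold-out over the 9 groups -> 9 inner splits
n_splits_file = LeaveOneGroupOut().get_n_splits(X, groups=groups)
```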

I confirmed the above empirically by examining the grid_searcher object that's computed in train().

Here's the object as generated in Scenario 1:


# there are 30 total combinations of gamma (6 values) and C (5 values) in the grid
ipdb> len(grid_searcher.cv_results_['params'])
30

# there are 3 train/test splits for a 3-fold xval in grid search as expected
ipdb> [k for k in grid_searcher.cv_results_ if k.startswith('split')]
['split0_test_score', 'split1_test_score', 'split2_test_score',
 'split0_train_score', 'split1_train_score', 'split2_train_score']

And here's the same object as generated in Scenario 2:


# still 30 total combinations of gamma (6 values) and C (5 values) in the grid
ipdb> len(grid_searcher.cv_results_['params'])
30

# now there are 9 train/test splits because we are using leave-one-out over 9 groups.
ipdb> [k for k in grid_searcher.cv_results_ if k.startswith('split')]
['split0_test_score', 'split1_test_score', 'split2_test_score', 
 'split3_test_score', 'split4_test_score', 'split5_test_score', 
 'split6_test_score', 'split7_test_score', 'split8_test_score', 
 'split0_train_score', 'split1_train_score', 'split2_train_score', 
 'split3_train_score', 'split4_train_score', 'split5_train_score', 
 'split6_train_score', 'split7_train_score', 'split8_train_score']

So, in short, things are slower when you specify a CV folds file because you are doing much more work in the inner grid-search cross-validation loop.

aoifecahill commented 7 years ago

Interesting, thanks for figuring it out!

My hunch is that the extra work when specifying the folds file is unintentional and that we probably want to still be doing roughly the same amount of work as when we specify just the number of folds.

desilinguist commented 7 years ago

It's a little trickier than that because train() is both a direct entry point for run_experiment() (via the train task) and an indirect one via cross_validate(). And it looks like we allow cv_folds_file to also be used to define the grid-search folds for regular training, i.e., users can specify cv_folds_file with task=train and SKLL will use leave-one-fold-out instead of 3-fold cross-validation for grid search even then.

I see two solutions:

(1) We decide we don't care about externally defined folds for grid search when task=train, so we change train() to always use 3-fold cross-validation for grid search (or whatever number the user specifies via grid_search_folds in the config file; the default is 3).

(2) We decide that externally defining grid-search folds for task=train is useful but that it's not useful for the internal second-level cross-validation loop for task=cross_validate. So, we allow the former but default to 3-fold for the latter.

Personally, I don't think (2) makes sense - either you care about explicitly defining the actual grid-search folds when training the model or you don't. However, maybe I am missing some use case where this would matter. Please chime in if you think you have an example of such a use case or if you have any other thoughts/questions.

aloukina commented 7 years ago

To me, the most common reason for specifying folds is to ensure that the folds are independent in terms of one or more variables so that the model does not learn to differentiate based on a confound: e.g., data from the same person always goes in the same fold. From this point of view, it makes sense to me to keep the same folds for both loops, as is done now, since it makes for a cleaner experiment.
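This grouping requirement is exactly what scikit-learn's GroupKFold provides; a quick sketch with made-up per-person data (not SKLL code):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((12, 2))
persons = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 3)

# with one group per fold, each person's samples land in a single test fold,
# so the model never sees the same person in both train and test
test_folds = [set(persons[test_idx])
              for _, test_idx in GroupKFold(n_splits=4).split(X, groups=persons)]
```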

I don't see an issue with longer timings in exchange for this level of control, but I would describe this in the documentation and suggest that users use a lower N for the outer folds if they want to reduce the run time while still controlling fold composition.

An alternative could be to ask the user to provide nested folds - the .csv folds file would also contain a column indicating the fold for the inner loop. This way the user can control the number and composition of folds in both loops.

desilinguist commented 7 years ago

@aloukina thanks for the insightful comment. I agree that it definitely makes for a cleaner experiment if you use the same folds file for both the inner and outer loops, especially when you want to ensure that the data from a fold is never in both the train and test partitions at the same time. So, perhaps your suggestion of leaving things as they are but making the documentation more explicit is the right thing to do here.

I don't think we want to add nested folds. That's a little too complicated and I am not really sure anyone would use it, TBH.

@aoifecahill do you have any further thoughts about this? Anyone else?

aloukina commented 7 years ago

How easy would it be to let users choose between using the same folds for both loops, or specifying N folds for the inner loop but losing control over fold composition?

desilinguist commented 7 years ago

It's probably doable, but do we really want to allow the "wrong" thing just to save time? That's what I was alluding to earlier - either you want to be all in or not at all. What's the use case where you might want to be in the middle?

aloukina commented 7 years ago

If you pre-specified folds because of other considerations (e.g., you already wrote up a description for each fold and don't want to re-write it) and don't really care about their composition as such, but do care about timing? I suppose this might be the reason for @aoifecahill's concern in the first place. But I generally agree with @desilinguist.

aoifecahill commented 7 years ago

Yes, that's exactly my situation. My fold ids for the cross validation experiment come from an external process, and there are 10 of them, but I don't necessarily want to do leave-one-out cross validation for the tuning during each fold. I would be happy for that to be fixed at 3 like it is if you simply specify the number of folds. Because of the length of time it takes to run these experiments, running leave-one-out on both inner and outer loops adds days or even weeks to my cross validation experiments.

aoifecahill commented 7 years ago

I see how that could potentially be the "wrong" thing, depending on how the folds were generated in the first place. For me that is not the case though and so I would like to be able to control the inner and outer loops separately.

desilinguist commented 7 years ago

Hmm, I see @aoifecahill's point as well. Perhaps we need another setting, say cv_preserve_folds, which is set to true by default but can be overridden? I am not sure that's the best solution though. Any other suggestions, people?

desilinguist commented 7 years ago

Actually, given @aloukina's point, I guess it makes sense to continue to allow externally specified folds to be used inside train() for grid search even when task=train? Is that correct? @aoifecahill @aloukina ?

desilinguist commented 7 years ago

Okay, I thought a bit more about it and here's a possible solution:

  1. We rename cv_folds_file to folds_file and make it applicable to both the training and cross-validation tasks (as it already is, but we document the train() usage explicitly).

  2. We add another option called cv_preserve_folds, which defaults to true but can be overridden; when set to false, the inner-loop grid search does not use the same folds but instead uses the value of grid_search_folds (3 by default).

  3. If you specify folds_file with task=train, the cv_preserve_folds option is simply ignored.
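In config terms, the proposal might look something like this (option names are taken from this thread's proposal, so the option that actually ships could differ; the other settings are just illustrative):

```ini
[General]
experiment_name = Titanic_CV_Folds_File
task = cross_validate

[Input]
train_directory = train
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["SVC"]
# renamed from cv_folds_file per point 1 above
folds_file = titanic_folds.csv

[Tuning]
grid_search = true
# per point 2: set to false to use grid_search_folds (default 3) for the
# inner grid-search loop instead of the folds from folds_file
cv_preserve_folds = false

[Output]
results = results
```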

@aloukina said she's okay with this solution. @aoifecahill what do you think? Others?

aoifecahill commented 7 years ago

This sounds reasonable, thanks!

desilinguist commented 6 years ago

Okay, I have implemented the solution we discussed above (PR coming in the next hour or so).

I re-ran the timing experiments to check whether my solution works and it seems like it does.

SVC, 10 folds, no folds file
%timeit -n 2 run_configuration('cross_validate.cfg', local=True, quiet=True)
11.7 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

SVC, folds file with 10 folds, using folds file for inner grid search
%timeit -n 2 run_configuration('cross_validate_cv_file.cfg', local=True, quiet=True)
42.5 s ± 79.2 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

SVC, folds file with 10 folds, not using folds file for inner grid search
%timeit -n 2 run_configuration('cross_validate_cv_file_faster.cfg', local=True, quiet=True)
11.8 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

desilinguist commented 6 years ago

Addressed by #367.