Yeah, you're right. I will think about it and redesign the interface for this.
Awesome. May I suggest that, at minimum, it return the results from `evalfun` and `estfun` for each fold. Keeping all the models may seem somewhat wasteful, but I can't count how many times I've run some time-consuming CV code and realized after the fact that I wished I had computed some other metric (likelihood, precision, etc.) for each fold.
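For concreteness, here is a rough sketch of that idea (the name `cross_validate_full` and its signature are hypothetical, not MLBase API):

```julia
# Hypothetical sketch: keep every fold's fitted model together with its
# evalfun result, so other metrics can be computed later without rerunning CV.
function cross_validate_full(estfun::Function, evalfun::Function, n::Integer, gen)
    models  = Any[]
    results = Any[]
    for test_inds in gen                      # gen assumed to yield test-index vectors
        train_inds = setdiff(1:n, test_inds)
        model = estfun(train_inds)
        push!(models, model)
        push!(results, evalfun(model, test_inds))
    end
    return models, results
end
```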
Also, this package is great. I had about a tenth of this stuff done already when I found it. The idea of abstracting out these sorts of generics (instead of model fitting, like most ML libraries) is great. I may try to merge some of my code in, if you don't mind looking over the pull requests.
I agree. I would also expect something like this:
```julia
using Statistics  # mean and std (in Base on older Julia, a stdlib now)

function cross_validate(estfun::Function, evalfun::Function, n::Integer, gen)
    scores = zeros(length(gen))
    for (i, test_inds) in enumerate(gen)
        train_inds = setdiff(1:n, test_inds)   # train on everything outside the fold
        model = estfun(train_inds)             # fit on the training indices
        scores[i] = evalfun(model, test_inds)  # score on the held-out fold
    end
    return mean(scores), std(scores)
end
```
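For illustration, a hypothetical call of that sketch, with toy `estfun`/`evalfun` and a hand-rolled list of test-index folds (all names here are made up for the example):

```julia
# Toy data: score each fold by negative mean absolute error of a mean-predictor.
n = 9
y = collect(1.0:9.0)
folds = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # test indices for each fold

fit(train_inds) = mean(y[train_inds])                          # "model" = training mean
score(model, test_inds) = -mean(abs.(y[test_inds] .- model))   # negative MAE

m, s = cross_validate(fit, score, n, folds)  # mean and std of the per-fold scores
```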
The current implementation is more intended for hyper-parameter tuning, but that may be out of scope for a general MLBase package.
I will make the change to `cross_validate` today: it will return the entire vector of scores from the different runs, so that you can apply whatever statistics functions you like.
The `cross_validate` function has been updated to return a vector of scores (see http://mlbasejl.readthedocs.org/en/latest/crossval.html#cross_validate).
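A minimal sketch of the updated usage (the mean-predictor `estfun`/`evalfun` below are hypothetical; `Kfold` is the k-fold index iterator documented on the same page):

```julia
using MLBase, Statistics

# Hypothetical toy setup: the "model" is just the training-set mean,
# and each fold is scored by its mean squared error.
n = 100
y = randn(n)

estfun(train_inds) = mean(y[train_inds])
evalfun(model, test_inds) = mean((y[test_inds] .- model) .^ 2)

# cross_validate now returns the raw vector of per-fold scores,
# so any summary statistic can be applied afterwards.
scores = cross_validate(estfun, evalfun, n, Kfold(n, 10))
println("mean = ", mean(scores), ", std = ", std(scores))
```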
The latest version of MLBase is tagged as v0.4.2.
I'm confused about the use case for `cross_validate`. It doesn't seem useful to return the fold corresponding to the maximum or minimum value of `evalfun`, since this essentially just tells you which fold is easiest. I think it would make more sense to return an array containing the result of applying `evalfun` to the model returned by `estfun` for each split. That way the user can do things like average out-of-sample accuracy, compute variance, etc. Thoughts?