JuliaStats / MLBase.jl

A set of functions to support the development of machine learning algorithms
MIT License

cross_validate use case? #2

Closed thomlake closed 10 years ago

thomlake commented 10 years ago

I'm confused about the use case for cross_validate. It doesn't seem useful to return the fold corresponding to the maximum or minimum value of evalfun, since this essentially just tells you which fold is easiest. I think it would make more sense to return an array containing the result of applying evalfun to the model returned by estfun for each split. That way the user can do things like average out-of-sample accuracy, compute variance, etc. Thoughts?

lindahua commented 10 years ago

Yah, you are right. I will think about it and redesign the interface for this.

thomlake commented 10 years ago

Awesome. May I suggest that, at minimum, it return the results of both evalfun and estfun for each fold. Keeping all the models may seem somewhat wasteful, but I don't know how many times I've run some time-consuming CV code and then realized after the fact that I wished I had computed some other metric (likelihood, precision, etc.) for each fold.
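A rough sketch of what keeping the per-fold models buys you, for illustration only. This is not MLBase code: `cross_validate_keep`, `fit_mean`, `mse`, and `mae` are hypothetical names, the data is a toy vector, and the "model" is assumed to be nothing more than the mean of the training targets.

```julia
# Hypothetical helper, not part of MLBase: keeps every fitted model next to its score.
function cross_validate_keep(estfun::Function, evalfun::Function, n::Integer, folds)
    models = Any[]
    scores = Float64[]
    for test_inds in folds
        # train on every index that is not held out for this fold
        train_inds = setdiff(1:n, test_inds)
        model = estfun(train_inds)
        push!(models, model)
        push!(scores, evalfun(model, test_inds))
    end
    return models, scores
end

# Toy data (made up): the "model" is just the mean of the training targets.
y = collect(1.0:12.0)
folds = [1:4, 5:8, 9:12]
fit_mean(train_inds) = sum(y[train_inds]) / length(train_inds)
mse(model, test_inds) = sum((y[test_inds] .- model) .^ 2) / length(test_inds)

models, scores = cross_validate_keep(fit_mean, mse, length(y), folds)

# A metric chosen after the fact, computed from the stored models without refitting.
mae(model, test_inds) = sum(abs.(y[test_inds] .- model)) / length(test_inds)
mae_scores = [mae(m, f) for (m, f) in zip(models, folds)]
```

The second metric reuses `models`, so the expensive `estfun` calls are not repeated. The cost is holding one model per fold in memory.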

Also, this package is great. I had about a tenth of this stuff already done when I found it. The idea of abstracting out these sorts of generics (instead of model fitting, like most ML libs) is great. I may try to merge some of it in, if you don't mind looking over the pull requests.

BigCrunsh commented 10 years ago

I agree. I would also expect something like this:

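using Statistics  # mean and std live in the Statistics stdlib on Julia 1.x

function cross_validate(estfun::Function, evalfun::Function, n::Integer, gen)
    # one score per fold; assumes gen supports length()
    scores = zeros(length(gen))

    for (i, test_inds) in enumerate(gen)
        # train on every index that is not held out for this fold
        train_inds = setdiff(1:n, test_inds)
        model = estfun(train_inds)
        scores[i] = evalfun(model, test_inds)
    end

    return mean(scores), std(scores)
end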

The current implementation is geared more toward hyper-parameter tuning, which might be out of scope for a general package like MLBase.

lindahua commented 10 years ago

I will make the change to cross_validate today so that it returns the entire vector of scores, one per run, and you can apply whatever statistics functions you like.

lindahua commented 10 years ago

The cross_validate function has been updated to return a vector of scores.
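A minimal sketch of what a caller can do with such a vector, using only Julia's Statistics standard library. The `scores` values here are invented stand-ins rather than output from a real `cross_validate` call, and `avg`, `spread`, and `best_fold` are illustrative names.

```julia
using Statistics  # standard library: mean, std

# Stand-in for a vector of per-fold scores; the numbers are made up.
scores = [0.82, 0.79, 0.85, 0.80, 0.84]

avg = mean(scores)          # average score across folds
spread = std(scores)        # sample standard deviation (n - 1 denominator)
best_fold = argmax(scores)  # index of the highest-scoring fold
```

The best-fold index that the discussion above attributes to the earlier interface is still recoverable with `argmax` (or `argmin` for loss-style scores), so returning the full vector loses nothing.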

(See http://mlbasejl.readthedocs.org/en/latest/crossval.html#cross_validate)

Latest version of MLBase tagged as v0.4.2.