cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/
Other
546 stars 75 forks source link

Parallel GridSearchCV and RandomizedSearchCV now work #37

Closed ksteimel closed 6 years ago

ksteimel commented 6 years ago

This introduces changes that allow Parallel execution of fit! on GridSearchCV and RandomizedSearchCV objects.

The n_jobs parameters to these functions are not used (due to difficulties with getting the shared library and data across to newly spawned processes). Perhaps this could be changed in the future.

Right now, to do fit! on GridSearchCV or RandomizedSearchCV, julia has to be started in local cluster mode or in cluster mode e.g. julia -p 16 or julia --machinefile <file with IP addresses>.

I haven't done many pull requests so if there are things that you need me to do before this can be accepted, just let me know.

cstjean commented 6 years ago

That's a great addition, thank you for working on this! Do you know if parallel code can be tested on travis?

The current test failure is unrelated, but we should probably fix it before merging this PR...

ksteimel commented 6 years ago

Parallel code can be tested on travis: their vms and containers have two cores so they should be able to run julia in local cluster mode.

I made the changes you requested. Let me know if this looks like it needs more work. Thanks!

cstjean commented 6 years ago

The refactoring works as is, but for the record, in Julia you could move your fit_cv_rotation function inside of _fit!, and then you don't have to pass as many arguments:

function _fit!(self::BaseSearchCV, X, y, parameter_iterable)
    ...
    function fit_cv_rotations(parameters)
        ...
    end
    ...
end

Could you please add a test and modify .travis.yml accordingly?

ksteimel commented 6 years ago

Oh cool! I didn't know that. Sorry for the delay. I'm trying to figure out what would be the best way to test this. Would running GridSearch and RandomSearch on a single core and parallel and comparing the two results be ideal?

cstjean commented 6 years ago

Yes, that would be fine.

cstjean commented 6 years ago

BTW, testing is a bit tricky at the moment. See discussion in https://github.com/JuliaLang/METADATA.jl/pull/13293#issuecomment-365217473

In particular, if you rebase on top of master, then make sure you checkout LowRankModels.jl. Let me know if you face any issue.

ksteimel commented 6 years ago

I'm a little confused how to appropriately use the script parameter in travis so I'm trying to sort that out for now.  From: notifications@github.comSent: February 21, 2018 12:42 PMTo: ScikitLearn.jl@noreply.github.comReply-to: reply@reply.github.comCc: ksteimel@umail.iu.edu; author@noreply.github.comSubject: Re: [cstjean/ScikitLearn.jl] Parallel GridSearchCV and RandomizedSearchCV now work (#37) BTW, testing is a bit tricky at the moment. See discussion in JuliaLang/METADATA.jl#13293 (comment) In particular, if you rebase on top of master, then make sure you checkout LowRankModels.jl. Let me know if you face any issue.

—You are receiving this because you authored the thread.Reply to this email directly, view it on GitHub, or mute the thread.

cstjean commented 6 years ago

Yeah... Travis isn't working on master. Perhaps I should just merge this PR, and we can work on the tests once the DataFrames and ZLib storms have passed.

cstjean commented 6 years ago

Merged, thank you for your contribution!