cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/
Other
546 stars 75 forks source link

Add computation of intercept to Linear Regression #47

Closed PastelBelem8 closed 5 years ago

PastelBelem8 commented 5 years ago

Fixed Syntax

Add Functionality

PastelBelem8 commented 5 years ago

I believe the tests in the Iris notebook related to the Linear Regression are not according to the ScikitLearnBase interface, and therefore the test is failing.

Not all models implement every function. All input matrices X are vertical (of shape (n_sample, n_feature)).

Therefore we assumed that:

And for that the methods fit! and predict should be invoked like so:

fit!(linreg, X', y)
predict(lr, X')

Alternatively, the calls to fit! and predict can be the same, but the generation of X should be randn(3, 50) instead.

cstjean commented 5 years ago

Thank you for working on this.

I believe the tests in the Iris notebook related to the Linear Regression are not according to the ScikitLearnBase interface, and therefore the test is failing.

That's possible. This package depends on a lot of other packages, and I unfortunately don't have time to maintain it as well as I would like. But in this case, I believe the issue is https://github.com/JuliaPy/PyCall.jl/issues/555, which hasn't been fixed yet.

I'll try to have a better look at the code this weekend.

PastelBelem8 commented 5 years ago

I will implement a few tests using the proposed solution (: Let me know if I can help you in any other subject!

PastelBelem8 commented 5 years ago

@cstjean for the case of multi-target regressions what is the y matrix format? I believe it would be simpler if we sticked to the same format as the one required for X matrices, which is nfeatures x nsamples (at least according to the docs).

What's your opinion on this?

cstjean commented 5 years ago

Hey, I'm very happy to see enthusiasm for contributing to this package. Unfortunately, I'm busy with work, but I can lay out some outline of the state it's in. I wrote this package mostly by copy-pasting code from sklearn, and translating it into Julia. While the interface (ScikitLearnBase.jl) was carefully designed, and is "Julian", the internals are messy. sklearn handles an incredible amount of use cases. That yielded a lot of if issparse(M) ... else ..., some of which I've translated without entirely understanding their purpose. However, I also tried to translate part of sklearn's test suite, along with the notebooks (which, as you've seen, are automatically tested). Thus: if some code's purpose is not clear, and the tests still pass after simplifying it, we should do that.

I'm completely open to any kind of improvement to this package. The only constraint is that if some functionality is already covered by sklearn, we should match their interface as much as possible.

cstjean commented 5 years ago

Suggestions for improvement:

PastelBelem8 commented 5 years ago

Suggestions for improvement:

  • Make PRs to other ML packages, so that they can support the ScikitLearnBase interface
  • Update the interface / tests for sklearn 0.20.0. I think when I did most of the work, it was 0.18.0. In particular, there are a lot of worrying deprecation warnings when I run the tests.
  • Clean up the internals. Refactor it into something more "Julian", maybe?

Are these suggestions for the overall development, or would you like me to implement something differently? Also, I've not been able to successfully setup a platform to run the tests. In my other projects, I usually define a Project.toml and a Manifest.toml. Have you setup this differently?

cstjean commented 5 years ago

Are these suggestions for the overall development, or would you like me to implement something differently?

Suggestions for overall development, but you're welcome to work on anything else, of course.

Also, I've not been able to successfully setup a platform to run the tests. In my other projects, I usually define a Project.toml and a Manifest.toml. Have you setup this differently?

I can look into it this week-end, but normally, in your normal v1.0 environement (or a fresh environment), you can dev ScikitLearn this package, then test ScikitLearn. Does that not work?

cstjean commented 5 years ago

I tried running the tests on my laptop, and I hit https://github.com/JuliaPy/PyPlot.jl/issues/253. Is that the issue you had? I fixed some of the failures, so now Travis passes on Julia 1.0, on Linux. I don't know how to fix the OSX issue, unfortunately.

Your tests now fail here https://travis-ci.org/cstjean/ScikitLearn.jl/jobs/455637583#L752. If that can be fixed, then everything else LGTM.

PastelBelem8 commented 5 years ago

I believe the tests are now passing (the problem were some inconsistencies with the names being used in the tests).

cstjean commented 5 years ago

Thanks!

I just realized that your notebook isn't executed. We should do that and move it to /examples, then rerun this script to show it in the example list. Or maybe it should be linked to in the docs, as a tutorial.

ValdarT commented 5 years ago

Hey I am very surprised that the interface here is samples in columns and features in rows and about @PastelBelem8 referring to it being the convention in ScikitLearnBase. I know that JuliaStats org has this convention but ScikitLearn API has always been samples as rows and features as columns - this is so in Python and also in DecisionTree.jl, for example. And this is also how I understand the line "All input matrices X are vertical (of shape (n_sample, n_feature))."

cstjean commented 5 years ago

🙁 You're completely right, I didn't catch that during the review, and there was no test of the functionality before. That issue should be fixed. Thank you for the report!

PastelBelem8 commented 5 years ago

@ValdarT you are right. I am not sure how I did that! Thanks for pointing that out 👍