JuliaAI / MLJScikitLearnInterface.jl

MLJ Interface for ScikitLearn.jl

New models, automatically convert more fit results, and support for feature importances #61

Closed tylerjthomas9 closed 1 year ago

tylerjthomas9 commented 1 year ago

This PR does a few things:

- Expands model support
- Automatically converts more fitted params from Python back to Julia via pyconvert
- Expands test coverage for the existing clustering models

I also added support for seeing feature importances. It would be ideal to store the column names if a table was passed in, but I didn't want to add Tables.jl as a dependency.
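
To illustrate the kind of round trip involved, here is a rough sketch that uses scikit-learn directly through PythonCall rather than this package's wrapper code; the data and estimator are purely illustrative:

```julia
using PythonCall

# Fit a scikit-learn estimator directly through PythonCall (illustrative only).
ensemble = pyimport("sklearn.ensemble")
np = pyimport("numpy")

X = np.random.rand(100, 4)          # 100 samples, 4 features
y = np.random.randint(0, 2, 100)    # binary targets

clf = ensemble.RandomForestClassifier(n_estimators=10).fit(X, y)

# pyconvert maps the fitted numpy array back to a native Julia vector,
# which is what the interface now does for more of the fitted params.
importances = pyconvert(Vector{Float64}, clf.feature_importances_)
```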

When running the tests, I noticed that the MLJBase.jl upgrade to v1.0 broke them, so I fixed the tests by adding StatisticalMeasures.jl and restricting the MLJBase.jl compat.
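
For reference, the dependency changes amount to something like the following Pkg operations (a sketch only; the PR itself adjusts Project.toml, and Pkg.compat requires Julia 1.8 or later):

```julia
using Pkg

Pkg.add("StatisticalMeasures")   # measures moved out of MLJBase v1, needed by the tests
Pkg.compat("MLJBase", "1")       # restrict MLJBase to the 1.x series
```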

codecov[bot] commented 1 year ago

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (f2395ea) 95.07% compared to head (2afc496) 94.49%.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##              dev      #61      +/-   ##
==========================================
- Coverage   95.07%   94.49%   -0.58%
==========================================
  Files          13       14       +1
  Lines         264      309      +45
==========================================
+ Hits          251      292      +41
- Misses         13       17       +4
```

| Files | Coverage Δ |
|---|---|
| src/MLJScikitLearnInterface.jl | 100.00% <ø> (ø) |
| src/models/clustering.jl | 100.00% <100.00%> (ø) |
| src/models/discriminant-analysis.jl | 100.00% <ø> (ø) |
| src/models/ensemble.jl | 100.00% <100.00%> (ø) |
| src/models/gaussian-process.jl | 100.00% <ø> (ø) |
| src/models/linear-classifiers.jl | 100.00% <ø> (ø) |
| src/models/linear-regressors-multi.jl | 100.00% <ø> (ø) |
| src/models/linear-regressors.jl | 100.00% <ø> (ø) |
| src/models/misc.jl | 62.50% <ø> (ø) |
| src/models/svm.jl | 50.00% <ø> (ø) |
| ... and 2 more | |


ablaom commented 1 year ago

Okay, I'm slowly getting back to this PR. Thanks for your patience.

I also added support for seeing feature importances. It would be ideal to store the column names if a table was passed in, but I didn't want to add Tables.jl as a dependency.

Thanks for adding this!! Is this just for the new models or all wrapped models that support this?

My vote would be to add Tables.jl as a dep and use the column names. Note that you cannot use Tables.columnnames directly on a generic table. But here is the kind of thing you can do instead: https://github.com/JuliaAI/Imbalance.jl/blob/65696ca552ad5bc2c791ecdf3a761e28a6e012d6/src/table_wrappers.jl#L19
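
For example, something along these lines (just a sketch mirroring the linked helper; the function name is made up):

```julia
using Tables

# Tables.columnnames is only guaranteed to work on a columns/rows accessor,
# so materialize one with Tables.columns before asking for the names:
feature_names(X) = collect(Tables.columnnames(Tables.columns(X)))

feature_names((x1 = [1.0, 2.0], x2 = [3.0, 4.0]))   # [:x1, :x2]
```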

tylerjthomas9 commented 1 year ago

Is this just for the new models or all wrapped models that support this?

I added it for every model I found with feature importances; it ended up being mostly the ensemble models. I searched the scikit-learn docs for feature_importance to find them all: https://scikit-learn.org/stable/search.html?q=feature_importance.

ablaom commented 1 year ago

Is feature_importances the right python attribute name?

using MLJBase
import MLJScikitLearnInterface as SK
model = SK.RandomForestClassifier()
@assert reports_feature_importances(model)
mach = machine(model, make_blobs()...) |> fit!

julia> feature_importances(mach)
ERROR: Python: AttributeError: 'RandomForestClassifier' object has no attribute 'feature_importances'
Python stacktrace: none
Stacktrace:
 [1] pythrow()
   @ PythonCall ~/.julia/packages/PythonCall/qTEA1/src/err.jl:94
 [2] errcheck
   @ ~/.julia/packages/PythonCall/qTEA1/src/err.jl:10 [inlined]
 [3] pygetattr(x::PythonCall.Py, k::String)
   @ PythonCall ~/.julia/packages/PythonCall/qTEA1/src/abstract/object.jl:60
 [4] getproperty(x::PythonCall.Py, k::Symbol)
   @ PythonCall ~/.julia/packages/PythonCall/qTEA1/src/Py.jl:261
 [5] feature_importances(#unused#::RandomForestClassifier, ::Tuple{PythonCall.Py, CategoricalArrays.CategoricalValue{Int64, UInt32}, Nothing}, r::NamedTuple{(:names,), Tuple{Vector{Symbol}}})
   @ MLJScikitLearnInterface ~/.julia/packages/MLJScikitLearnInterface/YALzX/src/macros.jl:343
 [6] _feature_importances(model::RandomForestClassifier, fitresult::Tuple{PythonCall.Py, CategoricalArrays.CategoricalValue{Int64, UInt32}, Nothing}, report::NamedTuple{(:names,), Tuple{Vector{Symbol}}})
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:923
 [7] feature_importances(mach::Machine{RandomForestClassifier, true})
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:911
 [8] top-level scope
   @ REPL[28]:1

ablaom commented 1 year ago

This can be considered out of scope, but we could reasonably view the coefficients in a linear regression model as feature importances as well.

Edit: I mean the absolute values of the coefficients.

tylerjthomas9 commented 1 year ago

This can be considered out of scope, but we could reasonably view coefficients in a linear regression model as feature importances as well.

I was wondering this as well. Is there a precedent for doing this with MLJ models? I feel like it would make sense to scale the coefficients for a lot of the linear models by the mean of the variable, to normalize the feature importance for the scale of the variable. Otherwise, variables with extremely small values would appear more important, because they get larger coefficients for a similar impact.

ablaom commented 1 year ago

I was wondering this as well. Is there a precedent for doing this with MLJ models? I feel like it would make sense to scale the coefficients for a lot of the linear models by the mean of the variable, to normalize the feature importance for the scale of the variable. Otherwise, variables with extremely small values would appear more important, because they get larger coefficients for a similar impact.

Ah, you are right! My naive suggestion is flawed, as it is not scale-invariant. Wouldn't the standard deviation be more appropriate than the mean (which could be zero, even for a strongly informative feature)?
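
To make that concrete, a sketch of the scale-invariant version (a hypothetical helper, not something in the PR):

```julia
using Statistics

# Weight each |coefficient| by the standard deviation of its feature:
# rescaling a column by k multiplies its coefficient by 1/k and its std by k,
# so the resulting score does not change.
std_scaled_importances(coefs::AbstractVector, X::AbstractMatrix) =
    abs.(coefs) .* vec(std(X; dims=1))
```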

tylerjthomas9 commented 1 year ago

Is feature_importances the right python attribute name?

I was just handling the fitresult incorrectly. I have fixed it, and your code should now work.

tylerjthomas9 commented 1 year ago

Ah, you are right! My naive suggestion is flawed, as it is not scale-invariant. Wouldn't the standard deviation be more appropriate than the mean (which could be zero, even for a strongly informative feature)?

I think scaling them with standard deviation would be good. I'll try adding it tomorrow.

Also, it looks like something broke on the Julia nightly build (not on this library's side), so all of those tests are failing.

ablaom commented 1 year ago

Thanks for the substantial additions for the linear models.

I see that you have added classifiers as well as regressors. I think we need to be careful here. For example, in logistic "regression" (LogisticClassifier) there is redundancy in the choice of coefficients - because different coefficients can lead, after the soft-maxing, to the same probabilistic predictions. How the coefficients are fixed is up to the implementation, and so we can't use these for feature importances, right?
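
Concretely, the softmax is invariant under adding the same constant to every class score, so adding a common vector to all of the per-class coefficient vectors leaves the predicted probabilities untouched. A quick check of that invariance:

```julia
softmax(z) = exp.(z) ./ sum(exp.(z))

z = [0.3, -1.2, 2.0]
softmax(z) ≈ softmax(z .+ 5.0)   # true: shifting every score changes nothing
```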

I have some ideas about how to define feature importances for probabilistic linear models, based on entropy, but this time I feel this really is out of scope 😄. Unless I'm missing something, I suggest we drop the support for classifiers.

tylerjthomas9 commented 1 year ago

You're right. The classifiers do come with a lot more complications. I have removed them for now.