tylerjthomas9 closed this PR 1 year ago.
Attention: 4 lines in your changes are missing coverage. Please review.
Comparison is base (f2395ea) 95.07% compared to head (2afc496) 94.49%.
Okay, I'm slowly getting back to this PR. Thanks for your patience.
I also added support for seeing feature importances. It would be ideal to store the column names if a table was passed in, but I didn't want to add Tables.jl as a dependency.
Thanks for adding this!! Is this just for the new models or all wrapped models that support this?
My vote would be to add Tables.jl as a dep and use the column names. Note that you cannot use `Tables.columnnames` directly on a generic table, but here is the kind of thing you can do instead: https://github.com/JuliaAI/Imbalance.jl/blob/65696ca552ad5bc2c791ecdf3a761e28a6e012d6/src/table_wrappers.jl#L19
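For reference, a minimal sketch of that pattern (the helper name here is hypothetical, not taken from the linked file):

```julia
using Tables

# Illustrative helper: Tables.columnnames must be applied to the columns
# object returned by Tables.columns, not to a generic table directly.
get_column_names(X) = collect(Tables.columnnames(Tables.columns(X)))
```

This works on any Tables.jl-compatible source, e.g. a NamedTuple of vectors.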
> Is this just for the new models or all wrapped models that support this?
I added it for every model that I found with feature importance. It ended up just being most of the ensemble models. I just searched for feature importance to find all of them https://scikit-learn.org/stable/search.html?q=feature_importance.
Is `feature_importances` the right Python attribute name?
```julia
using MLJBase
import MLJScikitLearnInterface as SK
model = SK.RandomForestClassifier()
@assert reports_feature_importances(model)
mach = machine(model, make_blobs()...) |> fit!
```

```julia-repl
julia> feature_importances(mach)
ERROR: Python: AttributeError: 'RandomForestClassifier' object has no attribute 'feature_importances'
Python stacktrace: none
Stacktrace:
 [1] pythrow()
   @ PythonCall ~/.julia/packages/PythonCall/qTEA1/src/err.jl:94
 [2] errcheck
   @ ~/.julia/packages/PythonCall/qTEA1/src/err.jl:10 [inlined]
 [3] pygetattr(x::PythonCall.Py, k::String)
   @ PythonCall ~/.julia/packages/PythonCall/qTEA1/src/abstract/object.jl:60
 [4] getproperty(x::PythonCall.Py, k::Symbol)
   @ PythonCall ~/.julia/packages/PythonCall/qTEA1/src/Py.jl:261
 [5] feature_importances(#unused#::RandomForestClassifier, ::Tuple{PythonCall.Py, CategoricalArrays.CategoricalValue{Int64, UInt32}, Nothing}, r::NamedTuple{(:names,), Tuple{Vector{Symbol}}})
   @ MLJScikitLearnInterface ~/.julia/packages/MLJScikitLearnInterface/YALzX/src/macros.jl:343
 [6] _feature_importances(model::RandomForestClassifier, fitresult::Tuple{PythonCall.Py, CategoricalArrays.CategoricalValue{Int64, UInt32}, Nothing}, report::NamedTuple{(:names,), Tuple{Vector{Symbol}}})
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:923
 [7] feature_importances(mach::Machine{RandomForestClassifier, true})
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:911
 [8] top-level scope
   @ REPL[28]:1
```
This can be considered out of scope, but we could reasonably view the coefficients in a linear regression model as feature importances as well.
Edit: I mean the absolute value of the coefficients.
> This can be considered out of scope, but we could reasonably view coefficients in a linear regression model as feature importances as well.
I was wondering this as well. Is there a precedent for doing this with MLJ models? I feel like it would make sense to scale the coefficients for a lot of the linear models with the mean of the variable. This would normalize the feature importance for the scale of the variable. Otherwise, variables with extremely low values would appear more important because they would have higher coefficients for similar impacts.
> I was wondering this as well. Is there a precedent for doing this with MLJ models? I feel like it would make sense to scale the coefficients for a lot of the linear models with the mean of the variable. This would normalize the feature importance for the scale of the variable. Otherwise, variables with extremely low values would appear more important because they would have higher coefficients for similar impacts.
Ah, you are right! My naive suggestion is flawed, as it is not scale invariant. Wouldn't the standard deviation be more appropriate than the mean (which could be zero, even for a strongly informative feature)?
> Is `feature_importances` the right Python attribute name?
I was just handling the fitresult incorrectly. I have fixed it, and your code should now work.
> Ah, you are right! My naive suggestion is flawed, as it is not scale invariant. Wouldn't the standard deviation be more appropriate than the mean (which could be zero, even for a strongly informative feature)?
I think scaling them with standard deviation would be good. I'll try adding it tomorrow.
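A sketch of what that scaling could look like (this is an assumption about the approach, and the helper name is hypothetical, not the package's implementation):

```julia
using Statistics

# Hypothetical helper: scale absolute coefficients by each feature's
# standard deviation so the resulting importances do not depend on the
# units of the features, then normalize them to sum to one.
function scaled_importances(coefs::AbstractVector, X::AbstractMatrix)
    stds = vec(std(X; dims=1))   # per-column standard deviation
    raw = abs.(coefs) .* stds    # scale-invariant raw importance
    return raw ./ sum(raw)
end
```

Rescaling a feature by a constant divides its coefficient by that constant but multiplies its standard deviation by it, so the product is unchanged.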
Also, it looks like something broke on the nightly build of Julia (not on this library's side), so all those tests are failing.
Thanks for the substantial additions for the linear models.
I see that you have added classifiers as well as regressors. I think we need to be careful here. For example, in logistic "regression" (`LogisticClassifier`) there is redundancy in the choice of coefficients, because different coefficients can lead, after the soft-maxing, to the same probabilistic predictions. How the coefficients are fixed is up to the implementation, and so we can't use these for feature importances, right?
I have some ideas for how to define a feature importance for probabilistic linear models, based on entropy, but this time I feel this really is out of scope 😄. Unless I'm missing something, I suggest we drop the support for classifiers.
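To illustrate the redundancy: adding the same constant to every class's logit cancels in the softmax ratio, so distinct coefficient sets can produce identical probabilistic predictions.

```julia
# Softmax is invariant under a constant shift of all logits:
# exp(z .+ c) ./ sum(exp(z .+ c)) == exp(z) ./ sum(exp(z))
softmax(z) = exp.(z) ./ sum(exp.(z))

z = [1.0, -0.5, 3.0]
softmax(z) ≈ softmax(z .+ 7.0)   # probabilities are unchanged
```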
You're right. The classifiers do come with a lot more complications. I have removed them for now.
This PR does a couple of things:

- Expand model support
- Automatically convert more fitted params from Python back to Julia via `pyconvert`
- Expand testing coverage for existing clustering models
I also added support for seeing feature importances. It would be ideal to store the column names if a table was passed in, but I didn't want to add Tables.jl as a dependency.
When running the tests, I noticed that the MLJBase.jl upgrade to v1 broke the tests, so I fixed them by adding StatisticalMeasures.jl and then restricted the MLJBase.jl compat.
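For context, restricting compat is a small change in the package's Project.toml; the bounds below are illustrative only, not the exact entries from this PR:

```toml
[compat]
MLJBase = "1"
StatisticalMeasures = "0.1"
```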