JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

@pipeline to accept multiple Supervised models #455

Closed darrencl closed 4 years ago

darrencl commented 4 years ago

Hi, I am implementing a correlation-based feature selection technique. The idea is to filter out variables whose (Pearson) correlation with the target variable is below a threshold, or to select the n variables with the highest correlation. I am more than happy to submit a PR to MLJModels to share this implementation if the maintainers would like to have it (I would probably implement model-based selection too, later). I just thought this could be useful for others. :)
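
For concreteness, here is a minimal sketch of the threshold-based selection logic on its own, outside of any MLJ interface (select_features and its threshold keyword are placeholder names, not part of any existing package):

using Statistics, Tables

# Return the names of the columns of table X whose absolute Pearson
# correlation with the target vector y meets the threshold.
function select_features(X, y; threshold=0.5)
    tbl = Tables.columntable(X)
    return [name for (name, col) in pairs(tbl)
            if abs(cor(collect(col), y)) >= threshold]
end

The "top n" variant would instead sort the columns by absolute correlation and keep the first n.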

Anyway, to implement this, I am extending MLJModels.Supervised due to the need to see the target variable y during fitting. However, it seems that @pipeline doesn't allow more than one Supervised model:

ERROR: LoadError: LoadError: ArgumentError: @pipeline error.
More than one component of the pipeline is a supervised model .
Stacktrace:
 [1] pipe_alert(::String) at /home/tdlukas/.julia/packages/MLJBase/XRxRK/src/composition/pipelines.jl:84
 [2] pipeline_preprocess(::Module, ::Expr, ::Bool) at /home/tdlukas/.julia/packages/MLJBase/XRxRK/src/composition/pipelines.jl:166
 [3] pipeline_preprocess(::Module, ::Expr, ::Expr) at /home/tdlukas/.julia/packages/MLJBase/XRxRK/src/composition/pipelines.jl:227
 [4] pipeline_(::Module, ::Expr, ::Expr) at /home/tdlukas/.julia/packages/MLJBase/XRxRK/src/composition/pipelines.jl:232
 [5] @pipeline(::LineNumberNode, ::Module, ::Vararg{Any,N} where N) at /home/tdlukas/.julia/packages/MLJBase/XRxRK/src/composition/pipelines.jl:332
 [6] include at ./boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1105
 [8] include(::Module, ::String) at ./Base.jl:31
 [9] exec_options(::Base.JLOptions) at ./client.jl:287
 [10] _start() at ./client.jl:460
in expression starting at /dmf/tri_services/Facilities/Imaging/Darren/DoD_classifier/compare_approaches.jl:214
in expression starting at /dmf/tri_services/Facilities/Imaging/Darren/DoD_classifier/compare_approaches.jl:195

Is there any workaround for this?

Thanks!

ablaom commented 4 years ago

Okay, there are a couple of good questions here (as I understand them):

  1. How to integrate into MLJ a model whose purpose is to transform the input data X, but which needs to look at the target variable y during training - a model which, furthermore, has no sensible predict method (a method returning something whose scitype matches target_scitype(MyModel), or a probabilistic version of the same).

  2. How to create a composite model that is basically non-branching but has multiple supervised elements.

I have opened this issue to address 1. If you adopt my suggestion there - that you implement the feature selector as an Unsupervised model but bundle the "target" into X - then you need not address issue 2 at all.
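
In case it helps, here is a rough sketch of what an Unsupervised implementation along those lines might look like, assuming the standard MLJModelInterface fit/transform contract; the type name and the target and threshold hyperparameters are placeholders:

using MLJModelInterface, Statistics, Tables
const MMI = MLJModelInterface

# Unsupervised selector; the target is assumed to be bundled into X as a
# column named by the `target` hyperparameter.
mutable struct CorrelationFeatureSelector <: MMI.Unsupervised
    target::Symbol
    threshold::Float64
end
CorrelationFeatureSelector(; target=:target, threshold=0.5) =
    CorrelationFeatureSelector(target, threshold)

function MMI.fit(model::CorrelationFeatureSelector, verbosity, X)
    tbl = Tables.columntable(X)
    y = collect(tbl[model.target])
    keep = [name for (name, col) in pairs(tbl) if name != model.target &&
            abs(cor(collect(col), y)) >= model.threshold]
    return keep, nothing, NamedTuple()   # fitresult, cache, report
end

function MMI.transform(::CorrelationFeatureSelector, keep, Xnew)
    tbl = Tables.columntable(Xnew)
    return NamedTuple{Tuple(keep)}(Tuple(tbl[name] for name in keep))
end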

In the case that you really do need to combine two Supervised models in a pipeline, your recourse would be to abandon @pipeline and define a new model type by first specifying a learning network, and then using @from_network to create the new composite type. This is well documented in the manual and in MLJTutorials. Learning networks are extremely flexible; see, e.g., this stacking example.
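
For orientation, a non-branching learning network with a supervised selector followed by a predictor looks roughly like this (the exact syntax, including the @from_network export step, varies between MLJ versions, so defer to the manual; selector_model and predictor_model are placeholders, and selector_model is assumed to be a Supervised model that implements transform):

using MLJ

Xs = source(X)
ys = source(y)

# first supervised component, used here for its transform (the feature selection)
mach1 = machine(selector_model, Xs, ys)
W = transform(mach1, Xs)

# second supervised component, the actual predictor
mach2 = machine(predictor_model, W, ys)
yhat = predict(mach2, W)

fit!(yhat)   # fitting the terminal node trains every machine in the network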

Oh, and yes, we would love you to implement an MLJ interface for your tool! Instructions are here.

darrencl commented 4 years ago

@ablaom Thanks a lot for your answer! I agree that a predict function is not sensible for this case either. I will try to implement this as per your recommendation, then!

The minor issue I see with this is the convenience of manipulating the data before the feature selection, e.g. with Standardizer. With the target y bundled into the input X, users need to specify the exact features to be transformed (since we don't want to transform the target).

darrencl commented 4 years ago

@ablaom Anyway, would it be possible to add another keyword argument to Standardizer to ignore features by name?

In relation to my use case, I have a pipeline that looks something like this:

@pipeline MyLogisticPipe(
        preprocessor=MyCustomTransformer(),
        scaler=Standardizer(),
        selector=CorrelationFeatureSelector(),
        model=logistic_model
    ) prediction_type=:probabilistic

As can be seen, the Standardizer comes after I pre-process my data, which turns an array of FIDs (in one column) into a spectrum (~1022-~4094 columns, depending on the parameters), so I can't pass in scaler.features, since the feature names are dynamic and only available within the pipeline. Therefore, I need to be able to tell Standardizer which feature to ignore by name (in this case my target, since it's bundled within the features).
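
For example (purely to illustrate the request; the ignore keyword shown here is hypothetical and does not exist in Standardizer at the time of writing):

scaler = Standardizer(ignore=[:target])   # standardize everything except the bundled target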

I can submit a PR to MLJModels.jl if you think this makes sense. :)

Thanks

ablaom commented 4 years ago

I can submit a PR to MLJModels.jl if you think this makes sense. :)

Makes sense to me. A PR would be most welcome!

Please review the brief contributing guidelines.

ablaom commented 4 years ago

@darrencl Can we close?

darrencl commented 4 years ago

@ablaom Yes I think this can be closed. Thanks!