Closed: darrencl closed this issue 4 years ago
Okay, there are a couple of good questions here (as I understand them):
1. How to integrate into MLJ a model whose purpose is to transform the input data `X` but which needs to see the target variable `y` - a model which, furthermore, has no sensible `predict` method (a method returning something whose scitype matches `target_scitype(MyModel)`, or a probabilistic version of the same).
2. How to create a composite model that is basically non-branching but has multiple supervised elements.
I have opened this issue to address 1. If you adopt my suggestion there - that you implement the feature selector as an `Unsupervised` model but bundle the "target" into `X` - then you need not address issue 2 at all.
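To make the suggestion concrete, here is a minimal sketch using the standard `MLJModelInterface` fit/transform contract. The model name, the `target` hyperparameter, and the correlation-threshold logic are my assumptions for illustration, not a prescribed design:

```julia
import MLJModelInterface as MMI
import Tables
using Statistics

# Hypothetical Unsupervised model: the target is bundled into X as an
# ordinary column, named by the `target` hyperparameter.
mutable struct CorrelationSelector <: MMI.Unsupervised
    target::Symbol
    threshold::Float64
end

function MMI.fit(model::CorrelationSelector, verbosity, X)
    cols = Tables.columntable(X)
    y = collect(cols[model.target])
    # Keep features whose absolute Pearson correlation with the bundled
    # target meets the threshold; never keep the target itself.
    selected = [name for name in keys(cols)
                if name != model.target &&
                   abs(cor(collect(cols[name]), y)) >= model.threshold]
    return selected, nothing, nothing    # fitresult, cache, report
end

MMI.transform(model::CorrelationSelector, fitresult, X) =
    MMI.selectcols(X, fitresult)
```

Because the model subtypes `Unsupervised`, it slots into `@pipeline` like any other transformer, while still "seeing" the target via the bundled column.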
In the case that you really do need to combine two `Supervised` models in a pipeline, your recourse would be to abandon `@pipeline` and define a new model type by first specifying a learning network, and then using `@from_network` to create the new composite type. This is well-documented in the manual and MLJTutorials. Learning networks are extremely flexible; see, e.g., this stacking example.
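As a rough illustration of such a network (the two model variables are placeholders, and the final export step is elided; see the manual for the exact `@from_network` syntax):

```julia
using MLJ

# X and y are your training data; selector_model and logistic_model
# are placeholders for the two supervised components.
Xs = source(X)
ys = source(y)

mach1 = machine(selector_model, Xs, ys)   # first supervised element
W = transform(mach1, Xs)

mach2 = machine(logistic_model, W, ys)    # second, trained downstream
yhat = predict(mach2, W)

fit!(yhat)   # trains all machines in the network in dependency order
# The trained network can then be exported as a standalone composite
# model type with @from_network (see the MLJ manual).
```

Unlike `@pipeline`, nothing here restricts how many machines are supervised, which is what makes learning networks the escape hatch for case 2.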
Oh, and yes, we would love you to implement an MLJ interface for your tool! Instructions are here.
@ablaom Thanks a lot for your answer! I agree that a `predict` function is not sensible for this case either. I will try to implement this as per your recommendation, then!
The minor issue I see with this is the convenience of manipulating the data before the feature selection, e.g. with `Standardizer`. With the target `y` bundled into the input `X`, users need to specify the exact features to be transformed (since we don't want to transform the target).
@ablaom Anyway, is it possible to add another keyword argument to `Standardizer` to ignore features by name?
In relation to my use case, my pipeline looks something like this:
```julia
@pipeline MyLogisticPipe(
    preprocessor=MyCustomTransformer(),
    scaler=Standardizer(),
    selector=CorrelationFeatureSelector(),
    model=logistic_model
) prediction_type=:probabilistic
```
As can be seen, the `Standardizer` comes after I pre-process my data, which turns an array of FID (in 1 column) into a spectrum (~1022-~4094 columns, depending on the parameters), so I couldn't pass in `scaler.features`, since the feature names are dynamic and only available within the pipeline. Therefore, I need to be able to tell `Standardizer` the feature name to ignore (in this case my target, since it's bundled within the features).
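For what it's worth, one hypothetical shape for such an option (the `ignore` keyword and the helper below are illustrative assumptions, not existing API at the time of writing):

```julia
# Hypothetical usage: standardize every feature *except* those listed.
#     scaler = Standardizer(features=[:target], ignore=true)

# The underlying selection logic would be simple set exclusion over
# whatever column names turn up at fit time:
features_to_standardize(all_names, ignored) =
    [name for name in all_names if !(name in ignored)]
```

For example, `features_to_standardize([:a, :b, :target], [:target])` gives `[:a, :b]`, so the dynamically generated spectrum columns are standardized while the bundled target passes through untouched.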
I can submit a PR to `MLJModels.jl` if you think this makes sense. :)
Thanks
> I can submit a PR to MLJModels.jl if you think this makes sense. :)
Makes sense to me. A PR would be most welcome!
Please review the brief contributing guidelines.
@darrencl Can we close?
@ablaom Yes I think this can be closed. Thanks!
Hi, I am implementing a correlation-based feature selection technique. The idea is to filter out variables whose (Pearson) correlation with the target variable is less than a threshold, or to select the `n` variables with the highest correlation. I am more than happy to submit a PR to `MLJModels` to share this implementation if the maintainers would like to have it (I would probably implement model-based selection too, later). I just thought this could be useful for others. :)

Anyway, to implement this, I am extending `MLJModels.Supervised` due to the need to see the target variable `y` during fitting. However, it seems that the `Pipeline` doesn't allow having multiple `Supervised` models. Is there any workaround for this?

Thanks!
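The two selection modes described above can be sketched in plain Julia (the function names are mine; this is just the core logic, independent of any MLJ interface):

```julia
using Statistics

# Indices of the columns of matrix X whose absolute Pearson correlation
# with the target y is at least `threshold`.
function select_by_threshold(X::AbstractMatrix, y::AbstractVector, threshold)
    [j for j in 1:size(X, 2) if abs(cor(X[:, j], y)) >= threshold]
end

# Indices of the n columns with the highest absolute correlation with y.
function select_top_n(X::AbstractMatrix, y::AbstractVector, n::Integer)
    scores = [abs(cor(X[:, j], y)) for j in 1:size(X, 2)]
    partialsortperm(scores, 1:min(n, length(scores)), rev=true)
end
```

Wrapping this logic in a model's `fit`, with `transform` restricting to the selected columns, is the part that runs into the pipeline question above.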