JuliaAI / MLJModelInterface.jl

Lightweight package to interface with MLJ
MIT License

Remove missing from detector target scitype #123

Closed. davnn closed this 2 years ago.

ablaom commented 2 years ago

I wonder if you could say more about your reasons for the change (I have read your comment). What is the issue you are running into?

davnn commented 2 years ago

There are places where the target_scitype is checked to be e.g. <:Finite, for example in https://github.com/JuliaAI/MLJBase.jl/blob/8de6c5aa8515b89172abdcd27359f034d95f30c3/src/composition/models/stacking.jl#L246. And there are places that rely on glb to infer the composite scitypes.

In both cases, I stumbled into the problem that <:Union{Missing, OrderedFactor{2}} is not a supertype of <:Finite and <:Finite is not a supertype of <:Union{Missing, OrderedFactor{2}}, so glb returns Unknown. The eltype of AbstractVector{<:T} is Any, so I currently have no ideas for "smart type manipulation". By the way, the current <:Union{Missing, OrderedFactor{2}} is not even a supertype of <:OrderedFactor{2}, so it would have to be <:Union{Missing, <:OrderedFactor{2}}.
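To make the failing relations concrete, here is a minimal sketch using toy stand-ins for the scitype hierarchy (the abstract types below mirror the ScientificTypesBase names but are redefined locally, so this is an illustrative assumption, not the real package):

```julia
# Toy stand-ins for the scitype hierarchy (assumed shapes, not the
# actual ScientificTypesBase definitions):
abstract type Finite{N} end
abstract type OrderedFactor{N} <: Finite{N} end

# Without Missing, the expected relation holds:
@assert OrderedFactor{2} <: Finite

# Widening the target with Missing breaks the relation in both
# directions, so a glb built on `<:` checks finds no common bound
# and has to fall back to something like Unknown:
@assert !(Union{Missing,OrderedFactor{2}} <: Finite)
@assert !(Finite <: Union{Missing,OrderedFactor{2}})
```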

I think the problem is pretty fundamental and there should be a separate issue explaining it. For the detector case, I think the easiest solution would be to drop missing value support for now until we have figured something out.

What do you think?

ablaom commented 2 years ago

@davnn Thanks for nicely explaining the issues here, and for taking the time to wrap your head around the Stacking code. I believe I now understand the main points.

I feel that having an interface to handle semi-supervised detectors is more important than having stacks play nicely with detectors, no? And I don't see a way to accommodate the semi-supervised case later on without introducing breaking changes to MMI, which I'm pretty keen to avoid in the near future.

It seems to me that the two issues you mentioned are endemic to stacking and we should at least try to do our best to find remedies there. It's possible the glb issue is tricky. But if the target_scitype for the stack is Unknown in this (somewhat exotic) case then the stack should still run (even without warnings) as MLJBase is pretty forgiving about Unknown scitypes.

I think the other issue you mentioned re pre_judge_transform is a dispatch problem we can fix by widening the type to include missing. Can't we do

pre_judge_transform(ŷ::Node, ::Type{<:Probabilistic}, ::Type{<:AbstractArray{<:Union{Missing,Finite}}}) =
    node(ŷ -> pdf(ŷ, levels(first(ŷ))), ŷ)

instead of

pre_judge_transform(ŷ::Node, ::Type{<:Probabilistic}, ::Type{<:AbstractArray{<:Finite}}) =
    node(ŷ -> pdf(ŷ, levels(first(ŷ))), ŷ)

?

Does this not fix things? Or perhaps there are other roadblocks somewhere else?
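For what it's worth, the dispatch problem and the proposed widening can be sketched with toy types (the names below are local stand-ins, not the actual MLJBase code, and `handles` is a hypothetical helper playing the role of `pre_judge_transform`):

```julia
# Toy stand-ins for the scitype hierarchy (assumed, not the real package):
abstract type Finite{N} end
abstract type OrderedFactor{N} <: Finite{N} end

# `handles` stands in for the dispatch done by pre_judge_transform:
handles(::Type{<:AbstractArray{<:Finite}}) = true
handles(::Type) = false

# the narrow method misses the Missing-widened target type ...
@assert !handles(AbstractVector{Union{Missing,OrderedFactor{2}}})

# ... but adding a widened method catches it:
handles(::Type{<:AbstractArray{<:Union{Missing,Finite}}}) = true
@assert handles(AbstractVector{Union{Missing,OrderedFactor{2}}})
```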

By the way, the current <:Union{Missing, OrderedFactor{2}} is not even a supertype of <:OrderedFactor{2}, thus it would have to be <:Union{Missing, <:OrderedFactor{2}}.

I'm not sure I understand the difference:

julia> T1 = Union{Missing,OrderedFactor{2}}
Union{Missing, OrderedFactor{2}}

julia> T2 = Union{Missing,<:OrderedFactor{2}}
Union{Missing, var"#s275"} where var"#s275"<:OrderedFactor{2}

julia> T1 <: T2
true

julia> T2 <: T1
true

But if this small change here helps, that's fine with me.

davnn commented 2 years ago

It seems to me that the two issues you mentioned are endemic to stacking

I don't think so, unfortunately. I'm stumbling over the same problems when adding EnsembleModel support. I'm not sure it makes sense to "fake" missing support in various places; it's not that a Stacking model would support missing targets overall.

But if the target_scitype for the stack is Unknown in this (somewhat exotic) case then the stack should still run (even without warnings) as MLJBase is pretty forgiving about Unknown scitypes.

It does run, with the warning "Could not infer target_scitype of the stack". By the way, of the > 20 supported outlier detection models, only two support missing targets (in fitting; in evaluation I don't think missing targets make sense anyway). I would probably set the default target scitype to represent the more common case, or not? We could always override the target scitype for semi-supervised models (which are more niche).

And I don't see a way to accommodate the semi-supervised case later on without introducing breaking changes to MMI, which I'm pretty keen to avoid in the near future.

The semi-supervised case could simply be overridden by models that support it. We could also drop the default target_scitype in MMI; I think it should be no problem to define it on a per-algorithm basis.
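A per-model override could look something like the following sketch (all type and model names here are hypothetical placeholders; the real trait is MMI's `target_scitype`, but the hierarchy is redefined locally so the snippet is self-contained):

```julia
# Toy stand-ins for the scitype hierarchy and detector models (assumed):
abstract type Finite{N} end
abstract type OrderedFactor{N} <: Finite{N} end
abstract type Detector end
struct KNNDetector <: Detector end            # hypothetical common-case model
struct SemiSupervisedDetector <: Detector end # hypothetical niche model

# default trait reflects the common case: no missing targets
target_scitype(::Type{<:Detector}) = AbstractVector{<:OrderedFactor{2}}

# models that genuinely accept missing labels in fitting override the trait
target_scitype(::Type{SemiSupervisedDetector}) =
    AbstractVector{<:Union{Missing,OrderedFactor{2}}}

@assert target_scitype(KNNDetector) == AbstractVector{<:OrderedFactor{2}}
@assert target_scitype(SemiSupervisedDetector) ==
    AbstractVector{<:Union{Missing,OrderedFactor{2}}}
```

The point of the sketch: dropping Missing from the default costs nothing for the > 20 models that never see missing targets, while the two semi-supervised models opt in with a one-line method.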

I'm not sure I understand the difference:

You are right, I think I misinterpreted something there.

I think there are three possibilities for how we can continue with this:

I would prefer (2) or (3), as I'm not sure whether we will stumble into the same problem in more places. I would like to think about semi-supervised support more thoroughly before partially integrating it in some areas.

codecov-commenter commented 2 years ago

Codecov Report

Merging #123 (b86722b) into dev (2a77b27) will increase coverage by 3.30%. The diff coverage is 86.48%.


@@            Coverage Diff             @@
##              dev     #123      +/-   ##
==========================================
+ Coverage   87.29%   90.59%   +3.30%     
==========================================
  Files           8        8              
  Lines         299      287      -12     
==========================================
- Hits          261      260       -1     
+ Misses         38       27      -11     
| Impacted Files | Coverage Δ |
|----------------|------------|
| src/MLJModelInterface.jl | 100.00% <ø> (ø) |
| src/model_api.jl | 83.33% <ø> (+13.33%) ↑ |
| src/model_traits.jl | 86.36% <83.87%> (+14.66%) ↑ |
| src/metadata_utils.jl | 100.00% <100.00%> (ø) |

