cc @ablaom I've had a go at implementing evaluation metrics for conformal predictions. This was fairly straightforward thanks to MLJ's existing infrastructure: I essentially only had to add custom performance measures, and this seems to be working.
I have two questions, though, that you might be able to help me with.
Q1: Firstly, should I extend `MMI.evaluate` to assert that users only use one of the two applicable custom measures? Something like this:
```julia
function MMI.evaluate(model::ConformalModel, data...; measure, cache=true, kw_options...)
    @assert measure in available_measures "Performance measure not applicable to `ConformalModel`."
    # Forward to the generic method, bypassing this one to avoid recursion:
    invoke(MMI.evaluate, Tuple{Any,Vararg{Any}}, model, data...; cache=cache, measure=measure, kw_options...)
end
```
Q2: Secondly, while evaluation runs smoothly, the output it prints for my custom measures looks odd. Below is lifted from the example in the README:
```
julia> _eval = evaluate!(mach; measure=[emp_coverage, ssc], verbosity=0)
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌──────────────────────────────────────────────────────────────────────────────────────
│ measure                                                                              ⋯
├──────────────────────────────────────────────────────────────────────────────────────
│ \e[38;2;155;179;224m╭──── \e[38;2;227;172;141mFunction: \e[1m\e[38;5;12memp_coverage\e[22m\e[39m\e[39m\e[38;2;155;179;224m\e[38;2;155;179;224m ───────────────────────────────────────────╮\e ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[1m\e[2m(1) \e[22m\e[22m \e[1m\e[38;2;165;198;217memp_coverage\e[22m\e[39m\e[38;2;255;245;157m(\e[39mŷ, y\e[38;2;255;245;157m)\e[39m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[38;2;155;179;224m╰───────────────────────────────────────────────────────── \e[1m\e[37m1\e[22m\e[39m method\e[38;2;155;179;224m ────╯\e[39m\e[0m\e[39m\e[0m ⋯
│ \e[2m\e[32m──────────────────────────────── Docstring\e[0m \e[2m\e[32m────────────────────────────────\e[0m\e[22m\e[39m\e[22m\e[39m\e[0m ⋯
│ \e[2m\e[37m\e[48;2;38;38;38m───────────────────────────────────────────────────────────\e[22m\e[39m\e[49m ⋯
│ \e[0m\e[2m\e[37m\e[48;2;38;38;38m│\e[22m\e[39m\e[49m\e[0m\e[48;2;38;38;38m \e[49m\e[0m\e[48;2;38;38;38m\e[38;2;232;212;114memp_coverage\e[39m\e[38;2;227;136;100m(\e[39m\e[38;2;222;222 ⋯
│ \e[2m\e[37m\e[48;2;38;38;38m───────────────────────────────────────────────────────────\e[22m\e[39m\e[49m\e[0m ⋯
│ ⋯
│ Computes the empirical coverage for conformal predictions \e[3m\e[38;2;255;245;157m`\e[23m\e[39m\e[0m\e[38;2;222;222;222mŷ\e[39m\e[3m\e[38;2;255;245;157m`\e[23m\e[39m.\e[0m ⋯
│ ⋯
│ \e[38;2;155;179;224m╭──── \e[38;2;227;172;141mFunction: \e[1m\e[38;5;12msize_stratified_coverage\e[22m\e[39m\e[39m\e[38;2;155;179;224m\e[38;2;155;179;224m ───────────────────────────────╮\e ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[1m\e[2m(1) \e[22m\e[22m \e[1m\e[38;2;165;198;217msize_stratified_coverage\e[22m\e[39m\e[38;2;255;245;157m(\e[39mŷ, y\e[38;2;255;245;157m)\e[39m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[38;2;155;179;224m╰───────────────────────────────────────────────────────── \e[1m\e[37m1\e[22m\e[39m method\e[38;2;155;179;224m ────╯\e[39m\e[0m\e[39m\e[0m ⋯
│ ⋱
└──────────────────────────────────────────────────────────────────────────────────────
```
When I access the fields of `_eval`, the produced measurements all check out, but the report looks strange. Any idea what's happening here?
Merging #40 (11e5c2e) into main (1f101ec) will increase coverage by 0.27%. The diff coverage is 100.00%.
```diff
@@           Coverage Diff           @@
##             main      #40   +/-   ##
==========================================
+ Coverage   97.59%   97.86%   +0.27%
==========================================
  Files           8        9       +1
  Lines         374      422      +48
==========================================
+ Hits          365      413      +48
  Misses          9        9
```
Impacted Files | Coverage Δ |
---|---|
src/conformal_models/conformal_models.jl | 92.30% <ø> (ø) |
src/conformal_models/inductive_regression.jl | 100.00% <ø> (ø) |
src/conformal_models/model_traits.jl | 100.00% <ø> (ø) |
src/conformal_models/plotting.jl | 88.52% <ø> (ø) |
src/conformal_models/inductive_classification.jl | 98.41% <100.00%> (ø) |
...rc/conformal_models/transductive_classification.jl | 100.00% <100.00%> (ø) |
src/conformal_models/transductive_regression.jl | 100.00% <100.00%> (ø) |
src/conformal_models/utils.jl | 100.00% <100.00%> (ø) |
src/evaluation/evaluation.jl | 100.00% <100.00%> (ø) |
Keeping the branch open until the issue below is sorted out.
@pat-alt Great to hear about your progress!
> Q1: Firstly, should I extend `MMI.evaluate` to assert that users only use one of the two applicable custom measures?
Generally the kind of target proxy the measure is used for is articulated with the `prediction_type` trait. (Measures have traits, just like models. The manual mentions this, but you'll also want to look here if you're contributing new measures.) So, you would do something like:
```julia
StatisticalTraits.prediction_type(::Type{<:YourMeasureType}) = :probabilistic_set
```
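For concreteness, here is a minimal sketch of a struct-based measure carrying that trait (the type `EmpiricalCoverage` and its one-line implementation are illustrative only, not the package's actual code):

```julia
using Statistics: mean
import StatisticalTraits

# Hypothetical struct-based measure; instances are callable, as MLJ expects.
struct EmpiricalCoverage end

# ŷ is a vector of prediction sets, y the vector of true targets; the measure
# returns the fraction of observations whose true value lands in its set.
(::EmpiricalCoverage)(ŷ, y) = mean(in.(y, ŷ))

# Declare the kind of target proxy this measure consumes:
StatisticalTraits.prediction_type(::Type{EmpiricalCoverage}) = :probabilistic_set
```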
edited: The model version of this trait is already suitably overloaded here:
The `evaluate` apparatus in MLJBase should check that the model matches the measure and throw an error if it doesn't. Possibly, as this is a new target proxy type, the behaviour at MLJBase may need to be adjusted. The relevant logic lives approximately here:
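For what it's worth, the kind of guard I have in mind looks roughly like this (a sketch only; `check_measure_compatibility` is a hypothetical name, not the actual MLJBase internal):

```julia
import StatisticalTraits

# Hypothetical guard sketching the check evaluate could perform:
function check_measure_compatibility(model, measure)
    model_proxy = StatisticalTraits.prediction_type(typeof(model))
    measure_proxy = StatisticalTraits.prediction_type(typeof(measure))
    model_proxy == measure_proxy || throw(ArgumentError(
        "Measure with prediction type `$measure_proxy` cannot be used " *
        "with a model whose prediction type is `$model_proxy`."))
    return nothing
end
```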
> Q2:

Do you always see this rubbish, or just for your custom measures? Where are you viewing it: an ordinary terminal, VSCode, a notebook, something else? Could you please try `MLJ.color_off()` and see if that helps?
Thanks! I'll implement the trait, with the goal of contributing it upstream once it's sorted.
As for how this is displayed: I'm working in the VSCode REPL (with Term.jl) and only get this issue for my custom measures. `MLJ.color_off()` hasn't helped, I'm afraid. Perhaps it has to do with the fact that I haven't actually yet properly implemented the measures as outlined in the manual you linked. I'll have a go at that in #44.
Mmm. Not sure about the display issue. I doubt it's anything you are doing wrong. I don't have the problem in an emacs term REPL:
```
julia> evaluate!(mach; measure=[emp_coverage, ssc], verbosity=0)
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌───────────────────────────────────────────────────────────┬───────────┬──────
│ measure                                                   │ operation │ meas ⋯
├───────────────────────────────────────────────────────────┼───────────┼──────
│ emp_coverage (generic function with 1 method)             │ predict   │ 0.95 ⋯
│ size_stratified_coverage (generic function with 1 method) │ predict   │ 0.75 ⋯
└───────────────────────────────────────────────────────────┴───────────┴──────
```
Just had to define the performance measures and then tap into MLJ's `evaluate` machinery.
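Roughly, the whole flow looks like this (the data and atomic model below are placeholders for the README setup, not lifted from it):

```julia
using MLJ
using ConformalPrediction

# Toy classification data (placeholder; any MLJ-compatible data works):
X, y = make_blobs(100)

# Wrap an atomic probabilistic classifier as a conformal model:
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
conf_model = conformal_model(Tree())
mach = machine(conf_model, X, y)

# The custom measures plug straight into MLJ's resampling machinery:
_eval = evaluate!(mach; measure=[emp_coverage, ssc], verbosity=0)
_eval.measurement  # one aggregate value per measure
```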