cc @ablaom I've had a go at implementing evaluation metrics for conformal predictions. This was fairly straightforward thanks to MLJ's existing infrastructure: I essentially only had to add custom performance measures, and this seems to be working.
I have two questions, though, that you might be able to help me with.
Q1: Firstly, should I extend `MMI.evaluate` to assert that users only use one of the two applicable custom measures? Something like this:
```julia
function MMI.evaluate(model::ConformalModel, data...; measure, cache=true, kw_options...)
    @assert measure in available_measures "Performance measure not applicable to `ConformalModel`."
    # Forward to the generic method, bypassing this one to avoid recursion:
    invoke(MMI.evaluate, Tuple{Any,Vararg{Any}}, model, data...; cache=cache, measure=measure, kw_options...)
end
```
Q2: Secondly, while evaluation runs smoothly, the output it prints for my custom measures looks odd. Below is lifted from the example in the README:
```
julia> _eval = evaluate!(mach; measure=[emp_coverage, ssc], verbosity=0)
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌──────────────────────────────────────────────────────────────────────────────────────
│ measure                                                                              ⋯
├──────────────────────────────────────────────────────────────────────────────────────
│ \e[38;2;155;179;224m╭──── \e[38;2;227;172;141mFunction: \e[1m\e[38;5;12memp_coverage\e[22m\e[39m\e[39m\e[38;2;155;179;224m\e[38;2;155;179;224m ───────────────────────────────────────────╮\e ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[1m\e[2m(1) \e[22m\e[22m \e[1m\e[38;2;165;198;217memp_coverage\e[22m\e[39m\e[38;2;255;245;157m(\e[39mŷ, y\e[38;2;255;245;157m)\e[39m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[38;2;155;179;224m╰───────────────────────────────────────────────────────── \e[1m\e[37m1\e[22m\e[39m method\e[38;2;155;179;224m ────╯\e[39m\e[0m\e[39m\e[0m ⋯
│ \e[2m\e[32m──────────────────────────────── Docstring\e[0m \e[2m\e[32m────────────────────────────────\e[0m\e[22m\e[39m\e[22m\e[39m\e[0m ⋯
│ \e[2m\e[37m\e[48;2;38;38;38m───────────────────────────────────────────────────────────\e[22m\e[39m\e[49m ⋯
│ \e[0m\e[2m\e[37m\e[48;2;38;38;38m│\e[22m\e[39m\e[49m\e[0m\e[48;2;38;38;38m \e[49m\e[0m\e[48;2;38;38;38m\e[38;2;232;212;114memp_coverage\e[39m\e[38;2;227;136;100m(\e[39m\e[38;2;222;222 ⋯
│ \e[2m\e[37m\e[48;2;38;38;38m───────────────────────────────────────────────────────────\e[22m\e[39m\e[49m\e[0m ⋯
│ ⋯
│ Computes the empirical coverage for conformal predictions \e[3m\e[38;2;255;245;157m`\e[23m\e[39m\e[0m\e[38;2;222;222;222mŷ\e[39m\e[3m\e[38;2;255;245;157m`\e[23m\e[39m.\e[0m ⋯
│ ⋯
│ \e[38;2;155;179;224m╭──── \e[38;2;227;172;141mFunction: \e[1m\e[38;5;12msize_stratified_coverage\e[22m\e[39m\e[39m\e[38;2;155;179;224m\e[38;2;155;179;224m ───────────────────────────────╮\e ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[1m\e[2m(1) \e[22m\e[22m \e[1m\e[38;2;165;198;217msize_stratified_coverage\e[22m\e[39m\e[38;2;255;245;157m(\e[39mŷ, y\e[38;2;255;245;157m)\e[39m ⋯
│ \e[0m\e[38;2;155;179;224m│\e[39m\e[0m \e[0m\e[38;2;155;179;224m│\e[39m\e[0m ⋯
│ \e[38;2;155;179;224m╰───────────────────────────────────────────────────────── \e[1m\e[37m1\e[22m\e[39m method\e[38;2;155;179;224m ────╯\e[39m\e[0m\e[39m\e[0m ⋯
│ ⋱
└──────────────────────────────────────────────────────────────────────────────────────
```
When I access the fields of `_eval`, the produced measurements all check out, but the report looks strange. Any idea what's happening here?
Merging #40 (11e5c2e) into main (1f101ec) will increase coverage by 0.27%. The diff coverage is 100.00%.
```diff
@@           Coverage Diff           @@
##             main      #40   +/-   ##
==========================================
+ Coverage   97.59%   97.86%   +0.27%
==========================================
  Files           8        9       +1
  Lines         374      422      +48
==========================================
+ Hits          365      413      +48
  Misses          9        9
```
Impacted Files | Coverage Δ |
---|---|
src/conformal_models/conformal_models.jl | 92.30% <ø> (ø) |
src/conformal_models/inductive_regression.jl | 100.00% <ø> (ø) |
src/conformal_models/model_traits.jl | 100.00% <ø> (ø) |
src/conformal_models/plotting.jl | 88.52% <ø> (ø) |
src/conformal_models/inductive_classification.jl | 98.41% <100.00%> (ø) |
...rc/conformal_models/transductive_classification.jl | 100.00% <100.00%> (ø) |
src/conformal_models/transductive_regression.jl | 100.00% <100.00%> (ø) |
src/conformal_models/utils.jl | 100.00% <100.00%> (ø) |
src/evaluation/evaluation.jl | 100.00% <100.00%> (ø) |
Keeping the branch open until the issue below is sorted out.
@pat-alt Great to hear about your progress!
> Q1: Firstly, should I extend `MMI.evaluate` to assert that users only use one of the two applicable custom measures?
Generally the kind of target proxy the measure is used for is articulated with the `prediction_type` trait. (Measures have traits, just like models. The manual mentions this, but you'll also want to look here if you're contributing new measures.) So, you would do something like:
```julia
StatisticalTraits.prediction_type(::Type{<:YourMeasureType}) = :probabilistic_set
```
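For concreteness, here is a minimal sketch of a struct-based measure carrying that trait (the type `EmpiricalCoverage` and its one-line implementation are illustrative only, not the package's actual code):

```julia
using Statistics: mean
import StatisticalTraits

# Hypothetical struct-based measure; instances are callable, as MLJ expects.
struct EmpiricalCoverage end

# ŷ is a vector of prediction sets, y the vector of true targets; the measure
# returns the fraction of observations whose true value lands in its set.
(::EmpiricalCoverage)(ŷ, y) = mean(in.(y, ŷ))

# Declare the kind of target proxy this measure consumes:
StatisticalTraits.prediction_type(::Type{EmpiricalCoverage}) = :probabilistic_set
```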
edited: The model version of this trait is already suitably overloaded here:
The `evaluate` apparatus in MLJBase should check that the model matches the measure and throw an error if it doesn't. Possibly, as this is a new target proxy type, the behaviour at MLJBase may need to be adjusted. The relevant logic lives approximately here:
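For what it's worth, the kind of guard I have in mind looks roughly like this (a sketch only; `check_measure_compatibility` is a hypothetical name, not the actual MLJBase internal):

```julia
import StatisticalTraits

# Hypothetical guard sketching the check evaluate could perform:
function check_measure_compatibility(model, measure)
    model_proxy = StatisticalTraits.prediction_type(typeof(model))
    measure_proxy = StatisticalTraits.prediction_type(typeof(measure))
    model_proxy == measure_proxy || throw(ArgumentError(
        "Measure with prediction type `$measure_proxy` cannot be used " *
        "with a model whose prediction type is `$model_proxy`."))
    return nothing
end
```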
> Q2:

Do you always see this rubbish, or just for your custom measures? Where are you viewing it: an ordinary terminal, VSCode, a notebook, something else? Could you please try `MLJ.color_off()` and see if that helps?
Thanks! I'll implement the trait, with the goal of contributing it upstream once it's sorted.
As for how this is displayed: I'm working in the VSCode REPL (with Term.jl) and only get this issue for my custom measures. `MLJ.color_off()` hasn't helped, I'm afraid. Perhaps it has to do with the fact that I haven't actually yet properly implemented the measures as outlined in the manual you linked. I'll have a go at that in #44.
Mmm. Not sure about the display issue. I doubt it's anything you are doing wrong. I don't have the problem in an emacs term REPL:
```
julia> evaluate!(mach; measure=[emp_coverage, ssc], verbosity=0)
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌───────────────────────────────────────────────────────────┬───────────┬──────
│ measure                                                   │ operation │ meas ⋯
├───────────────────────────────────────────────────────────┼───────────┼──────
│ emp_coverage (generic function with 1 method)             │ predict   │ 0.95 ⋯
│ size_stratified_coverage (generic function with 1 method) │ predict   │ 0.75 ⋯
└───────────────────────────────────────────────────────────┴───────────┴──────
```
Just had to define the performance measures and then tap into MLJ's `evaluate` machinery.
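Roughly, the whole flow looks like this (the data and atomic model below are placeholders for the README setup, not lifted from it):

```julia
using MLJ
using ConformalPrediction

# Toy classification data (placeholder; any MLJ-compatible data works):
X, y = make_blobs(100)

# Wrap an atomic probabilistic classifier as a conformal model:
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
conf_model = conformal_model(Tree())
mach = machine(conf_model, X, y)

# The custom measures plug straight into MLJ's resampling machinery:
_eval = evaluate!(mach; measure=[emp_coverage, ssc], verbosity=0)
_eval.measurement  # one aggregate value per measure
```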