JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Proposed model #574

Open azev77 opened 4 years ago

azev77 commented 4 years ago

Suppose I get a new dataset in the mail today and want to see which brand-name distribution in Distributions.jl best fits it.

using Distributions, Random, HypothesisTests
using InteractiveUtils: subtypes   # `subtypes` needs InteractiveUtils outside the REPL

Uni = subtypes(UnivariateDistribution)
#Cts_Uni = subtypes(ContinuousUnivariateDistribution)
DGP_True = LogNormal(17, 7);
Random.seed!(123);
const d_train = rand(DGP_True, 1_000)
const d_test  = rand(DGP_True, 1_000)

Er = []; D_fit = [];
for d in Uni
    println(d)
    try
        D̂ = fit(d, d_train)   # throws for distributions without a `fit` method
        Score = [loglikelihood(D̂, d_test),
                 OneSampleADTest(d_test, D̂)            |> pvalue,
                 ApproximateOneSampleKSTest(d_test, D̂) |> pvalue,
                 ExactOneSampleKSTest(d_test, D̂)       |> pvalue,
                 #PowerDivergenceTest(d_test, lambda=1)  expects count data, so it is skipped
                 JarqueBeraTest(d_test)                |> pvalue   # only meaningful for Normal
        ];
        #TODO: compute a better score than out-of-sample loglikelihood plus p-values.
        push!(D_fit, [d, D̂, Score])
    catch e
        println(e, " ", d)
        push!(Er, (d, e))
    end
end

a = hcat(D_fit...)
M_names = a[1, :]; M_fit = a[2, :]; M_scores = a[3, :];
idx = sortperm(M_scores, rev=true);   # ranks by out-of-sample loglikelihood (first entry of each Score)
Dfit_sort = hcat(M_names[idx], M_fit[idx], M_scores[idx])
julia> Dfit_sort
11×3 Array{Any,2}:
 LogNormal              …  [-20600.7, 0.823809, 0.789128, 0.781033, 0.0]
 Gamma                     [-21159.4, 6.0e-7, 2.45426e-68, 1.23247e-69, 0.0]
 Cauchy                    [-24823.3, 6.0e-7, 2.91142e-213, 8.6107e-227, 0.0]
 InverseGaussian           [-26918.1, 6.0e-7, 0.0, 0.0, 0.0]
 Exponential               [-33380.3, 6.0e-7, 0.0, 0.0, 0.0]
 Normal                 …  [-40611.5, 6.0e-7, 1.32495e-213, 3.51792e-227, 0.0]
 Rayleigh                  [-61404.6, 6.0e-7, 0.0, 0.0, 0.0]
 Laplace                   [-2.03419e9, 6.0e-7, 1.49234e-138, 5.47197e-144, 0.0]
 DiscreteNonParametric     [-Inf, 6.0e-7, 0.197933, 0.193494, 0.0]
 Pareto                    [-Inf, 6.0e-7, 6.69184e-108, 3.7704e-111, 0.0]
 Uniform                …  [-Inf, 6.0e-7, 0.0, 0.0, 0.0]

Basically this is predicting Y given X = constant, except the prediction here is not a number but an (unconditional) distribution.

ablaom commented 4 years ago

In MLJ the plan is to view fitting a distribution as probabilistic supervised learning where the input is X=nothing - a single point with no information. The data you have above would be the target, labelled y, and the prediction yhat is a single (probabilistic) prediction. The API is set up for this already - see https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1 - but no-one has contributed a model yet.
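For concreteness, here is a minimal sketch of what such a model might look like against the generic MLJModelInterface supervised API. The name UnivariateFitter, the report contents, and the handling of the ignored input are all just illustrative, not an existing MLJ model; the exact conventions (how X = nothing is represented, the trait declarations) would need to follow the linked docs.

import MLJModelInterface
const MMI = MLJModelInterface
import Distributions

# Sketch only: a "model" whose sole hyperparameter is the distribution family to fit.
mutable struct UnivariateFitter <: MMI.Probabilistic
    distribution_type::Type{<:Distributions.UnivariateDistribution}
end

function MMI.fit(model::UnivariateFitter, verbosity, X, y)
    # X carries no information (X = nothing); all learning happens on the target y.
    fitresult = Distributions.fit(model.distribution_type, y)
    cache = nothing
    report = (loglikelihood = Distributions.loglikelihood(fitresult, y),)
    return fitresult, cache, report
end

# The probabilistic "prediction" is just the fitted distribution itself.
MMI.predict(model::UnivariateFitter, fitresult, Xnew) = fitresult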

Is this what you are after?

azev77 commented 4 years ago

Yes, that's it. Btw, this case with X = nothing can be generalized. For example: y = X*β + e, where the errors e are i.i.d. draws from F(θ), for a large class of probability distributions F. A sketch of that idea follows.
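One loose illustration of the generalization (again only a sketch, not an existing MLJ interface): estimate β by least squares, fit a parametric family to the residuals, and predict a location-shifted copy of the fitted error distribution for each new row x.

using Distributions, LinearAlgebra

# Sketch: linear predictor plus a parametric error distribution F(θ).
function fit_linear_with_errors(X, y, D::Type{<:ContinuousUnivariateDistribution})
    β̂ = X \ y          # least-squares coefficients
    ê = y - X * β̂      # residuals
    F̂ = fit(D, ê)      # fitted error distribution (requires a `fit` method for D)
    return β̂, F̂
end

# Conditional prediction for a new row x: the fitted error distribution
# shifted by the point prediction x'β̂.
predict_dist(β̂, F̂, x) = LocationScale(dot(x, β̂), 1.0, F̂)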