davidavdav / ROCAnalysis.jl

Receiver Operating Characteristics and functions for evaluation probabilistic binary classifiers
Other
31 stars 10 forks source link

Better way of handling missings, and restore support for Julia 0.7 #9

Closed DilumAluthge closed 6 years ago

DilumAluthge commented 6 years ago

This PR:

  1. Implements a better way of handling missings in DataFrames. If either column (target or score) has a missing value in a row, then we drop that row.
  2. Restores support for Julia 0.7.
  3. Configures Travis to test on Julia 0.7, 1.0, and nightly (with failures allowed on nightly).

You will notice that in src/dataframes.jl, I use splatting on lines 18 and 19. This was the only way that I could get the correct eltypes for t_complete and s_complete. If I used collect(...), then the eltype still ends up as Union{T, Missing}. However, when I use splatting, I get the correct eltype T. This is important because the struct TNT expects to be given arguments of type Vector{T} where T <: Real.

cc: @davidavdav @diegozea

davidavdav commented 6 years ago

I suppose collecting over skipmissing is slightly better than splatting, but I'm not sure.

    ok = .!(ismissing.(x[target]) .| ismissing.(x[score]))
    t = BitArray(x[target][ok])
    s = collect(skipmissing(x[score][ok]))
    TNT(s[t], s[.!t])

Also, I don't thing you should use .* for boolean types, even though it might boil down to .&.

It is a pain that you can't infer the type of the array without the Union{Missing, ...} part. At least, I can't. There is coalesce.()but I suppose that is similar in performance.

DilumAluthge commented 6 years ago

Here’s a thought: we know that the type of x[score] is Vector{Union{T, Missing}} where T <: Real. So the type of s (once we’ve removed the missings) is Vector{T} where T <: Real. So, if we want, we could convert(Vector{Cfloat}, s. And then we’d know the type of s in advance because we’d be forcing it via convert.

We could use Float32 or Float64 instead of Cfloat, but that might be less portable across platforms.

davidavdav commented 6 years ago

I thought about it, and I think the collect(skipmissing()) might not be that inefficient. It would be nice if we would have had access to the Array{T} part of the Array{Union{T, Missing}}---apparently, this is how it is represented in memory. Now the generator loops over the Array{UInt8} part, and generates every element of the Array{T} part which is then collected.

If we just carry out the loop ourselves, we could probably save one level of array-building. We might as well just build the tar and nontar scores in that one loop...

For now I've just implemented the code above.

DilumAluthge commented 6 years ago

There’s a function Missings.T that gives you the T from a Union{T, Missing}: https://github.com/JuliaData/Missings.jl/blob/master/src/Missings.jl

But that would mean we’d have to add Missings.jl as a dependency.

It’s worth thinking about.

In the mean time, could you tag a new release with attobot?

On Wed, Sep 5, 2018 at 10:32 David van Leeuwen notifications@github.com wrote:

I thought about it, and I think the collect(skipmissing()) might not be that inefficient. It would be nice if we would have had access to the Array{T} part of the Array{Union{T, Missing}}---apparently, this is how it is represented in memory https://julialang.org/blog/2018/06/missing. Now the generator loops over the Array{UInt8} part, and generates every element of the Array{T} part which is then collected.

If we just carry out the loop ourselves, we could probably save one level of array-building. We might as well just build the tar and nontar scores in that one loop...

For now I've just implemented the code above.

— You are receiving this because you authored the thread.

Reply to this email directly, view it on GitHub https://github.com/davidavdav/ROCAnalysis.jl/pull/9#issuecomment-418752889, or mute the thread https://github.com/notifications/unsubscribe-auth/AFXArZcsNWSedd5IBTrUoE2aknMliQ3Lks5uX-BygaJpZM4WZ--L .

DilumAluthge commented 6 years ago

Here’s the other option: we can do Missings.disallowmissing(s[ok]). I think the disallowmissing function will give us the correct type.

On Wed, Sep 5, 2018 at 13:20 Dilum Aluthge notifications@github.com wrote:

There’s a function Missings.T that gives you the T from a Union{T, Missing}: https://github.com/JuliaData/Missings.jl/blob/master/src/Missings.jl

But that would mean we’d have to add Missings.jl as a dependency.

It’s worth thinking about.

In the mean time, could you tag a new release with attobot?

On Wed, Sep 5, 2018 at 10:32 David van Leeuwen notifications@github.com wrote:

I thought about it, and I think the collect(skipmissing()) might not be that inefficient. It would be nice if we would have had access to the Array{T} part of the Array{Union{T, Missing}}---apparently, this is how it is represented in memory <https://julialang.org/blog/2018/06/missing . Now the generator loops over the Array{UInt8} part, and generates every element of the Array{T} part which is then collected.

If we just carry out the loop ourselves, we could probably save one level of array-building. We might as well just build the tar and nontar scores in that one loop...

For now I've just implemented the code above.

— You are receiving this because you authored the thread.

Reply to this email directly, view it on GitHub < https://github.com/davidavdav/ROCAnalysis.jl/pull/9#issuecomment-418752889 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AFXArZcsNWSedd5IBTrUoE2aknMliQ3Lks5uX-BygaJpZM4WZ--L

.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/davidavdav/ROCAnalysis.jl/pull/9#issuecomment-418811395, or mute the thread https://github.com/notifications/unsubscribe-auth/AFXAra83ys0B_IUJiSLauJlWaC00KS29ks5uYAf5gaJpZM4WZ--L .

davidavdav commented 6 years ago

Thanks for the links, this inspired me for the function I needed:

found(::Type{Union{T, Missing}}) where T = T

With this we could do a convert(Array{found(scores)}, scores), but of course, I don't know if the convert is any more efficient.

On Wed, Sep 5, 2018 at 7:23 PM Dilum Aluthge notifications@github.com wrote:

Here’s the other option: we can do Missings.disallowmissing(s[ok]). I think the disallowmissing function will give us the correct type.

On Wed, Sep 5, 2018 at 13:20 Dilum Aluthge notifications@github.com wrote:

There’s a function Missings.T that gives you the T from a Union{T, Missing}: https://github.com/JuliaData/Missings.jl/blob/master/src/Missings.jl

But that would mean we’d have to add Missings.jl as a dependency.

It’s worth thinking about.

In the mean time, could you tag a new release with attobot?

On Wed, Sep 5, 2018 at 10:32 David van Leeuwen notifications@github.com wrote:

I thought about it, and I think the collect(skipmissing()) might not be that inefficient. It would be nice if we would have had access to the Array{T} part of the Array{Union{T, Missing}}---apparently, this is how it is represented in memory <https://julialang.org/blog/2018/06/missing . Now the generator loops over the Array{UInt8} part, and generates every element of the Array{T} part which is then collected.

If we just carry out the loop ourselves, we could probably save one level of array-building. We might as well just build the tar and nontar scores in that one loop...

For now I've just implemented the code above.

— You are receiving this because you authored the thread.

Reply to this email directly, view it on GitHub < https://github.com/davidavdav/ROCAnalysis.jl/pull/9#issuecomment-418752889 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AFXArZcsNWSedd5IBTrUoE2aknMliQ3Lks5uX-BygaJpZM4WZ--L

.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/davidavdav/ROCAnalysis.jl/pull/9#issuecomment-418811395, or mute the thread https://github.com/notifications/unsubscribe-auth/AFXAra83ys0B_IUJiSLauJlWaC00KS29ks5uYAf5gaJpZM4WZ--L .

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub, or mute the thread.

-- David van Leeuwen