JuliaStats / StatsBase.jl

Basic statistics for Julia
Other
584 stars 191 forks source link

Bug or Feature: `fit(Histogram,...)` and DataFrames #266

Open 0kto opened 7 years ago

0kto commented 7 years ago

Hi all, I am relying quite a bit on Histograms in my scripts for my PhD, and I often calculate intensities of 2D detectors by supplying coordinates x,y and the signal intensity I as weight to the fit(Histogram,(x,y),weights(I). This is immideatly plottable via PyPlots plt.imshow(), etc

Since I get my data from DataFrames, a typical command looks like:

dataTuple = (df[:colX], df[:colY])
weights   = WeightVec(df_test[:I])

hist      = fit(Histogram,dataTuple,weights)

This stopped working, instead I have to convert my DataFrames:

dataTuple = convert(Tuple{Array{Float64,1},Array{Float64,1}},(df[:colX],df[:colY]))
weights   = WeightVec(convert(Array{Float64,1},df[:I]))

was that intentional?

ararslan commented 7 years ago

This was an intentional change introduced in #248. After a bit of discussion we decided that it doesn't make sense to have a weight vector that allows contains missing values, since the sum of the weights can would be missing. I think we were under the impression that it wasn't a common use case to have DataArrays passed to the WeightVec constructor, though I hadn't considered when data is coming from a DataFrame. You'll now need to convert any columns you had intended to be used for weighting into normal Vectors.

Do you have anything to add here, @nalimilan?

rofinn commented 7 years ago

@ararslan Would it make sense to add automatic conversion for DataArrays and I guess NullableArrays?

ararslan commented 7 years ago

We don't want this package to depend on DataArrays or NullableArrays, so the best we could do would be to add a conversion to Vector inside the constructor, which will fail if there are any null values.

andreasnoack commented 7 years ago

@ararslan I don't see why #248 would exclude the use of DataVectors. Am I missing something?

ararslan commented 7 years ago

Oh, no, you're right. I'm actually not sure why this is happening.

rofinn commented 7 years ago

@0kto Are there missing values in df_test[:I]?

julia> d = @data([1, 2, NA, 4])
4-element DataArrays.DataArray{Int64,1}:
 1
 2
  NA
 4

julia> weights(d)
ERROR: MethodError: no method matching StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}(::DataArrays.DataArray{Int64,1}, ::DataArrays.NAtype)
Closest candidates are:
  StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{V<:AbstractArray{T<:Real,1}}(::V<:AbstractArray{T<:Real,1}) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:63
  StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{S<:Real,V<:AbstractArray{T<:Real,1}}(::V<:AbstractArray{T<:Real,1}, ::S<:Real) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:63
  StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{T}(::Any) at sysimg.jl:53
 in weights(::DataArrays.DataArray{Int64,1}) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:71

vs

julia> d = @data([1, 2, 4])
3-element DataArrays.DataArray{Int64,1}:
 1
 2
 4

julia> weights(d)
3-element StatsBase.Weights{Int64,Int64,DataArrays.DataArray{Int64,1}}:
 1
 2
 4
0kto commented 7 years ago

Yes, the not all coordinates carry a signal

rofinn commented 7 years ago

Ahh, okay. Yeah, that behaviour is intentional, but I could be convinced otherwise.

ararslan commented 7 years ago

I maintain that it doesn't really make sense to allow the case where the sum of the weights is missing. What behavior would you expect that to exhibit?

Since DataArrays depends on StatsBase, perhaps we should add an explicit WeightVec constructor for DataArray input in that package, though I'm not sure what such a constructor should do exactly. I guess we could have it convert NAs to 0 weights.

0kto commented 7 years ago

I am home, so I can write a bit more detailed.

After reading your comments I identify two problems for my use case:

The first is that even for dataTuple a DataFrame column is not allowed. That means that fit(Histogram,dataTuple) will cause an error if dataTuple is constructed directly from DataFrames.

The second issue is the one with the nullable arrays as weights. I use fit(Histogram,...)because it is really fast. I had written some routines that would basically perform calculations for each pixel (I should rather say voxel, the coordinate space is 4D) like summation of intensities of several datasets, normalisation by monitorcounts, errors, etc. My implementation was rather slow. I suddenly realized that fit(Histogram,...) could do that much faster as it already implements optimized routines. Then, it is only a matter of matrix operations, and you could even start think about interactive plotting since everything is already neatly sorted in a matrix representation that is accepted by matplotlib's imshowor similar 2D maps.

0kto commented 7 years ago

Well, I guess that is a hack to get things to work :) not having to deal with NAs at that level makes things easier, but I guess I could convert everything from NA to 0, and apply a mask to the endresults.

rofinn commented 7 years ago

I guess we could have it convert NAs to 0 weights.

Seems like that should be a separate function though (e.g., WeightVec(replace!(df[:A], 0.0))). FWIW, we do enough of that stuff at my work that we have a package called Impute.jl... not sure if anyone else would find that useful though.

0kto commented 7 years ago

Actually purging every row with NA signal before handing everything to Histogram is the best option for my aims. In that case would you mind noting in the docs that weights must not be NAs? Thanks for you help in this matter!

nalimilan commented 7 years ago

We can't really mention NA in the docs since StatsBase does not depend on DataArrays at all. Hopefully we will soon have a common concept of missingness in Julia Base, which will also replace DataArrays, at which point it will be possible to explain this, and maybe throw an explicit error.

FWIW, replace! doesn't work on DataArray but you can do WeightVec(convert(Vector, df[:A], 0)).

Also see https://github.com/JuliaStats/DataArrays.jl/pull/249.