Open DrFlamingoo opened 7 years ago
This was an intentional change introduced in #248. After a bit of discussion we decided that it doesn't make sense to have a weight vector that allows contains missing values, since the sum of the weights can would be missing. I think we were under the impression that it wasn't a common use case to have DataArray
s passed to the WeightVec
constructor, though I hadn't considered when data is coming from a DataFrame
. You'll now need to convert any columns you had intended to be used for weighting into normal Vector
s.
Do you have anything to add here, @nalimilan?
@ararslan Would it make sense to add automatic conversion for DataArray
s and I guess NullableArray
s?
We don't want this package to depend on DataArrays or NullableArrays, so the best we could do would be to add a conversion to Vector
inside the constructor, which will fail if there are any null values.
@ararslan I don't see why #248 would exclude the use of DataVector
s. Am I missing something?
Oh, no, you're right. I'm actually not sure why this is happening.
@0kto Are there missing values in df_test[:I]
?
julia> d = @data([1, 2, NA, 4])
4-element DataArrays.DataArray{Int64,1}:
1
2
NA
4
julia> weights(d)
ERROR: MethodError: no method matching StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}(::DataArrays.DataArray{Int64,1}, ::DataArrays.NAtype)
Closest candidates are:
StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{V<:AbstractArray{T<:Real,1}}(::V<:AbstractArray{T<:Real,1}) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:63
StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{S<:Real,V<:AbstractArray{T<:Real,1}}(::V<:AbstractArray{T<:Real,1}, ::S<:Real) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:63
StatsBase.Weights{#7#S<:Real,#8#T<:Real,#9#V<:AbstractArray{T<:Real,1}}{T}(::Any) at sysimg.jl:53
in weights(::DataArrays.DataArray{Int64,1}) at /Users/rory/.julia/v0.5/StatsBase/src/weights.jl:71
vs
julia> d = @data([1, 2, 4])
3-element DataArrays.DataArray{Int64,1}:
1
2
4
julia> weights(d)
3-element StatsBase.Weights{Int64,Int64,DataArrays.DataArray{Int64,1}}:
1
2
4
Yes, the not all coordinates carry a signal
Ahh, okay. Yeah, that behaviour is intentional, but I could be convinced otherwise.
I maintain that it doesn't really make sense to allow the case where the sum of the weights is missing. What behavior would you expect that to exhibit?
Since DataArrays depends on StatsBase, perhaps we should add an explicit WeightVec
constructor for DataArray
input in that package, though I'm not sure what such a constructor should do exactly. I guess we could have it convert NA
s to 0 weights.
I am home, so I can write a bit more detailed.
After reading your comments I identify two problems for my use case:
The first is that even for dataTuple a DataFrame column is not allowed. That means that
fit(Histogram,dataTuple)
will cause an error if dataTuple is constructed directly from DataFrames.
The second issue is the one with the nullable arrays as weights.
I use fit(Histogram,...)
because it is really fast. I had written some routines that would basically perform calculations for each pixel (I should rather say voxel, the coordinate space is 4D) like summation of intensities of several datasets, normalisation by monitorcounts, errors, etc. My implementation was rather slow.
I suddenly realized that fit(Histogram,...)
could do that much faster as it already implements optimized routines. Then, it is only a matter of matrix operations, and you could even start think about interactive plotting since everything is already neatly sorted in a matrix representation that is accepted by matplotlib's imshow
or similar 2D maps.
Well, I guess that is a hack to get things to work :) not having to deal with NAs at that level makes things easier, but I guess I could convert everything from NA to 0, and apply a mask to the endresults.
I guess we could have it convert NAs to 0 weights.
Seems like that should be a separate function though (e.g., WeightVec(replace!(df[:A], 0.0))
). FWIW, we do enough of that stuff at my work that we have a package called Impute.jl... not sure if anyone else would find that useful though.
Actually purging every row with NA signal before handing everything to Histogram is the best option for my aims. In that case would you mind noting in the docs that weights must not be NAs? Thanks for you help in this matter!
We can't really mention NA
in the docs since StatsBase does not depend on DataArrays at all. Hopefully we will soon have a common concept of missingness in Julia Base, which will also replace DataArrays, at which point it will be possible to explain this, and maybe throw an explicit error.
FWIW, replace!
doesn't work on DataArray
but you can do WeightVec(convert(Vector, df[:A], 0))
.
Also see https://github.com/JuliaStats/DataArrays.jl/pull/249.
Hi all, I am relying quite a bit on Histograms in my scripts for my PhD, and I often calculate intensities of 2D detectors by supplying coordinates
x
,y
and the signal intensityI
as weight to thefit(Histogram,(x,y),weights(I)
. This is immideatly plottable via PyPlotsplt.imshow()
, etcSince I get my data from DataFrames, a typical command looks like:
This stopped working, instead I have to convert my DataFrames:
was that intentional?