Closed milktrader closed 11 years ago
I've seen this as well, but had trouble producing a reliable example to debug. It would be great if you could provide an example that, with probability p, produces an NA where none belongs.
Do you mean something like
j = ones(100)
J = DataFrame(quote foo = j end);
and then
sum(na.omit(J)) # there are no na.omit or na.rm functions, though
BTW, I converted the type from Any to Float64 outside of the DataFrame call and got the same result, so that's a dead end.
Yup, that's exactly what I meant. I'm pretty sure there's a serious NA-inducing bug in our code somewhere. I used to think it was in based_on, but now I get this bug even when running something like
df = dfones(100, 1); any(isna(df))
I suspect we're claiming dirty memory for the missingness metadata and then using it without initialization.
I ran that above code three times and got 1, 25 and 14 NAs.
The only way I could find them was with summary(J). I don't have the functions nareplace() or dfones() for that matter.
Yeah, it's totally random just like dirty memory would be.
I just updated METADATA.jl to reflect the changes in DataFrames.jl. That will give you dfones().
For finding these bugs, the following function will help:
function isna(df::DataFrame)
    results = Array(Bool, size(df))
    for i in 1:nrow(df)
        for j in 1:ncol(df)
            results[i, j] = isna(df[i, j])
        end
    end
    return results
end
Still work to do, but I suspect the culprit is line 243 of datavec.jl:
DataVec(x::Vector) = DataVec(x, BitArray(length(x)))
Replacing it with the following seems to fix things:
function DataVec(x::Vector)
    n = length(x)
    is_missing = BitArray(n)
    for i in 1:n
        is_missing[i] = false
    end
    DataVec(x, is_missing)
end
Oh, if BitArray(n) isn't returning initialized values, that'll do it.
Ah, yes, that's the case:
julia> any(BitArray(100))
false
julia> any(BitArray(100000))
true
Use this instead:
julia> any(bitfalses(100000))
false
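For anyone reading this on a current Julia, here is a minimal sketch of the same distinction. It assumes Julia 1.x, where the uninitialized constructor is spelled BitArray(undef, n) and the zero-filled one is falses(n); those are not the names used in this thread (BitArray(n) and bitfalses(n)).

```julia
# Assumes Julia 1.x spellings, not the 2012 API discussed in this thread.
n = 100_000

# Uninitialized: the bits are whatever happened to be in memory,
# so any(uninit) may come back true or false from run to run.
uninit = BitArray(undef, n)

# Initialized: every bit is guaranteed to be false.
clean = falses(n)

println(any(clean))    # false
println(count(clean))  # 0
```

The point is the same as above: a missingness mask has to be built from the initialized form, otherwise stray true bits show up as NAs.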
Yup. It's fixed. I'll push a change and a new test in a second.
I put the function into datavec.jl and it works with the j = ones(100) example. And I don't see any NAs floating around my 7,000 row example either. Cool.
So what is the workflow for getting the new datavec.jl? Pkg.update()?
That will be the workflow once I get home and can update METADATA.jl.
Fixed and updated METADATA.jl as well. Pkg.update should work, although it may freak out since you've edited your own copy of the package.
julia> Pkg.update()
Already up-to-date.
remote: Counting objects: 17, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 12 (delta 2), reused 11 (delta 1)
Unpacking objects: 100% (12/12), done.
From git://github.com/JuliaLang/METADATA.jl
   f758286..c1486f2  master -> origin/master
Updating f758286..c1486f2
Fast-forward
 DataFrames/versions/0.0.0/sha1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Already up-to-date.
Every time I see Fast-forward I cringe in pain, but all appears well.
I think we're all doing a lot of things in Git that are not kosher. Definitely our squashing histories and then doing git push -f origin master makes GitHub docs unhappy.
I'm trying to convert the ecdat garch dataset's time column (number 2) into a formatted time string.
First impulse is to vectorize ...
julia> garch[:,2] = strftime("%Y-%m-%d",garch[:,2])
no method assign(DataVec{Int64},ASCIIString,Range1{Int64})
 in method_missing at base.jl:79
 in assign at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:511
Okay, fine. Let's loop it then because we has fastness.
julia> for i in 1:size(garch, 1)
           garch[i,2] = strftime("%Y-%m-%d", garch[i,2])
       end
no method assign(DataVec{Int64},ASCIIString,Int64)
 in method_missing at base.jl:79
 in assign at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:511
 in anonymous at no file:2
wat?
Our assignment system is kind of bonkers at the moment. I've been doing some work on it, but it'll take at least a week to get it clean. The problem is that we've usually defined assign with things like:
# assign variants
# x[3] = "cat"
function assign{T}(x::DataVec{T}, v::T, i::Int)
    x.data[i] = v
    x.na[i] = false
    return x[i]
end
Really, we want something more like:
# assign variants
# x[3] = "cat"
function assign{S, T}(x::DataVec{S}, v::T, i::Int)
    x.data[i] = v
    x.na[i] = false
    return x[i]
end
Basically we need to not get in the way of Julia's already bad-ass promotion system. Doing this right is hard, but essential. Thanks for pointing this one out! Glad you're stumbling on the same things as I am. Can you open a new issue for this?
yes
Actually, looking at your example, are you trying to assign a string value to an Int64 column? Because that's definitely something even pure Julia is going to be pissed about.
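To make the two cases concrete, here is a rough sketch in current Julia (1.x) syntax. MiniDataVec is a hypothetical stand-in made up for illustration, not the DataVec type from this package, and setindex! is the modern name for what this thread calls assign. It only shows the general idea: accept any value type, let convert decide whether it fits, and let a string going into an Int column fail.

```julia
# Hypothetical stand-in for DataVec; not the package's actual type.
struct MiniDataVec{T}
    data::Vector{T}
    na::BitVector
end

# Start with nothing marked missing; falses(n) is fully initialized.
MiniDataVec(x::Vector{T}) where {T} = MiniDataVec{T}(x, falses(length(x)))

# Accept a value of any type and let convert decide whether it fits in T.
function Base.setindex!(x::MiniDataVec{T}, v, i::Integer) where {T}
    x.data[i] = convert(T, v)  # throws if v cannot be represented as T
    x.na[i] = false
    return x
end

dv = MiniDataVec([10, 20, 30])
dv[2] = 5.0                    # Float64 5.0 converts cleanly to Int 5

# A string has no conversion to Int, so this assignment fails.
err = try
    dv[3] = "cat"
    nothing
catch e
    e
end
println(dv.data)
println(typeof(err))
```

Under these assumptions the numeric assignment goes through and the string assignment raises an error, which is the behaviour described above.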
Oh, yeah. You can't do an element-by-element replace of a DV with a different type. I can't remember whether you can replace an entire column right now with a different type. You should be able to in theory, but I don't know if it's implemented. You might have to do something like:
garch[:,2] = DataVec([strftime("%Y-%m-%d",x) for x in garch[:,2]])
And I'm not even sure if that works, although if it doesn't, we should find a way to make it, or something like it, work.
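Leaving DataFrames aside, the "build a new column, then swap it in" half of that idea can be sketched on a plain vector. This assumes current Julia (1.x) with the Dates standard library in place of strftime, and the sample values are hypothetical seconds since the Unix epoch, not values from the garch dataset.

```julia
using Dates

# Hypothetical sample values, assumed to be seconds since the Unix epoch.
secs = [0, 86_400]

# Build a brand-new vector of strings; the Int vector itself is untouched.
formatted = [Dates.format(unix2datetime(t), "yyyy-mm-dd") for t in secs]

println(formatted)
println(eltype(formatted))
```

The result has a different element type from the input, which is why it has to replace the column as a whole and cannot be written back element by element.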
Agreed: we need to find a way to make that kind of whole-column reassignment work.
julia> C = DataFrame(quote
julia> head(C)
DataFrame (6,4)
        open   high     low   close
[1,]  567.17  572.0   562.6   571.5
[2,]      NA     NA   556.6   561.7
[3,]      NA     NA      NA  560.91
[4,]      NA     NA  539.88      NA
[5,]   525.2  530.0  505.75  527.68
[6,]  537.53  539.5  522.62  525.62
julia> C = DataFrame(quote
julia> head(C)
DataFrame (6,4)
        open    high     low   close
[1,]  567.17   572.0   562.6   571.5
[2,]  564.25  567.37   556.6   561.7
[3,]  571.91      NA  554.58  560.91
[4,]  540.71      NA  539.88  565.73
[5,]   525.2      NA  505.75  527.68
[6,]  537.53   539.5      NA  525.62
So overwriting the DataFrame generates a different set of NAs. And there are no NAs in B.
julia> sum(float64(B[:,2:5]))
2.4572281399999964e6
Some more info about B:
julia> size(B)
(7116,7)
julia> typeof(B)
Array{Any,2}