JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Random generation of NA values #80

Closed milktrader closed 11 years ago

milktrader commented 11 years ago

julia> C = DataFrame(quote

   open  = float64(B[:,2])
   high  = float64(B[:,3])
   low   = float64(B[:,4])
   close = float64(B[:,5])
   end);

julia> head(C)
DataFrame (6,4)
        open   high     low   close
[1,]  567.17  572.0   562.6   571.5
[2,]      NA     NA   556.6   561.7
[3,]      NA     NA      NA  560.91
[4,]      NA     NA  539.88      NA
[5,]   525.2  530.0  505.75  527.68
[6,]  537.53  539.5  522.62  525.62

julia> C = DataFrame(quote

   open  = float64(B[:,2])
   high  = float64(B[:,3])
   low   = float64(B[:,4])
   close = float64(B[:,5])
   end);

julia> head(C)
DataFrame (6,4)
        open    high     low   close
[1,]  567.17   572.0   562.6   571.5
[2,]  564.25  567.37   556.6   561.7
[3,]  571.91      NA  554.58  560.91
[4,]  540.71      NA  539.88  565.73
[5,]   525.2      NA  505.75  527.68
[6,]  537.53   539.5      NA  525.62

So overwriting the DataFrame generates a different set of NAs. And there are no NAs in B.

julia> sum(float64(B[:,2:5]))
2.4572281399999964e6

Some more info about B:

julia> size(B)
(7116,7)

julia> typeof(B)
Array{Any,2}

johnmyleswhite commented 11 years ago

I've seen this as well, but had trouble producing a reliable example to debug. It would be great if you could provide an example that, with probability p, produces an NA where none belongs.

milktrader commented 11 years ago

Do you mean something like

j = ones(100)
J = DataFrame(quote foo = j end);

and then sum(na.omit(J)) #no na.omit na.rm functions though

BTW, I converted the type from Any to Float64 outside of the DataFrame call and got the same result, so that's a dead end.

johnmyleswhite commented 11 years ago

Yup, that's exactly what I meant. I'm pretty sure there's a serious NA-inducing bug in our code somewhere. I used to think it was in based_on, but now I get this bug even when running something like

df = dfones(100, 1); any(isna(df))

I suspect we're claiming dirty memory for the missingness metadata and then using it without initialization.

milktrader commented 11 years ago

I ran that above code three times and got 1, 25 and 14 NAs.

The only way I could find them was with summary(J). I don't have the functions nareplace() or dfones() for that matter.

johnmyleswhite commented 11 years ago

Yeah, it's totally random just like dirty memory would be.

I just updated METADATA.jl to reflect the changes in DataFrames.jl. That will give you dfones().

For finding these bugs, the following function will help:

function isna(df::DataFrame)
    results = Array(Bool, size(df))
    for i in 1:nrow(df)
        for j in 1:ncol(df)
            results[i, j] = isna(df[i, j])
        end
    end
    return results
end

johnmyleswhite commented 11 years ago

Still work to do, but I suspect the culprit is line 243 of datavec.jl:

DataVec(x::Vector) = DataVec(x, BitArray(length(x)))

Replacing it with the following seems to fix things:

function DataVec(x::Vector)
    n = length(x)
    is_missing = BitArray(n)
    for i in 1:n
        is_missing[i] = false
    end
    DataVec(x, is_missing)
end

HarlanH commented 11 years ago

Oh, if BitArray(n) isn't returning initialized values, that'll do it.

Ah, yes, that's the case:

julia> any(BitArray(100))
false

julia> any(BitArray(100000))
true

Use this instead:

julia> any(bitfalses(100000))
false

On Wed, Nov 28, 2012 at 3:35 PM, John Myles White notifications@github.com wrote:

Still work to do, but I suspect the culprit is line 243 of datavec.jl:

DataVec(x::Vector) = DataVec(x, BitArray(length(x)))

Replacing it with the following seems to fix things:

function DataVec(x::Vector)
    n = length(x)
    is_missing = BitArray(n)
    for i in 1:n
        is_missing[i] = false
    end
    DataVec(x, is_missing)
end

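For anyone reading this on a current Julia release, the same distinction can be sketched roughly as below. The syntax here is an assumption about modern Julia (1.x), not the 2012 code discussed in this thread: `BitArray(undef, n)` is today's spelling of the uninitialized constructor, and `falses(n)` plays the role that `bitfalses(n)` played then.

```julia
# Sketch for a current Julia release (1.x syntax, which differs from the 2012 syntax above).
n = 100_000

# Uninitialized: the bits are whatever happened to be in memory.
# any(dirty) may be true or false from run to run, so no result is shown for it.
dirty = BitArray(undef, n)

# Explicitly zeroed: every bit is guaranteed to be false.
clean = falses(n)

println(any(clean))    # false
println(count(clean))  # 0
```

The point is only that a missingness mask should come from the zeroed constructor; the uninitialized one is safe only when every bit is written before it is read.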

johnmyleswhite commented 11 years ago

Yup. It's fixed. I'll push a change and a new test in a second.
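A regression test for this kind of bug might look something like the sketch below, written in current Julia syntax. `MaskedVec` and `no_spurious_na` are hypothetical names made up for illustration; this is not the DataVec code or the test that was actually pushed.

```julia
# Hypothetical stand-in for DataVec; MaskedVec and no_spurious_na are illustrative names only.
struct MaskedVec{T}
    data::Vector{T}
    na::BitVector   # true marks a missing entry
end

# Default constructor: nothing is missing, so start from an all-false mask.
MaskedVec(x::Vector{T}) where {T} = MaskedVec{T}(x, falses(length(x)))

# Regression check: building from complete data must never produce a missing entry.
# Repeating the construction guards against a bug that only shows up some of the time.
function no_spurious_na(n::Int, trials::Int)
    for _ in 1:trials
        v = MaskedVec(ones(n))
        any(v.na) && return false
    end
    return true
end

println(no_spurious_na(100_000, 20))  # true
```

A large `n` matters here: as the BitArray(100) versus BitArray(100000) comparison earlier in the thread shows, small allocations can happen to come back clean.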

milktrader commented 11 years ago

I put the function into datavec.jl and it works with the j = ones(100) example. And I don't see any NAs floating around my 7,000 row example either. Cool.

milktrader commented 11 years ago

so what is the workflow for getting the new datavec.jl? Pkg.update() ?

johnmyleswhite commented 11 years ago

That will be the workflow once I get home and can update METADATA.jl.

johnmyleswhite commented 11 years ago

Fixed and updated METADATA.jl as well. Pkg.update should work, although it may freak out since you've edited your own copy of the package.

milktrader commented 11 years ago

julia> Pkg.update()
Already up-to-date.
remote: Counting objects: 17, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 12 (delta 2), reused 11 (delta 1)
Unpacking objects: 100% (12/12), done.
From git://github.com/JuliaLang/METADATA.jl
   f758286..c1486f2  master     -> origin/master
Updating f758286..c1486f2
Fast-forward
 DataFrames/versions/0.0.0/sha1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Already up-to-date.

Every time I see Fast-forward I cringe in pain, but all appears well

johnmyleswhite commented 11 years ago

I think we're all doing a lot of things in Git that are not kosher. Definitely our squashing histories and then doing git push -f origin master makes GitHub docs unhappy.

milktrader commented 11 years ago

trying to convert the ecdat garch dataset time column (number 2) into a formatted time string.

First impulse is to vectorize ...

julia> garch[:,2] = strftime("%Y-%m-%d",garch[:,2])
no method assign(DataVec{Int64},ASCIIString,Range1{Int64})
 in method_missing at base.jl:79
 in assign at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:511

Okay, fine. Let's loop it then because we has fastness.

julia> for i in 1:size(garch, 1)
           garch[i,2] = strftime("%Y-%m-%d", garch[i,2])
       end
no method assign(DataVec{Int64},ASCIIString,Int64)
 in method_missing at base.jl:79
 in assign at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:511
 in anonymous at no file:2

wat?

johnmyleswhite commented 11 years ago

Our assignment system is kind of bonkers at the moment. I've been doing some work on it, but it'll take at least a week to get it clean. The problem is that we've usually defined assign using things like:

# assign variants
# x[3] = "cat"
function assign{T}(x::DataVec{T}, v::T, i::Int)
    x.data[i] = v
    x.na[i] = false
    return x[i]
end

Really, we want something more like:

# assign variants
# x[3] = "cat"
function assign{S, T}(x::DataVec{S}, v::T, i::Int)
    x.data[i] = v
    x.na[i] = false
    return x[i]
end

Basically we need to not get in the way of Julia's already bad-ass promotion system. Doing this right is hard, but essential. Thanks for pointing this one out! Glad you're stumbling on the same things as I am. Can you open a new issue for this?
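In current Julia the idea of accepting any value type and deferring to conversion could be sketched as below. `TypedVec` is a hypothetical container invented for this example, and `setindex!` is the modern name for what this thread calls `assign`; none of this is the DataFrames.jl implementation.

```julia
# Hypothetical container, not the DataFrames.jl implementation.
struct TypedVec{S}
    data::Vector{S}
    na::BitVector   # true marks a missing entry
end
TypedVec(x::Vector{S}) where {S} = TypedVec{S}(x, falses(length(x)))

# Accept any value type and let convert decide whether it fits the element type S.
function Base.setindex!(x::TypedVec{S}, v, i::Integer) where {S}
    x.data[i] = convert(S, v)   # throws if v cannot be converted to S
    x.na[i] = false
    return x
end

Base.getindex(x::TypedVec, i::Integer) = x.na[i] ? missing : x.data[i]

col = TypedVec([1.0, 2.0, 3.0])
col[2] = 7              # the Int is converted to Float64
println(col[2])         # 7.0

ints = TypedVec([1, 2, 3])
try
    ints[1] = "2012-11-28"   # a String cannot be converted to Int64
catch err
    println(typeof(err))     # MethodError
end
```

Compatible values (an Int into a Float64 column) go through, while a string into an Int64 column still fails, which is the case raised in the next comment.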

milktrader commented 11 years ago

yes

johnmyleswhite commented 11 years ago

Actually, looking at your example, are you trying to assign a string value to an Int64 column? Because that's definitely something even pure Julia is going to be pissed about.

HarlanH commented 11 years ago

Oh, yeah. You can't do an element-by-element replace of a DV with a different type. I can't remember whether you can replace an entire column right now with a different type. You should be able to in theory, but I don't know if it's implemented. You might have to do something like:

garch[:,2] = DataVec([strftime("%Y-%m-%d",x) for x in garch[:,2]])

And I'm not even sure if that works, although if it doesn't, we should find a way to make it, or something like it, work.

johnmyleswhite commented 11 years ago

Agreed: we need to find a way to make that kind of whole-column reassignment work.
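As a rough illustration of whole-column reassignment, here is a sketch in current Julia using a plain `Dict` of named columns as a hypothetical stand-in for a data frame. It swaps in `Dates.format` from the standard library instead of the `strftime` call used above, and it assumes the time column holds Unix seconds, which may not match the actual garch data.

```julia
using Dates

# Hypothetical stand-in for a data frame: a set of named columns.
# Each column may have its own element type.
table = Dict{Symbol,Vector}(
    :date  => [0, 86_400, 172_800],   # assumed to be Unix seconds, for illustration only
    :value => [1.5, 2.5, 3.5],
)

# Build a new String column from the Int64 column, then swap the whole column in.
formatted = [Dates.format(unix2datetime(t), "yyyy-mm-dd") for t in table[:date]]
table[:date] = formatted

println(eltype(table[:date]))  # String
println(table[:date])          # ["1970-01-01", "1970-01-02", "1970-01-03"]
```

Replacing the column as a unit avoids the element-by-element problem: no String ever has to fit into an Int64 slot.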