JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.73k stars 367 forks source link

Enable DataFrames to handle CalendarTime type #98

Closed milktrader closed 11 years ago

milktrader commented 11 years ago

The CalendarTime type is defined as:


type CalendarTime
    millis::Float64
    tz
end

It's fairly new and hasn't made it to the package distribution system yet, but it's a clean and simple implementation.

The goal is to have a column in a DataFrame be a time type but DataFrames needs an Int64 where CalendarTime is a Float64.

julia> load("~/git/Calendar.jl/src/Calendar.jl")

julia> using Calendar

julia> df = DataFrame(quote
       time = now()
       end)
no method convert(Type{CalendarTime},Int64)
 in method_missing at base.jl:81
 in DataVec at /Users/Administrator/.julia/DataFrames/src/datavec.jl:37
 in assign at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:500
 in DataFrame at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:149
 in anonymous at no file:849
 in based_on at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:856
 in DataFrame at /Users/Administrator/.julia/DataFrames/src/dataframe.jl:28

Why does DataFrames want to convert CalendarTime to Int64?

This is further puzzling because DataFrames does accept a Float64.


julia> df = DataFrame(quote
       float = 3.14
       end)
DataFrame  (1,1)
        float
[1,]     3.14

julia> typeof(df[1])
DataVec{Float64}
johnmyleswhite commented 11 years ago

I think the macro you're using is probably a little broken for sophisticated types:

 df = DataFrame(quote
       time = now()
       end)

I think you might potentially have success if you did:

 df = DataFrame(quote
       time = DataVec([now()])
       end)

But we still need to change DataVec's to support ad hoc types. I'll make an RFC tonight that does away with the current naRule, replaceVal architecture. That will make it trivial to create ad hoc DataVec's.

johnmyleswhite commented 11 years ago

This should work now. Calendar.jl seems temporarily broken, but the machinery for this is now in place. Please re-open if it fails on your end.

milktrader commented 11 years ago

Awesome, thanks!

I'm putting in Array{UTF16String} into my "Date" column and it's working in REPL.

milktrader commented 11 years ago

This works:


julia> a = csvread("data/AAPL.csv")
7117x7 Any Array:
"Date"           "Open"     "High"     "Low"     "Close"   "Volume"     "Adj Close"
 "2012-11-23"  567.17     572.0      562.6     571.5       9.7438e6   571.5         
 "2012-11-21"  564.25     567.37     556.6     561.7       1.33215e7  561.7     

julia> b = convert(Array{UTF16String}, a[2:end,1])
7116-element UTF16String Array:
 "2012-11-23"
 "2012-11-21"
 "2012-11-20"

julia> c = map(x -> parse("yyyy-MM-dd", x), b)
7116-element CalendarTime Array:
 Nov 23, 2012 12:00:00 AM EST
 Nov 21, 2012 12:00:00 AM EST

julia> df = DataFrame(quote
         Date = $(c)
       end);

julia> head(df,2)
2x1 DataFrame:
                                Date
[1,]    Nov 23, 2012 12:00:00 AM EST
[2,]    Nov 21, 2012 12:00:00 AM EST

julia> typeof(df[1])
DataVec{CalendarTime}
HarlanH commented 11 years ago

That is just outstanding.

On Sat, Dec 15, 2012 at 1:29 AM, milktrader notifications@github.comwrote:

This works:

julia> a = csvread("data/AAPL.csv")7117x7 Any Array:"Date" "Open" "High" "Low" "Close" "Volume" "Adj Close" "2012-11-23" 567.17 572.0 562.6 571.5 9.7438e6 571.5 "2012-11-21" 564.25 567.37 556.6 561.7 1.33215e7 561.7 julia> b = convert(Array{UTF16String}, a[2:end,1])7116-element UTF16String Array: "2012-11-23" "2012-11-21" "2012-11-20" julia> c = map(x -> parse("yyyy-MM-dd", x), c)7116-element CalendarTime Array: Nov 23, 2012 12:00:00 AM EST Nov 21, 2012 12:00:00 AM EST julia> df = DataFrame(quote Date = $(c) end); julia> head(df,2)2x1 DataFrame: Date[1,] Nov 23, 2012 12:00:00 AM EST[2,] Nov 21, 2012 12:00:00 AM EST julia> typeof(df[1])DataVec{CalendarTime}

— Reply to this email directly or view it on GitHubhttps://github.com/HarlanH/DataFrames.jl/issues/98#issuecomment-11400592.

johnmyleswhite commented 11 years ago

Does read_table not work for your file? I'm always trying to find things that break the reader, so please let me know if it doesn't.

milktrader commented 11 years ago
julia> foo = read_table("data/AAPL.csv");

julia> head(foo, 2)
2x7 DataFrame:
                Date   Open   High   Low Close   Volume Adj Close
[1,]    "2012-11-23" 567.17  572.0 562.6 571.5  9743800     571.5
[2,]    "2012-11-21" 564.25 567.37 556.6 561.7 13321500     561.7

julia> typeof(foo[:,1])
DataVec{ASCIIString}

julia> typeof(foo[:,2])
DataVec{Float64}

julia> map(x -> parse("yyyy-MM-dd", x),  foo[:,1])
no method map(Function,DataVec{ASCIIString})

julia> bar = map(x -> parse("yyyy-MM-dd", x), convert(Array{UTF16String}, foo[:,1]))
7116-element CalendarTime Array:
 Nov 23, 2012 12:00:00 AM EST
 Nov 21, 2012 12:00:00 AM EST

I should mention that the following requires DataFrames (duh) and UTF16.

This is really nice to be able to read it in this way. I now can throw away my functions that were doing this the hard way. I'm not sure how to convert ASCIIString from the original foo DataFrame into my CalendarTime type in-place though.

johnmyleswhite commented 11 years ago

Ah, we're missing map for DataVec's. I'll add that now. For now you could use:

DataVec(map(x -> parse("yyyy-MM-dd", x), vector(foo[:, 1])))

The use of vector to convert to Vector{ASCIIString} works because you (hopefully) don't have missing data. We need to do more on converting types in place within a DataVec. For time types we're going to need the API to settle down a bit before we'll have reliable support. My end goal is to have this just be:

foo[:, 1] = dvdatetime(foo[:, 1], "yyyy-MM-dd")
milktrader commented 11 years ago

That's great.

The read_table is a very elegant replacement for my hard-coded method:

stock = DataFrame(quote
      Date = $(cal[:])
      Op  = $(sym[:,1])
      Hi  = $(sym[:,2])
      Lo  = $(sym[:,3])
      Cl  = $(sym[:,4])
      Vol = $(sym[:,5])
      Adj = $(sym[:,6])
 end);

Which was obviously a hack to see if things work, but now I'm happy to say goodbye. Not all financial data comes in the Yahoo format (shocking I know).

johnmyleswhite commented 11 years ago

Glad you like it. read_table is still going to need a few iterations to make it faster and more powerful.

I've just pushed an update to DataFrames that does a basic map. For your example you'll still need to do something extra, because the map produces Any's.

FYI: This update changes read_table to use UTF8 instead of ASCII by default.

milktrader commented 11 years ago

Nice. Yeah I see map in my future. Calculating returns on a Close column for instance.