JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

code abstraction for Arrow #72

Closed ExpandingMan closed 6 years ago

ExpandingMan commented 6 years ago

This issue is for tracking progress related to re-organizing Feather.jl to rely on an underlying "Arrow" object. Ultimately we will split off the implementation of Arrow into a separate Arrow.jl module.

Initially, the goal will be to create an ArrowBuffer (naming tentative) struct which implements the Arrow design specification. With this done, Feather.jl will essentially contain only the DataStreams Source and Sink for reading from and writing to files.

davidanthoff commented 6 years ago

I completely agree with this. It would be fantastic to have an Arrow.jl package that doesn't depend on either DataStreams.jl or TableTraits.jl, but just provides the underlying types and low-level methods. Feather.jl could then hold the DataStreams.jl implementation and FeatherFiles.jl the TableTraits.jl implementation, with both using the underlying Arrow.jl package.

ExpandingMan commented 6 years ago

That's the plan.

ExpandingMan commented 6 years ago

You can see my progress here. So far I have partial implementations of primitive arrays and lists. My approach is to make these AbstractVectors with all the methods thereof.
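To illustrate the AbstractVector approach, here is a minimal sketch of a primitive array viewed over a raw byte buffer. It is not the actual Arrow.jl code: the name `PrimitiveView` and its fields are hypothetical, and it ignores null bitmaps and assumes the host byte order matches the buffer's (Arrow data is normally little-endian).

```julia
# Hypothetical sketch: `PrimitiveView` is an illustrative name, not an Arrow.jl type.
struct PrimitiveView{T} <: AbstractVector{T}
    data::Vector{UInt8}   # raw buffer, e.g. the bytes of a memory-mapped file
    offset::Int           # zero-based byte offset of the first element
    len::Int              # number of elements of type T
end

# The minimal AbstractArray interface: size, index style, getindex.
Base.size(v::PrimitiveView) = (v.len,)
Base.IndexStyle(::Type{<:PrimitiveView}) = IndexLinear()

function Base.getindex(v::PrimitiveView{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    start = v.offset + (i - 1) * sizeof(T) + 1        # one-based index of first byte
    bytes = view(v.data, start:(start + sizeof(T) - 1))
    return reinterpret(T, bytes)[1]                    # decode in host byte order
end

# Usage: place three Float64 values behind a 4-byte prefix and read them back.
values = Float64[1.5, 2.5, 3.5]
buffer = vcat(zeros(UInt8, 4), collect(reinterpret(UInt8, values)))
v = PrimitiveView{Float64}(buffer, 4, 3)
total = sum(v)   # generic AbstractVector methods work without extra code
```

Because only `size` and `getindex` are defined, everything else (`sum`, `collect`, iteration, slicing) comes from Julia's generic array fallbacks, which is the appeal of subtyping `AbstractVector`.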

One thing that I am very confused about is metadata and to what extent it should be standardized. The Feather metadata format seems to have nothing to do with the metadata format described for Arrow. Is metadata entirely up to the implementation? If so, it limits how much can be implemented directly in Arrow.jl. Right now I do not know whether Arrow.jl should contain any code for handling metadata at all.

Right now I only have methods for reading data. Once I am sure that things are basically working (that I can read feather files) I will work on write methods.

ExpandingMan commented 6 years ago

For those interested, I'm now mostly done with the read side of the implementation. If you want you can check out Arrow.jl and my arrow1 branch of Feather.jl and read in Feather files (DataStreams not implemented yet, but you can create dataframes that are basically just views of the data).

I haven't implemented categorical data, so next I have to do that and writing (writing actually should be pretty easy at this point).

ExpandingMan commented 6 years ago

I've now completed essentially all of the read side for both Arrow and Feather (except for Arrow structs and the DataStreams interface). Right now everything is read directly from the data in its native binary format. I'd like to make automatic conversions easy, but I'm still thinking about how to do that (this is important for datetime, for example, where the underlying binary format in Arrow differs from the one Julia uses).
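One possible shape for such a conversion, sketched under assumptions: Arrow timestamps are 64-bit integer counts since the UNIX epoch, while Julia's `DateTime` uses a different epoch at millisecond precision. The wrapper below assumes the unit is milliseconds (other units would need scaling); `MillisTimestamps` and `to_arrow_millis` are hypothetical names, not part of Arrow.jl or Feather.jl.

```julia
using Dates

const UNIX_EPOCH = DateTime(1970, 1, 1)

# Hypothetical lazy wrapper: the raw Int64 values stay untouched and are
# converted to DateTime only when an element is accessed.
struct MillisTimestamps{A<:AbstractVector{Int64}} <: AbstractVector{DateTime}
    raw::A   # milliseconds since the UNIX epoch (assumed unit)
end

Base.size(v::MillisTimestamps) = size(v.raw)
Base.IndexStyle(::Type{<:MillisTimestamps}) = IndexLinear()
Base.getindex(v::MillisTimestamps, i::Int) = UNIX_EPOCH + Millisecond(v.raw[i])

# Inverse conversion for the write side: DateTime to milliseconds since the epoch.
to_arrow_millis(dt::DateTime) = Dates.value(dt - UNIX_EPOCH)

# Usage: 0 ms and one day's worth of milliseconds after the epoch.
ts = MillisTimestamps(Int64[0, 86_400_000])
first_day = ts[1]
second_day = ts[2]
```

Keeping the conversion in `getindex` leaves the underlying buffer in Arrow's format, so reading stays a view of the data rather than a copy.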

Next I'm going to implement writing in both Arrow and Feather.

ExpandingMan commented 6 years ago

See #78.

ExpandingMan commented 6 years ago

Now merged to master.