ExpandingMan closed this issue 6 years ago
I completely agree with this. It would be fantastic to have an Arrow.jl package that doesn't depend on either DataStreams.jl or TableTraits.jl and just provides the underlying types and low-level methods. Feather.jl could then hold the DataStreams.jl implementation and FeatherFiles.jl the TableTraits.jl implementation, with both using the underlying Arrow.jl package.
That's the plan.
You can see my progress here. So far I have partial implementations of primitive arrays and lists. My approach is to make these AbstractVectors with all the methods thereof.
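To illustrate the AbstractVector approach for primitive arrays, here is a minimal sketch. The name PrimitiveView and its fields are hypothetical and are not taken from Arrow.jl. It also assumes a little-endian host, a bits-type element, and no null bitmap.

```julia
# Hypothetical type for illustration only; not the actual Arrow.jl implementation.
# It exposes a region of a raw byte buffer as a read-only vector of T.
struct PrimitiveView{T} <: AbstractVector{T}
    data::Vector{UInt8}   # raw buffer, e.g. the bytes of a file
    offset::Int           # zero-based byte offset of the first element
    len::Int              # number of elements of type T
end

# The two methods the AbstractArray interface requires for a read-only vector.
Base.size(v::PrimitiveView) = (v.len,)

function Base.getindex(v::PrimitiveView{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    start = v.offset + (i - 1) * sizeof(T) + 1
    # Reinterpret the element's bytes in place (assumes T is a bits type
    # and that the buffer uses the host's byte order).
    return reinterpret(T, view(v.data, start:start + sizeof(T) - 1))[1]
end

# Usage: build a byte buffer from known values and read it back through the view.
bytes = collect(reinterpret(UInt8, Int32[10, 20, 30]))
v = PrimitiveView{Int32}(bytes, 0, 3)
v[2]      # 20
sum(v)    # 60
```

Because the type subtypes AbstractVector, generic functions such as sum, iteration, and collect work without further code once size and getindex are defined.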
One thing that I am very confused about is metadata and to what extent it should be standardized. The Feather metadata format seems to have nothing to do with the metadata format that they talked about in Arrow. Is metadata entirely up to the implementation? If so, it limits how much can be implemented directly in Arrow.jl. Right now I do not know whether Arrow.jl should contain any code for handling metadata at all.
Right now I only have methods for reading data. Once I am sure that things are basically working (that I can read feather files) I will work on write methods.
For those interested, I'm now mostly done with the read side of the implementation. If you want you can check out Arrow.jl and my arrow1 branch of Feather.jl and read in Feather files (DataStreams not implemented yet, but you can create dataframes that are basically just views of the data).
I haven't implemented categorical data, so next I have to do that and writing (writing actually should be pretty easy at this point).
I've now completed essentially all of the read side for both Arrow and Feather (except for Arrow structs and the DataStreams interface). Right now everything is read from the data directly in the correct binary format. I'd like to make it easy to do automatic conversions, but I'm still thinking about how to do that (this is important for datetime, for example, in which the underlying binary format in Arrow is different than in Julia).
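As one possible shape for the datetime conversion mentioned above, here is a rough sketch. It assumes the stored values are Int64 counts of milliseconds since the Unix epoch, which is only one of the units a file may use. The names arrow_to_datetime, datetime_to_arrow, and TimestampView are hypothetical and not part of Arrow.jl or Feather.jl.

```julia
using Dates

# Assumption: timestamps are stored as milliseconds since 1970-01-01T00:00:00.
const UNIX_EPOCH = DateTime(1970, 1, 1)

# Hypothetical helpers converting between the stored integer and a Julia DateTime.
arrow_to_datetime(ms::Integer) = UNIX_EPOCH + Millisecond(ms)
datetime_to_arrow(dt::DateTime) = Dates.value(dt - UNIX_EPOCH)

# Hypothetical lazy wrapper: the raw integers stay in their binary form
# and are converted only when an element is accessed.
struct TimestampView{A<:AbstractVector{Int64}} <: AbstractVector{DateTime}
    raw::A
end

Base.size(v::TimestampView) = size(v.raw)
Base.getindex(v::TimestampView, i::Int) = arrow_to_datetime(v.raw[i])

# Usage: wrap raw millisecond counts and index them as DateTime values.
ts = TimestampView(Int64[0, 86_400_000])
ts[2]    # 1970-01-02T00:00:00
```

Keeping the conversion in a wrapper like this leaves the underlying data untouched, so the same buffer can still be read in its raw form when no conversion is wanted.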
Next I'm going to implement writing in both Arrow and Feather.
See #78.
Now merged to master.
This issue is for tracking progress related to re-organizing Feather.jl to rely on an underlying "Arrow" object. Ultimately we will split off the implementation of Arrow into a separate Arrow.jl module.
Initially, the goal will be to create an ArrowBuffer (naming tentative) struct which implements the Arrow design specification. With this done, Feather.jl will essentially contain a DataStreams Source and Sink for reading from and writing to files.