JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Added random access streamfrom field methods #54

Closed ExpandingMan closed 6 years ago

ExpandingMan commented 6 years ago

I have changed the Data.streamfrom methods for Data.Field to do random access on the memory-mapped data rather than pulling entire columns.

I'm still doing something clumsy with the Bools, as I'm not yet sure how to handle them properly.
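
For context on why Bools are awkward for random access: the Arrow layout packs Booleans one bit per value, least-significant bit first, so a single value cannot be loaded through a typed pointer. A minimal sketch of reading one value from such a buffer follows. The function name `getbool` is hypothetical and is not taken from Feather.jl; it only illustrates the bit arithmetic.

```julia
# Hypothetical helper, not part of Feather.jl: read the i-th (1-based) Boolean
# from a bit-packed buffer, assuming least-significant-bit-first ordering.
function getbool(buf::AbstractVector{UInt8}, i::Integer)
    byte, bit = divrem(i - 1, 8)   # zero-based byte index and bit position
    return ((buf[byte + 1] >> bit) & 0x01) == 0x01
end

packed = UInt8[0b00000101, 0b00000010]          # bits 1, 3 and 10 are set
flags = [getbool(packed, i) for i in 1:10]      # unpack the first ten values
```

Only the byte containing the requested bit is touched, so this fits the same per-field access pattern as the other column types.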

There is a fair amount of overhead involved with computing pointers for individual fields, but nevertheless this still seems pretty fast.
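
The per-field pointer computation described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: `loadvalue` is a hypothetical name, the 8-byte "header" and the single Float64 column are made up for the demo, and real Feather files locate columns through their metadata.

```julia
using Mmap

# Hypothetical helper: load the i-th (1-based) value of type T from a column
# that starts `offset` bytes into `buf`, without materializing the column.
function loadvalue(::Type{T}, buf::Vector{UInt8}, offset::Integer, i::Integer) where {T}
    stop = offset + i * sizeof(T)                     # last byte of the value
    1 <= i && stop <= length(buf) || throw(BoundsError(buf, stop))
    GC.@preserve buf begin
        # Pointer to the first byte of the value, reinterpreted as Ptr{T}.
        ptr = convert(Ptr{T}, pointer(buf, offset + (i - 1) * sizeof(T) + 1))
        unsafe_load(ptr)
    end
end

# Demo file: 8 filler bytes followed by one "column" of three Float64 values.
path, io = mktemp()
write(io, zeros(UInt8, 8))
write(io, Float64[1.5, 2.5, 3.5])
close(io)

buf = Mmap.mmap(path)                 # Vector{UInt8} backed by the file
x = loadvalue(Float64, buf, 8, 2)     # second value of the column
```

The pointer arithmetic is repeated on every call, which is the overhead mentioned above; in exchange, each call is independent and no streaming state has to be kept.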

quinnj commented 6 years ago

Looks like a great start! I'm trying to update other parts of the ecosystem at the moment, but so far this looks promising. Let me know if you run into any issues. Let's make sure to get some tests in for this new functionality.

ExpandingMan commented 6 years ago

Thanks. I have actually been using this regularly for quite a while now, so it's better tested than most of what I write.

I haven't really looked into what's going on with the Travis tests; it seems that some of the testing is still based on the old setup.

One thing I'm not quite sure of is whether this fits in with the DataStreams standard. I made the Data.streamfrom methods for Data.Field do random access by default (actually, at the moment that's all they do). This allows Data.streamfrom to be completely standalone: it doesn't keep track of any sort of ongoing streaming process. In my experience this is preferable just about 100% of the time, and last I checked DataFrames still worked that way.

ExpandingMan commented 6 years ago

I've also just added a bunch of fixes for some datetime-related stuff. Sorry, it really should have been in an independent branch, but worst case we can always just copy and paste the changes.

ExpandingMan commented 6 years ago

Hi @quinnj. I think that DataStreams is now in a state where I can redo this from scratch and we can support both random access and streaming entire columns. Would you be interested in merging if I take the time to write this against the up-to-date DataStreams implementation?

quinnj commented 6 years ago

hey @ExpandingMan, sorry for the slow response here. I think it's probably ok to move ahead here. My one hesitation is I'd like to do a proper separation between a true Arrow.jl implementation (a la https://arrow.apache.org/) and then have Feather.jl just be the disk-IO of that. I'm a tad swamped for at least 2-3 weeks though working on other things before I could get to that. If you'd like to go ahead here, I'm happy to review/merge; or if you'd like to wait (or help!) w/ separating Arrow.jl, that's fine too.

ExpandingMan commented 6 years ago

I'd be happy to help with splitting off Arrow, but I think I'd find that rather difficult as it's not entirely clear to me what belongs in the Arrow module as opposed to Feather itself.

The way I think this would work is that the Arrow module would deal with reading from and writing to a single Vector{UInt8} (or an abstraction thereof) with a metadata flatbuffer, and Feather would basically be left with doing file IO and the DataStreams implementation. Let me know your thoughts.

Since Feather has changed so much since I originally made this PR, I'm going to close it and open another one when I'm ready.

quinnj commented 6 years ago

I don't think Arrow.jl would need to do much, if any, IO. Arrow.jl would purely be an implementation of the Apache Arrow spec, which defines in-memory byte layouts for various structures. Part of Feather's issue right now is that we encode most of the layout transformations in the IO itself, instead of defining actual Arrow structures that conform to the layout. For example, we would define an Arrow.PrimitiveArray, Arrow.List, Arrow.Struct, etc. that mirror the objects described in https://github.com/apache/arrow/blob/master/format/Layout.md.
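
To make the idea concrete, here is a rough sketch of what a primitive-array type along these lines could look like. The type name `Primitive`, its fields, and the lack of null support are all assumptions made for illustration; no such definition existed in Arrow.jl or Feather.jl at the time of this thread.

```julia
# Hypothetical type: a non-nullable, fixed-width Arrow-style array that views
# `length` values of type T starting `offset` bytes into a shared byte buffer.
struct Primitive{T} <: AbstractVector{T}
    data::Vector{UInt8}
    offset::Int      # zero-based byte offset of the first value
    length::Int      # number of values
end

Base.size(p::Primitive) = (p.length,)
Base.IndexStyle(::Type{<:Primitive}) = IndexLinear()

function Base.getindex(p::Primitive{T}, i::Int) where {T}
    @boundscheck checkbounds(p, i)
    start = p.offset + (i - 1) * sizeof(T) + 1        # 1-based index of first byte
    # Copy the bytes of this one value and reinterpret them as a T.
    return reinterpret(T, p.data[start:start + sizeof(T) - 1])[1]
end

# Demo buffer: 8 filler bytes, then three Int32 values.
bytes = vcat(zeros(UInt8, 8), collect(reinterpret(UInt8, Int32[10, 20, 30])))
col = Primitive{Int32}(bytes, 8, 3)
second = col[2]
total = sum(col)
```

Because the type subtypes `AbstractVector`, generic code such as `sum` and `collect` works on it directly, and the layout logic lives in one place rather than in the IO routines.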

The arrow spec does define a Message type and has some notes on IPC IO using the messages, but I haven't dug into that. In any case, the arrow spec itself doesn't say much (as far as I know and for the moment) about actual disk IO, which is where Feather comes in. So Feather should really just be a matter of defining read/write methods on Arrow structures.
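
As a rough illustration of that division of labor, a write method on the Feather side could reduce to dumping a column's bytes and padding them. The sketch below is hypothetical: `writepadded` is an invented name, and the 8-byte alignment is an assumption about the on-disk format rather than something stated in this thread.

```julia
# Hypothetical helper: write the raw values of `col` to `io`, then zero-pad the
# output to a multiple of `alignment` bytes. Returns the number of bytes written.
function writepadded(io::IO, col::AbstractVector{T}; alignment::Int = 8) where {T}
    isbitstype(T) || throw(ArgumentError("only fixed-width element types are supported"))
    n = 0
    for x in col
        n += write(io, x)              # raw bytes of one value, native byte order
    end
    pad = mod(-n, alignment)           # bytes needed to reach the next boundary
    n += write(io, zeros(UInt8, pad))
    return n
end

io = IOBuffer()
nbytes = writepadded(io, Int32[1, 2, 3])   # 12 bytes of data plus 4 bytes of padding
written = take!(io)
```

Since the function accepts any `AbstractVector` of fixed-width values, an Arrow-style array type could be passed to it unchanged.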

ExpandingMan commented 6 years ago

Ok, yeah, it would definitely be really nice to abstract some of these functions more. As things are right now, it's very hard to follow how Feather.jl works. There is a huge amount of functionality crammed into functions like Data.streamfrom and writecolumn; ultimately these functions should probably be one-liners that call functions on more fundamental underlying objects (an abstraction of the Arrow structures).

As a first step, if you don't mind I'd really like to reorganize Feather.jl a bit, so things are a little more separated out and easier to find. I'm about to have a PR that will make some organizational changes but no substantive changes. Hopefully, this will make it a little easier for new people (and myself) to contribute. From there, we can start figuring out how this should look with Arrow as a separate module.

quinnj commented 6 years ago

that sounds good to me. thanks!