JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable
Other
109 stars 27 forks source link

how do we deal with `Int32` overflows? #75

Open ExpandingMan opened 6 years ago

ExpandingMan commented 6 years ago

I've been reading up on what happened over on the Python side to deal with arrays with length that excedes typemax(Int32). This is a problem because Arrow specifies that all lengths should be described with Int32 (they say this is to keep people from having contiguous arrays that are too large, but I have to say that it's hard for me to see this as a good justification for this particular case). This issue and this PR sort of make it seem as if they are indeed storing contiguous arrays of length greater than typemax(Int32) but they somehow read them in chunks. I've been avoiding digging through their C++ code in great detail, but I was wondering if @quinnj perhaps has some insight on how this issue might be dealt with. It would seem to me that the most sensible solution would be to change the Feather metadata so that a column can be composed of arbitrarily many Arrow "primitive arrays" but that doesn't seem to be what they've done at all.

davidanthoff commented 6 years ago

There is a pretty active arrow mailing list, maybe ask there?

davidanthoff commented 6 years ago

@wesm would be fantastic if you could help us out here. Essentially we are trying to understand what the correct way to handle more than 2^31 - 1 rows in arrow and feather is. I've read through the arrow spec and that didn't really clear it up for me.

wesm commented 6 years ago

We've allowed 64-bit Arrow lengths for almost a year. Keeping things under INT32_MAX is a guideline, though, for better compatibility with Java. Could you let me know what problem you're having?

Would you all like to join the Apache Arrow project at some point? It would facilitate cross-language collaboration.

davidanthoff commented 6 years ago

Thanks! So the info in the "Array lengths" section in https://arrow.apache.org/docs/memory_layout.html is just outdated and we should ignore that?

Can we also just use 64-bit lengths in Feather and still be fully compatible with the python and R implementation?

I'm not entirely clear what joining the arrow project means, but as long as I don't commit to do work it sounds great. I would certainly like to strengthen the cross-language collaboration, and if I can help do that from julia's side of things, I'm game.

wesm commented 6 years ago

I'm not entirely clear what joining the arrow project means

It means building a Julia developer community in the Apache project (https://www.apache.org/foundation/how-it-works.html). It's about community and governance more than anything. As one symptom of working over here by yourselves, you aren't interacting with the other Arrow developers very often.

Yes -- use 64-bit lengths. The Feather metadata uses 64 bits. I will open a JIRA about updating that document about array lengths (64-bit permitted, < 32-bit preferred)

davidanthoff commented 6 years ago

Cool, thanks!

And also thanks for the pointer on the Apache project. I totally agree that we (the julia community) should interact much more with things that are going on in these other communities. I'll have to find some time to dig through all the material in that link, but it sure looks super useful. I have some more ambitious plans around all of this here on the Berkeley campus, I'll make sure to reach out if that moves beyond vague ideas.

ExpandingMan commented 6 years ago

Awesome! This makes things incredibly simple, essentially no work is required except to ensure that lengths are Int64 on all machines which is trivial.

I agree that we should try to get involved with Apache, though I don't expect my own interest in this stuff to go beyond my commitment to getting Arrow.jl and Feather.jl up and running reliably. (Am curious to hear @davidanthoff grand plans though :wink:)

quinnj commented 6 years ago

Yes, thanks for chiming in here @wesm. In the past, various limitations on julia package registration/installation prevented the code here (or Arrow) from being properly integrated upstream, but improvements on julia master (and soon to be released 0.7/1.0) allow for more flexible arrangements, such as letting the code live side-by-side other implementations. We certainly have interest in integrating efforts with the parent project, so as things progress here, we can work on the integration as well.

wesm commented 6 years ago

I wouldn't let issues related to the structure of the codebase(s) get in the way. If we need to create you a separate apache/arrow-julia repo on an interim basis, we can do that. The important thing is having the community working together and helping each other

quinnj commented 6 years ago

Sounds great! I'll work w/ @ExpandingMan and we can get the arrow-julia code he's been working on in a state to integrate.

wesm commented 6 years ago

OK, the main thing I will ask is that you keep track of whose IP is involved. There is code living in multiple repos from what I can tell, and so the import / clearance process may be somewhat complex (we have to verify copyright and distribution rights for all code imported into an Apache project)

sglyon commented 6 years ago

This sounds great! Thanks Wes

I just (happily) watched the process of merging in the go implementation and I think moving the Julia one under the apache umbrella would be great.

All: we should put together a todo list of what needs to happen to get this "in a state to integrate." I can likely pitch in, but would spend my time better if there was a specific set of tasks we were working towards

ExpandingMan commented 6 years ago

Arrow.jl was created from scratch by me using only the one arrow formatting document and the existing Feather.jl as reference. I deliberately didn't even look at the C++ source code because my goal was to make as "Julian" a package as possible.

I'm not sure what would be involved with apache/arrow-julia but my preference would certainly be that the Arrow.jl package remains under that name somewhere, ideally under one of the Julia github collaborations, as I feel that's where it would be most visible and useful to the Julia community. I'm assuming that would still be compatible with the plans for integrating with Apache?

ExpandingMan commented 6 years ago

@wesm , what do you guys do about the offset values (i.e. for string columns)? I don't see any way in the Feather metadata to specify their type. As far as I can tell the C++ feather implementation always seems to make the offset values Int32, but if you have a buffer larger than the largest Int32 clearly you need them to be able to be Int64.

wesm commented 6 years ago

@ExpandingMan Feather is limited to < 2 GB of total string data per column. This could be fixed by changing the Feather format to use the Arrow streaming/file messaging formats (i.e. chunked files instead of monolithic columns) internally. I'd like to do this as soon as we are able to ship R bindings of Arrow. Presumably Julia would follow. It would be nice to have the Julia code in the same place so we can do integration testing of patches

ExpandingMan commented 6 years ago

Ah, I see, so you are maintaining that limit. That was the whole reason we were asking about the Int32 overflows in the first place, we didn't know how you wanted to deal with large data. Certainly arrow had made it clear that everything should be Int32.

Anyway, I had been giving some thought to how to write more data than can fit in memory and I was indeed thinking that multiple feather files would be the absolute simplest thing to do. Fortunately, in Julia implementing that should be really simple, we'll just need to know what the metadata will look like. From our perspective it would be perfectly fine if every file has a FlatBuffer of some kind, reading them in has very low overhead.