JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable
Other
109 stars 27 forks source link

Sl/nonstring categories #79

Closed sglyon closed 6 years ago

sglyon commented 6 years ago

well... this is probably irrelevant given the work in #78, but this adds support for reading in categorical data whose value type is not string

ExpandingMan commented 6 years ago

Cool! By the way, in case you were curious, I do support this in #78, however in #78 I have deprecated categorical references of types other than Int32, as Int32 is specified in the Arrow standard.

sglyon commented 6 years ago

Nice! I took a look at your PR (very cool btw -- I think it is likely the future of this package) and had to adjust the code to accept non Int32 categorical references. I also noticed that Categorical columns containing strings was not working.

Hopefully I will find time to do a proper review of your PR

ExpandingMan commented 6 years ago

Thanks, I'd appreciate any help you want to offer. Right now I feel like the biggest priority is to do a proper performance pass of Arrow, but I'm hesitant because changes to reinterpret in 0.7 will completely change things.

Anyway, in what case are non-Int32 categorical references an issue for you? It's actually not difficult for me to change Arrow.jl to support them, but I deliberately chose not to as I really wanted to get things standardized to Arrow. I was under the impression that the only thing that ever generated non-Int32 references is Feather.jl master, in which case it should really be deprecated if possible. (We don't want lots of stuff floating around which we generate in Julia and then isn't compatible with anything.)

Categorical columns containing strings most certainly should work, and is according to all the tests. Can you perhpas post an issue or something containing more detail? Are you absolutely sure that the references for those columns were Int32?

nalimilan commented 6 years ago

I guess we could use Int32 rather than UInt32 by default in CategoricalArrays if that allows interoperating with Arrow without any copy. (At some point, negative codes could be used to distinguish different types of missing values.)

ExpandingMan commented 6 years ago

I guess we could use Int32 rather than UInt32 by default in CategoricalArrays if that allows interoperating with Arrow without any copy

I really don't think it would be worth it just for this case. Arrow doesn't seem to be quite that universal, at least not yet. UInt32 is a perfectly reasonable choice. In most cases a copy would have to be made anyway because the indexes aren't the same (1-based by default in CategoricalArrays, always 0-based in Arrow).

nalimilan commented 6 years ago

In most cases a copy would have to be made anyway because the indexes aren't the same (1-based by default in CategoricalArrays, always 0-based in Arrow).

Indeed, I hadn't noticed that. Though if we used Int32 we could use 0 as a valid code and use strictly negative codes for missing values. Anyway, Arrow is not necessarily the most important reference, but if we don't have any strong reason to follow one convention or another, we may as well adopt one which is used elsewhere.

sglyon commented 6 years ago

Hey @ExpandingMan I will need to support non Int32 categorical references. I have some feather files written out with the latest version of pyarrow that contain references of type Int8.

I did notice that there is a huge performance difference between Feather.jl (either current version or the new one based on Arrow.jl) and pyarrow. I don't have the exact numbers in front of me but it was something like 200ms for pyarrow, 20 seconds for old Feather.jl, and 40 seconds for Feather.jl based on Arrow.jl

I'll hopefully find time to explore that more soon.

Categorical columns containing strings most certainly should work, and is according to all the tests. Can you perhpas post an issue or something containing more detail? Are you absolutely sure that the references for those columns were Int32?

The references were indeed not Int32 -- good hunch


Sorry for the brief replies here -- I'm a bit under the weather this week so finding time for day job and then this is difficult.

ExpandingMan commented 6 years ago

Yeah, sorry, I should have made things a little more clear about performance. First of all, I found a silly bug this afternoon that was causing some pretty poor performance, so please pull the most recent commit for Arrow.jl.

Anyway, I'm going to post an issue on Arrow.jl about performance. Basically the story is that on 0.6 you need to sacrifice some performance to get safety. In principle this should be fixable in 0.7 but I haven't been able to check it yet. Tomorrow I'm going to mess around again a bit with making it unsafe again and see how good I can get it. I'm not quite sure what's going on with pyarrow and python feather. I'm pretty sure that they are very good at avoiding copying, which I haven't yet gone quite that crazy trying to avoid here.

I'll post 2 issues about these subjects here and on Arrow.jl.

ExpandingMan commented 6 years ago

The functionality of this PR should now exist in master. Can you confirm and close this? Thanks.