apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

Reading only a subset of columns #78

Open CarlColglazier opened 3 years ago

CarlColglazier commented 3 years ago

Please correct me if this is possible already. I looked through the source code and the documentation and did not find a clear way to do this: basically, I want to read a FeatherV2 file, but not mmap every single column. I already know which columns I need and I'd like to tell Arrow.Table the subset of columns I want read into memory.

This is similar to this issue on Feather.jl.

This seems to be possible in the R arrow package using col_select.

quinnj commented 3 years ago

Hey @CarlColglazier, thanks for opening an issue. We could probably support keyword arguments like select and drop, but note that it wouldn't change how much memory is "mmapped". Arrow tables are stored in a single memory blob and there isn't really a way to only mmap a few columns. You still have to read the header/metadata to figure out the offsets of specific columns into the data.

So, happy to support select/drop, since it can be convenient to only get back the columns you really need, but I just want to point out that I wouldn't expect there to be any real effect on memory/performance.
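To illustrate the point about offsets, here is a minimal Python sketch. It is not Arrow.jl code and does not use the Arrow or Feather file format: the layout (a column count, then name/offset/count entries, then int64 data) and the helpers `write_toy_file` and `read_columns` are hypothetical, invented only for this example. It shows the general pattern described above: the whole file is mapped, the header is read to find each column's offset, and only the selected columns are materialized.

```python
import mmap
import os
import struct
import tempfile

# Toy layout, assumed for illustration only (NOT the Arrow/Feather format):
#   uint32 column count
#   per column: 8-byte padded name, uint64 byte offset, uint64 value count
#   then the int64 values of each column, back to back
HEADER = struct.Struct("<I")
ENTRY = struct.Struct("<8sQQ")


def write_toy_file(path, columns):
    """Write a dict of name -> list of ints in the toy layout."""
    offset = HEADER.size + ENTRY.size * len(columns)
    entries, blobs = [], []
    for name, values in columns.items():
        blob = struct.pack(f"<{len(values)}q", *values)
        entries.append(ENTRY.pack(name.encode().ljust(8, b"\0"), offset, len(values)))
        blobs.append(blob)
        offset += len(blob)
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(columns)))
        f.write(b"".join(entries))
        f.write(b"".join(blobs))


def read_columns(path, select):
    """Map the whole file, then decode only the columns named in `select`."""
    with open(path, "rb") as f:
        # The mapping covers the entire file; selection happens afterwards.
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        (ncols,) = HEADER.unpack_from(buf, 0)
        result = {}
        for i in range(ncols):
            # The header still has to be walked to learn where each column lives.
            raw, offset, count = ENTRY.unpack_from(buf, HEADER.size + i * ENTRY.size)
            name = raw.rstrip(b"\0").decode()
            if name in select:
                result[name] = list(struct.unpack_from(f"<{count}q", buf, offset))
        return result
    finally:
        buf.close()


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_file(path, {"a": [1, 2, 3], "b": [10, 20, 30], "c": [7, 8, 9]})
subset = read_columns(path, select={"a", "c"})
print(subset)
```

In this sketch the `select` argument only changes which columns are decoded and returned; the size of the mapping is the same either way, which mirrors why select/drop would be a convenience rather than a memory saving.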

JayjeetAtGithub commented 3 years ago

I went through the Feather C++ source code, and it seems this hasn't been fixed yet in the upstream C++ API. Am I right?