JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Feather.read fails on bigger data #107

Closed jangorecki closed 5 years ago

jangorecki commented 5 years ago

I am trying to read a feather file generated from a CSV of 40-50GB in size. There are 1e9 rows and 9 columns. Feather was freshly installed from the default repository.

df = Feather.read("data/G1_1e9_2e0_0_0.fea");
#ERROR: BoundsError: attempt to access 21942502592-element Array{UInt8,1} at index [27795022160]
Stacktrace:
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{Int64}) at ./abstractarray.jl:484
 [2] checkbounds at ./abstractarray.jl:449 [inlined]
 [3] check_buffer_bounds at /home/jan/.julia/packages/Arrow/b4oSO/src/utils.jl:169 [inlined]
 [4] Type at /home/jan/.julia/packages/Arrow/b4oSO/src/primitives.jl:36 [inlined]
 [5] locate at /home/jan/.julia/packages/Arrow/b4oSO/src/locate.jl:59 [inlined]
 [6] locate at /home/jan/.julia/packages/Arrow/b4oSO/src/locate.jl:81 [inlined]
 [7] locate at /home/jan/.julia/packages/Arrow/b4oSO/src/locate.jl:76 [inlined]
 [8] constructcolumn(::Type{Float64}, ::Array{UInt8,1}, ::Nothing, ::Feather.Metadata.Column) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:159
 [9] constructcolumn(::Feather.Source{NamedTuple{(:id1, :id2, :id3, :id4, :id5, :id6, :v1, :v2, :v3),Tuple{String,String,String,Int32,Int32,Int32,Int32,Int32,Float64}}}, ::Type{Float64}, ::Int64) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:165
 [10] constructcolumn(::Feather.Source{NamedTuple{(:id1, :id2, :id3, :id4, :id5, :id6, :v1, :v2, :v3),Tuple{String,String,String,Int32,Int32,Int32,Int32,Int32,Float64}}}, ::Int64) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:167
 [11] constructall(::Feather.Source{NamedTuple{(:id1, :id2, :id3, :id4, :id5, :id6, :v1, :v2, :v3),Tuple{String,String,String,Int32,Int32,Int32,Int32,Int32,Float64}}}) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:170
 [12] Feather.Source(::String, ::Type{NamedTuple{(:id1, :id2, :id3, :id4, :id5, :id6, :v1, :v2, :v3),Tuple{String,String,String,Int32,Int32,Int32,Int32,Int32,Float64}}}, ::Feather.Metadata.CTable, ::Array{UInt8,1}) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:13
 [13] #Source#4(::Bool, ::Type, ::String) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:20
 [14] Type at ./none:0 [inlined]
 [15] #read#7(::Bool, ::Function, ::String) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:67
 [16] read(::String) at /home/jan/.julia/packages/Feather/92Jkl/src/source.jl:67
 [17] top-level scope at none:0
jangorecki commented 5 years ago

Writing fails too:

Feather.write("./data/G1_1e9_1e2_0_0_jl.fea", x)
ERROR: InexactError: trunc(Int32, 2147483652)
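The InexactError above is consistent with the format's metadata storing byte offsets as 32-bit integers: the write attempted to record an offset of 2147483652, which is just past typemax(Int32). A minimal sketch reproducing the failure mode (the offset value is taken from the error message; the interpretation as a buffer offset is an assumption):

```julia
# Largest value an Int32 can hold.
println(typemax(Int32))            # 2147483647

# The value Feather.write tried to store, from the error above.
offset = 2_147_483_652

# trunc(Int32, x) throws InexactError when x is outside Int32's range,
# which is exactly the error seen in Feather.write.
try
    trunc(Int32, offset)
catch e
    println(e isa InexactError)    # true
end
```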
ExpandingMan commented 5 years ago

Sorry for the delayed response. Unfortunately the Feather format itself does not support saving individual files larger than 4GB. I know, this really sucks. As far as I know the long term plan was that eventually there'd be some accommodation in the metadata for chaining together 4GB files.

My take on the situation is that Feather was developed before the Arrow format was mature, and what we wound up with is a format with rather messy metadata that doesn't really conform to the Arrow standard (which is why I think they've never addressed this). I've been toying with the idea of creating a new format whose metadata is fully compatible with the Arrow IPC metadata, but since we haven't implemented the Arrow IPC metadata yet, that would take a lot of work and I just haven't had the time to get into it.

Anyway, in the foreseeable future the only option is to break your data into 4GB chunks. I'm completely open to putting some "hack" into Feather.jl that does this more easily (and, most importantly, computes 4GB boundaries for the user, which is really the hardest part). I'll work on this myself if I ever really need it. If anyone else is interested in implementing it we'd welcome a PR.
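A minimal sketch of the chunking workaround described above. The helper name write_chunked, the size-estimation heuristic, and the 50% headroom factor are all assumptions for illustration, not part of Feather.jl; real feather file sizes differ from in-memory sizes because of string encoding and per-column padding, so generous headroom is advisable:

```julia
using DataFrames, Feather

# Hypothetical helper (not part of Feather.jl): split a DataFrame into
# row ranges whose estimated serialized size stays under the format's
# 4GB limit, writing each range to its own feather file.
function write_chunked(basename::AbstractString, df::DataFrame;
                       limit::Int = 4 * 1024^3)
    # Rough in-memory bytes per row; leave 50% headroom since the
    # on-disk size can differ from Base.summarysize's estimate.
    bytesperrow  = ceil(Int, Base.summarysize(df) / max(nrow(df), 1))
    rowsperchunk = max(1, (limit ÷ 2) ÷ bytesperrow)
    files = String[]
    for (i, lo) in enumerate(1:rowsperchunk:nrow(df))
        hi   = min(lo + rowsperchunk - 1, nrow(df))
        file = "$(basename)_$(i).fea"
        Feather.write(file, df[lo:hi, :])
        push!(files, file)
    end
    return files
end
```

Reading the pieces back is then a concatenation, e.g. reduce(vcat, Feather.read.(files)).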

jangorecki commented 5 years ago

I would suggest giving up on Feather and working out some alternative for serializing and deserializing dataframes. R's feather package is not even able to load a 380MB file, raising an Error: C stack usage 7971012 is too close to the limit exception. In Python it segfaults on 19GB data. Trying JLD2 now. Closing, as this 4GB hard limit is a feather format issue, not really a Feather.jl issue.
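For reference, a minimal sketch of the JLD2 route mentioned above, using JLD2's exported @save/@load macros (the file name and DataFrame contents are placeholders, not from the issue):

```julia
using DataFrames, JLD2

df = DataFrame(id = Int32.(1:3), v = [0.5, 1.5, 2.5])

# JLD2 serializes arbitrary Julia objects into an HDF5-compatible
# container and is not subject to feather's per-file metadata limit.
@save "frame.jld2" df    # write the DataFrame to disk
@load "frame.jld2" df    # read it back into the variable `df`
```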