JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable
Other
109 stars 27 forks source link

Updating feather file crashes Julia and wipes file #93

Closed wfrgra closed 6 years ago

wfrgra commented 6 years ago

Reading a feather file followed by a write back to the same file fails:

julia> using DataFrames

julia> using Feather

julia> a = DataFrame(a=[1])
1×1 DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ 1 │

julia> Feather.write("test.feather",a)
Feather.Sink("test.feather", Data.Schema:
rows: 1  cols: 1
Columns:
 "a"  Int64, Feather.Metadata.CTable("", 1, Feather.Metadata.Column[Column("a", PrimitiveArray(INT64, PLAIN, 8, 1, 0, 8), 0, nothing, "")], 2, ""), IOStream(<file test.feather>), "", "", Arrow.ArrowVector[[1]])

julia> a = Feather.read("test.feather")
1×1 DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ 1 │

julia> Feather.write("test2.feather",a)
Feather.Sink("test2.feather", Data.Schema:
rows: 1  cols: 1
Columns:
 "a"  Int64, Feather.Metadata.CTable("", 1, Feather.Metadata.Column[Column("a", PrimitiveArray(INT64, PLAIN, 8, 1, 0, 8), 0, nothing, "")], 2, ""), IOStream(<file test2.feather>), "", "", Arrow.ArrowVector[[1]])

julia> Feather.write("test.feather",a)

signal (7): Bus error
in expression starting at no file:0
unknown function (ip: 0x7f1de2657b34)
unsafe_copyto! at ./array.jl:225 [inlined]
unsafe_copyto! at ./array.jl:244
writepadded at ./array.jl:741
writepadded at /home/will/.julia/packages/Arrow/AKOeX/src/arrowvectors.jl:313 [inlined]
writecontents at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:75 [inlined]
writecontents at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:84
writecolumn at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:91
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
writecolumn at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:95
writecolumns at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:97
close! at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:111
write at /home/will/.julia/packages/Feather/fGqdB/src/sink.jl:50
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:324
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:428
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:363 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:686
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:799
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7f1daf40d1df)
unknown function (ip: (nil))
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:808
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:787
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/builtins.c:622
eval at ./boot.jl:319
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
run_backend at /home/will/.julia/packages/Revise/51oQc/src/Revise.jl:766
#58 at ./task.jl:259
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1536 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:268
unknown function (ip: 0xffffffffffffffff)
Allocations: 41173173 (Pool: 41166355; Big: 6818); GC: 89
[1]    15703 bus error (core dumped)  julia

The file test.feather is now 0 bytes long

wfrgra commented 6 years ago

Happened for me on Julia 1.0 and Feather 0.4, Julia 0.6.4 and Feather 0.3.1 work as expected

ExpandingMan commented 6 years ago

What's happening here is that when you do a = Feather.read("test.feather"), a is referring to "test.feather" and then you are trying to use the filesystem to write to the file you are already memory mapping. I suppose we should be grateful that it doesn't just go along trying to use the dataframe in blissful ignorance only to later segfault. (In this special case where you read and write identical data, I suppose in theory it ought to work, but in the general case it should not and evidently something is catching this behavior.) (And, by the way, don't use this in Julia 0.6.4 and Feather 0.3.1! The result may be unstable and lead to an unexpected segfault later on!)

Ideally we certainly would want Feather.jl to catch such an error somehow rather than sticking users with it, but at the moment I can't think of any good options for how to do this. The only solution I can think of would be to actually reference count every feather file that users write or read to or from, which in an ideal world would be supplied by the Julia Mmap module.

I'm open to suggestions if anyone has any thoughts. I'm not necessarily opposed to reference counting files within Feather, as the performance hit should only be a small one during initialization, though I haven't carefully thought through what this would look like. There might be some catch I'm not seeing.

In the meantime, you should avoid writing over files you currently have open. If you replace your Feather.read with Feather.materialize in the example above, it will work. Alternatively you can set a = nothing; GC.gc() before you write.

nalimilan commented 6 years ago

How do other implementations handle this? Or don't they use mmap?

ExpandingMan commented 6 years ago

As far as I know the other feather implementations do not use memory mapping in this way. After some thought, I think reference counting would be extremely difficult if not impossible to implement correctly. You'd only be able to do it locally and would not be able to determine if other processes are using it.

The right solution is probably to go through the OS. It looks like we might be able to use fuser or lsof but I don't really know how those work. It's possible a file will only show up as in use when it is actually being written to. We'd have to do some digging to figure out if we can actually use something like that for our case.

nalimilan commented 6 years ago

Maybe Feather.write could delete the existing file and create a new one? IIUC, the OS will then keep the old file visible to the process which mmapped it until it calls munmap.

ExpandingMan commented 6 years ago

That's a really good (and simple) idea. I'll experiment with that a bit.

I was going to mention that we have FileWatching, but I'm pretty sure that's only useful for actual writes to files.

nalimilan commented 6 years ago

In addition to this, you could also use flock or fnctl to lock the file, which will allow you (or another program) to check whether it's OK to write to the file. But it's not enforced by the OS, so it will only have an effect for programs that check it manually.

ExpandingMan commented 6 years ago

See #94. As far as I can tell, that's a rather elegant fix. Indeed, if you have an object in memory that's referring to a file and then you delete that file, it works just fine.