Closed wfrgra closed 6 years ago
Happened for me on Julia 1.0 and Feather 0.4; Julia 0.6.4 and Feather 0.3.1 work as expected.
What's happening here is that when you do a = Feather.read("test.feather"), a refers to memory-mapped data in "test.feather", and then you are trying to use the filesystem to write to the file you are already memory mapping. I suppose we should be grateful that it doesn't just go along using the dataframe in blissful ignorance only to segfault later. (In this special case, where you read and write identical data, I suppose it ought to work in theory, but in the general case it should not, and evidently something is catching this behavior.) (And, by the way, don't rely on this in Julia 0.6.4 and Feather 0.3.1! The result may be unstable and lead to an unexpected segfault later on!)
Ideally we would certainly want Feather.jl to catch such an error itself rather than sticking users with it, but at the moment I can't think of any good options for how to do this. The only solution I can think of would be to reference count every feather file that users read from or write to, which in an ideal world would be supplied by the Julia Mmap module.
I'm open to suggestions if anyone has any thoughts. I'm not necessarily opposed to reference counting files within Feather, as the performance hit should only be a small one during initialization, though I haven't carefully thought through what this would look like. There might be some catch I'm not seeing.
In the meantime, you should avoid writing over files you currently have open. If you replace your Feather.read with Feather.materialize in the example above, it will work. Alternatively, you can set a = nothing; GC.gc() before you write.
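The two workarounds above can be sketched as follows. This is only an illustration under assumptions: the scratch path and the column name x are made up for the example, and it assumes Feather 0.4 with DataFrames installed. Whether GC.gc() releases the mapping promptly depends on nothing else still holding a reference to the data.

```julia
using DataFrames, Feather

# Hypothetical scratch file for the demonstration.
path = joinpath(mktempdir(), "test.feather")
Feather.write(path, DataFrame(x = [1, 2, 3]))

# Workaround 1: materialize copies the data fully into memory, so nothing
# stays memory mapped and the file can be overwritten.
a = Feather.materialize(path)
Feather.write(path, a)

# Workaround 2: read lazily, then drop the only reference and run the
# garbage collector so the mapping is released before writing.
b = Feather.read(path)
b = nothing
GC.gc()
Feather.write(path, DataFrame(x = [4, 5, 6]))
```

The first form is the safer default when the table fits in memory, since it does not depend on when the garbage collector runs.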
How do other implementations handle this? Or don't they use mmap?
As far as I know the other feather implementations do not use memory mapping in this way. After some thought, I think reference counting would be extremely difficult if not impossible to implement correctly. You'd only be able to do it locally and would not be able to determine if other processes are using it.
The right solution is probably to go through the OS. It looks like we might be able to use fuser or lsof, but I don't really know how those work. It's possible a file will only show up as in use when it is actually being written to. We'd have to do some digging to figure out if we can actually use something like that for our case.
Maybe Feather.write could delete the existing file and create a new one? IIUC, the OS will then keep the old file visible to the process which mmapped it until it calls munmap.
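The delete-then-recreate idea can be tried out with the standard Mmap module alone. The sketch below is not Feather.jl code; the file name and byte values are made up, and it assumes a POSIX system (Linux or macOS), since Windows generally refuses to delete a file that is still mapped.

```julia
using Mmap

# Hypothetical scratch file holding four known bytes.
path = joinpath(mktempdir(), "demo.bin")
write(path, UInt8[1, 2, 3, 4])

# Map the file into memory; the mapping stays valid after the stream is closed.
mapped = open(path, "r") do io
    Mmap.mmap(io, Vector{UInt8}, (4,))
end

# Replace the file rather than truncating it in place:
# remove the directory entry, then create a brand new file at the same path.
rm(path)
write(path, UInt8[9, 9, 9, 9])

# The old mapping still refers to the original (now unlinked) file,
# while the path points at the new contents.
old_bytes = collect(mapped)
new_bytes = read(path)
```

Here old_bytes holds the original data and new_bytes the replacement, which is the behaviour the suggestion relies on.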
That's a really good (and simple) idea. I'll experiment with that a bit.
I was going to mention that we have FileWatching, but I'm pretty sure that's only useful for actual writes to files.
In addition to this, you could also use flock or fcntl to lock the file, which will allow you (or another program) to check whether it's OK to write to the file. But it's not enforced by the OS, so it will only have an effect for programs that check it manually.
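A minimal sketch of advisory locking with flock, called directly through ccall. Everything here is an assumption made for illustration: the helper names try_exclusive_lock and release_lock are hypothetical, the constants are the usual values from <sys/file.h> on Linux and macOS, and none of this is part of Feather.jl.

```julia
# Usual flock constants on Linux and macOS (assumed, not taken from a Julia API).
const LOCK_EX = Cint(2)  # exclusive lock
const LOCK_NB = Cint(4)  # fail instead of blocking when the lock is held
const LOCK_UN = Cint(8)  # release the lock

# Hypothetical helpers: flock returns 0 on success and -1 on failure.
try_exclusive_lock(io::IOStream) =
    ccall(:flock, Cint, (Cint, Cint), fd(io), LOCK_EX | LOCK_NB) == 0
release_lock(io::IOStream) =
    ccall(:flock, Cint, (Cint, Cint), fd(io), LOCK_UN) == 0

# Hypothetical scratch file, opened twice to stand in for a reader and a writer.
path = joinpath(mktempdir(), "demo.lock")
touch(path)
reader = open(path, "r")
writer = open(path, "r+")

reader_has_lock = try_exclusive_lock(reader)  # first holder succeeds
writer_blocked = !try_exclusive_lock(writer)  # second handle is refused
release_lock(reader)
writer_has_lock = try_exclusive_lock(writer)  # succeeds once released

release_lock(writer)
close(reader)
close(writer)
```

As noted above, this only helps cooperating programs: a process that never calls flock can still open and overwrite the file.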
See #94. As far as I can tell, that's a rather elegant fix. Indeed, if you have an object in memory that's referring to a file and then you delete that file, it works just fine.
Reading a feather file followed by a write back to the same file fails:
The file test.feather is now 0 bytes long