JuliaData / JuliaDB.jl

Parallel analytical database in pure Julia
http://juliadb.org/

JuliaDB.save and JuliaDB.load operations are locking the file on Windows #168

Open bmharsha opened 6 years ago

bmharsha commented 6 years ago

Here are the minimal steps that can be used to reproduce this issue on Windows

[screenshot of the reproduction steps]

You can delete the files above only after you quit the Julia process.

MaximilianJHuber commented 6 years ago

JuliaDB memory-maps the table to the file. Why would you want to delete it?
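
To illustrate why a memory-mapped file stays held, here is a minimal sketch that uses only the Mmap standard library rather than JuliaDB itself. The temporary file and the variable names are made up for the example, and the claim about Windows refusing the delete is the behaviour reported in this issue, not something the snippet checks.

```julia
using Mmap  # standard library

# Create a small scratch file to map.
path, io = mktemp()
write(io, zeros(UInt8, 16))
close(io)

# Map the file into memory; the mapping outlives the closed stream.
mapped = open(path) do f
    Mmap.mmap(f, Vector{UInt8}, 16)
end
total = sum(mapped)  # reading goes through the mapping

# While `mapped` is alive, Windows is expected to refuse rm(path).
finalize(mapped)   # run the array's finalizer now, which unmaps the file
mapped = nothing   # never touch the array again after unmapping
rm(path)
```

On Linux and macOS the `rm` would succeed even without the `finalize` call, so the difference only shows up on Windows.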

bmharsha commented 6 years ago

> Why would you want to delete it?

I was trying to fix https://github.com/JuliaComputing/JuliaDB.jl/issues/166 and hit this error as the tests progressed (more details in this comment). It looked like a bug, so I reported it.

deo1 commented 4 years ago

> JuliaDB memory-maps the table to the file. Why would you want to delete it?

One scenario I can think of: you save data to disk for long-term storage, but may overwrite the on-disk version within the lifetime of the same Julia process. A common pattern is, e.g.:

files:

You may load, merge, overwrite multiple times during the process lifetime. This is the ETL pipeline scenario, not the (potentially) out-of-memory analytics scenario.

A simple workaround that I have used is to override save(data::Dataset, f::AbstractString) from src/io.jl with a more specific dispatch signature that leaves out the trailing load(f) call.

e.g.

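using JuliaDB, MemPool  # `using JuliaDB` brings the JuliaDB and IndexedTable names into scope

# More specific method than save(::Dataset, ::AbstractString):
# it writes the table to disk but skips the reload.
function JuliaDB.save(data::IndexedTable, f::AbstractString)
    open(f, "w") do io
        MemPool.serialize(io, MemPool.MMWrap(data))
    end
    # load(f)  # removed so the file is not reloaded (and held) after writing
end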

Apologies, after experimenting a bit more: the override above doesn't cover the scenario where you save, load, and save again. The simplest workaround for that seems to be to use the low-level serialize and deserialize methods directly.

e.g.

using JuliaDB, MemPool

f = "./data/test/scratch.bin"
mkpath(dirname(f))  # make sure the target directory exists
t1 = table((a = [1,2,3,4], b = [1,3,6,6]), pkey=:b)

# save, load, then overwrite the same file and load again
MemPool.serialize(f, t1)
t2 = MemPool.deserialize(f)
MemPool.serialize(f, merge(t1, t2))
t3 = MemPool.deserialize(f)

@assert t3 == merge(t1, t2)
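
The load, merge, overwrite cycle from the ETL scenario could be wrapped in a small helper, roughly as below. This is only a sketch: it uses the Serialization standard library (path-based `serialize`/`deserialize`, Julia 1.1 or later) with a plain `Dict` standing in for a JuliaDB table, and `merge_and_overwrite` is a hypothetical name, not part of JuliaDB or MemPool.

```julia
using Serialization  # standard library; stands in for MemPool in this sketch

# Hypothetical helper: merge `new` into whatever is stored at `path`,
# then overwrite the file with the merged result.
function merge_and_overwrite(path::AbstractString, new::Dict)
    old = isfile(path) ? deserialize(path) : Dict{Symbol,Int}()
    merged = merge(old, new)   # entries in `new` win on key collisions
    serialize(path, merged)    # plain write; nothing stays mapped afterwards
    return merged
end

path = joinpath(mktempdir(), "scratch.bin")
merge_and_overwrite(path, Dict(:a => 1))
merge_and_overwrite(path, Dict(:b => 2))
result = merge_and_overwrite(path, Dict(:a => 3))
```

Because each call reads the file fully into memory and then rewrites it, this suits data that fits in RAM, which matches the ETL case rather than the out-of-memory analytics case.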