Hmm... That would seem to violate one of LevelDB's design contracts: all writes
go to Level 0, which is in memory.
Could you instead snapshot the database, do the writes individually, then toss
away the snapshot?
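For concreteness, a minimal sketch of that suggestion against the public leveldb API; the WriteBlobChunks helper and the chunk-key naming are made up for illustration, and note that a leveldb snapshot only pins a consistent read view, it does not block other writers:

    #include <string>
    #include <vector>
    #include "leveldb/db.h"

    // Hypothetical helper: write one large blob as many individual Puts
    // instead of a single huge WriteBatch.
    leveldb::Status WriteBlobChunks(leveldb::DB* db,
                                    const std::string& key_prefix,
                                    const std::vector<std::string>& chunks) {
      // Readers that pass this snapshot in ReadOptions keep seeing the
      // pre-write state while the chunks trickle in.
      const leveldb::Snapshot* snap = db->GetSnapshot();
      leveldb::Status s;
      for (size_t i = 0; i < chunks.size() && s.ok(); ++i) {
        s = db->Put(leveldb::WriteOptions(),
                    key_prefix + "/" + std::to_string(i), chunks[i]);
      }
      db->ReleaseSnapshot(snap);  // "toss away the snapshot"
      return s;
    }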
Original comment by cbsm...@gmail.com
on 8 May 2012 at 10:38
Then we'd have to prevent anyone else from modifying any part of the database
while we were streaming in a potentially-huge blob, which would be awkward.
And if we crashed while doing the streaming, we'd have to manually validate
our state and clean up the next time we started up, which is exactly what
transactions are supposed to prevent you from having to do.
But about that contract: where is that written? I just looked around and found
http://leveldb.googlecode.com/svn/trunk/doc/impl.html, which doesn't say
anything about level 0 having to be in memory. We'll quickly blow past the 1MB
limit for level 0 anyway. And is that level-0 table for things that are
already committed, or things being prepped for commit [in a WriteBatch like
what I'm talking about]?
Original comment by ericu@chromium.org
on 8 May 2012 at 10:54
For Chromium's IndexedDB implementation we could construct multiple leveldb
WriteBatches per IDB transaction; we just don't currently do so. I don't
believe leveldb would be limiting us here.
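A rough sketch of what "multiple WriteBatches per IDB transaction" could look like; the ApplyInPieces helper and the batching threshold are made up here, and it assumes atomicity across the pieces is enforced by the IndexedDB layer rather than by leveldb:

    #include <string>
    #include <utility>
    #include <vector>
    #include "leveldb/db.h"
    #include "leveldb/write_batch.h"

    // Apply a long list of (key, value) puts as several smaller batches so
    // that no single WriteBatch has to hold the whole payload in memory.
    leveldb::Status ApplyInPieces(
        leveldb::DB* db,
        const std::vector<std::pair<std::string, std::string> >& ops,
        size_t ops_per_batch) {
      leveldb::WriteBatch batch;
      size_t in_batch = 0;
      for (size_t i = 0; i < ops.size(); ++i) {
        batch.Put(ops[i].first, ops[i].second);
        if (++in_batch == ops_per_batch) {
          leveldb::Status s = db->Write(leveldb::WriteOptions(), &batch);
          if (!s.ok()) return s;
          batch.Clear();
          in_batch = 0;
        }
      }
      return db->Write(leveldb::WriteOptions(), &batch);  // flush the remainder
    }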
Original comment by jsb...@google.com
on 9 May 2012 at 12:29
My bad. It's not Level-0, but the memtable. If your records are that big they'd
effectively cause churning on the memtables. One could conceivably have an API
that bypasses the memtables, but you'd still have pretty massive churn as it
seems likely you'd trigger compactions of Level-0 and Level-1, if not Level-2.
The net effect of all this might not be that different from locking down
updates to the database while you stuff in this giant record.
Would you then pass in these bits of memory as mmap'd memory regions to avoid
them necessarily being pulled into RAM?
It almost seems like after a certain size, you would be much better off having
a blob store which is outside of LevelDB: write to the filesystem and merely
store the path to the file in LevelDB. This is something that could be
constructed as a layer on top of LevelDB. Just a thought.
Original comment by cbsm...@gmail.com
on 9 May 2012 at 1:28
Re: mmap--It's probably not necessary to make the API that complicated. We'd
just keep the chunks reasonably small.
We'd like to avoid having a separate blob store. The goal is to have
everything live in a single database for atomicity of transactions.
Original comment by ericu@google.com
on 9 May 2012 at 2:51
> Currently WriteBatch is an in-memory construct, so even if we break up the
> Blob into chunks, it's all going to be kept in memory anyway. I'd like to
> request that you add either a WriteBatch that spills to disk as you build
> it, or some other way to stream large data into the db.
This will add a lot of complication to the leveldb implementation.
> Re: mmap--It's probably not necessary to make the API that complicated.
> We'd just keep the chunks reasonably small.
That may be true for your project, but perhaps other users of leveldb
will want the decision to be made differently. I am disinclined to add
complications and options to leveldb if there is a reasonable way for
the application to achieve what it wants at a higher level.
> We'd like to avoid having a separate blob store. The goal is to have
> everything live in a single database for atomicity of transactions.
You can do something fairly straightforward without any leveldb
changes. Here is a sketch of one such implementation. Suppose you
want to write blob B under key K. Also suppose that B is split up
into n reasonably sized chunks B1..Bn.
(1) First store blob chunks directly in the filesystem, one chunk per
file, with a unique id used to determine the file name Fi for each chunk.
(2) Use a normal leveldb op to store the filenames F1..Fn under key
K. You can atomically commit multiple blobs here if you need to.
Also, to speed up step (3), in a small reserved section of the
keyspace, record a mapping like this
file:K,F => S
where K is the key, F is the file name, and S is the file size.
(3) Do a periodic garbage collection pass:
Delete all files F in the file system unless you can find
an entry of the form "file:K,F=>S" in leveldb and the corresponding
leveldb entry for K contains F.
(4) Augment your calls to DB::GetApproximateSizes() so that when
computing the size for range [A..B), you scan over the range
[file:A, .. file:B,) and add up the sizes you find.
By storing the large values directly in the file system, you avoid
paying any compaction overhead where multi-GB values get rewritten a
few times and/or cause lots of background compaction work.
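A minimal sketch of steps (1) and (2), assuming an open leveldb::DB*; the chunk file naming, the "file:" key layout, and the newline-separated file list stored under K are illustrative choices, not part of leveldb:

    #include <cstdio>
    #include <string>
    #include <vector>
    #include "leveldb/db.h"
    #include "leveldb/write_batch.h"

    // (1) Write each chunk Bi to its own file and return the names F1..Fn.
    // A real implementation would use a globally unique id per chunk and
    // check for I/O errors; both are omitted here for brevity.
    std::vector<std::string> WriteChunkFiles(
        const std::string& blob_dir, const std::vector<std::string>& chunks) {
      std::vector<std::string> names;
      for (size_t i = 0; i < chunks.size(); ++i) {
        std::string name = blob_dir + "/blob_" + std::to_string(i);
        FILE* f = fopen(name.c_str(), "wb");
        fwrite(chunks[i].data(), 1, chunks[i].size(), f);
        fclose(f);
        names.push_back(name);
      }
      return names;
    }

    // (2) Atomically commit the file list under K, plus one "file:K,F => S"
    // record per chunk for the GC pass and for GetApproximateSizes().
    leveldb::Status CommitBlob(leveldb::DB* db, const std::string& k,
                               const std::vector<std::string>& names,
                               const std::vector<std::string>& chunks) {
      leveldb::WriteBatch batch;
      std::string file_list;
      for (size_t i = 0; i < names.size(); ++i) {
        file_list += names[i] + "\n";
        batch.Put("file:" + k + "," + names[i],       // file:K,F
                  std::to_string(chunks[i].size()));  // => S
      }
      batch.Put(k, file_list);
      return db->Write(leveldb::WriteOptions(), &batch);
    }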
Some possible improvements that may or may not be worthwhile:
(a) Use the SHA1 hash of the chunk contents for the filename so
that if the same value is written multiple times, you just need
to store one file. GC will become marginally more complicated.
(b) You can keep a copy of the filenames and associated keys in memory
so that you can delete a file immediately after overwriting or
deleting a key. Though you will still need a GC phase in case your
process crashes just after the key overwrite/deletion completes.
(c) You can wrap this whole logic in a brand new class that implements
the leveldb::DB interface so that the code that calls leveldb does not
need to change.
(d) If you really wish, you could store the blob chunks in a reserved
key range in leveldb instead of in the file system. Just be careful
that when you garbage collect, you can do so without having to pull
the blob chunks into memory.
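Even with improvement (b), a pass like step (3) is still needed after a crash. Here is a rough sketch of such a pass, reusing the assumed "file:K,F => S" layout and file list encoding from the sketch above, with C++17 std::filesystem standing in for the directory scan:

    #include <filesystem>
    #include <set>
    #include <string>
    #include "leveldb/db.h"

    // (3) A blob file F is live only if some "file:K,F => S" record exists
    // and the leveldb entry for K still lists F; everything else is deleted.
    void GarbageCollectBlobFiles(leveldb::DB* db, const std::string& blob_dir) {
      std::set<std::string> live;
      leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
      for (it->Seek("file:"); it->Valid() && it->key().starts_with("file:");
           it->Next()) {
        std::string rest = it->key().ToString().substr(5);  // "K,F"
        size_t comma = rest.rfind(',');                      // F contains no ','
        std::string k = rest.substr(0, comma);
        std::string f = rest.substr(comma + 1);
        std::string file_list;
        if (db->Get(leveldb::ReadOptions(), k, &file_list).ok() &&
            file_list.find(f) != std::string::npos) {
          live.insert(f);                                    // still referenced
        } else {
          db->Delete(leveldb::WriteOptions(), it->key());    // stale size record
        }
      }
      delete it;
      for (const auto& entry : std::filesystem::directory_iterator(blob_dir)) {
        if (live.count(entry.path().string()) == 0) {
          std::filesystem::remove(entry.path());             // orphaned chunk file
        }
      }
    }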
Original comment by san...@google.com
on 9 May 2012 at 5:00
Glad to see you say this Sanjay!
LevelDB's persistence model is really in conflict with such large data values
anyway. IndexedDB, being an API rather than an implementation, requires a more
complex interface as it is more of a document store (something more like
CouchBase) than a record store (something more like LevelDB), but could serve
as either. Looking at the IndexedDB API, you would be much better served by
storing values above a certain size (a cutoff in the 1K-4K range I should
think, but a not unreasonable approach would be to have the cutoff effectively
be 0 bytes and store only indexes/keys inside LevelDB) in a blob store. Using a
separate streaming API where you break things up into chunks is going to add
far more complexity to your code anyway.
I'd actually be tempted to suggest you don't even split the data up: just store
the data as a single blob. Reserve a lead byte (or bit) in your value records
to indicate whether you have an external blob (in which case the rest of the
value is the path) or an internally stored record (in which case the rest of
the value is the actual data). You can potentially avoid the need for Sanjay's
garbage collection if you write initially to a tempfile, rename/hard link after
inserting a LevelDB record, and treat records pointing to non-existent paths as
"never completed commit".
The slice interface LevelDB provides should allow internally stored records to
still be accessed quite efficiently, and would avoid significant performance
complications stemming from the large records. I think you could even store the
blobs in an appropriately named subdirectory inside the LevelDB database
directory.
You're going to need something like this indirection mechanism for your indexed
fields anyway (perhaps for efficiency with these you'd store both the path and
an offset into the file for the attribute in question).
Original comment by cbsm...@gmail.com
on 9 May 2012 at 8:50
Original issue reported on code.google.com by
ericu@chromium.org
on 8 May 2012 at 10:31