Poor read performance. - Githubissues

stefan-guggisberg commented 10 years ago

i am working on an mvcc hierarchical store implementation supporting structured, semi-structured and unstructured data using leveldown for persistence.

i've experimented with locket but hit strange performance issues.

i populated the db with ~200k entries. keys are binary, 20 bytes (sha1 of value). values are binary, a couple of hundreds to a few thousand bytes each.

the entries make up a mvcc git-like tree model.

populating the db was surprisingly fast, i read the entries from a leveldb instance and wrote them in batches of 10k entries. write performance was ~0.5ms/entry.

reading however is very slow, on average ~300ms/entry for random keys.

i store a special entry with if an zero id ([ 0,0,0, .. ]). reading this special entry takes 4-6 seconds...

when i run the same code with leveldown instead of locket i get consistent 0.5ms/entry for reading.

i am using default options for locket. reading and writing is done with {asBuffer: true}.

locket is a great project and i'd love to be an early adopter. is there anything i can tweak in order to get better read performance?

i'd be happy to share the locket files (~40mb zip) if it helps.

UPDATE: Performance Task List

[x] Implement a simplified record structure so that the record writing does not look so terrible and checksums are pluggable.
[x] Move the balance operations out of Balancer and into Sheaf in Strata.
[x] Create JSPerf style benchmarks for Cadence, compare to other control flow libraries to see if it much slower.
[ ] Strip Strata of anything that is a shim.
[ ] Create a profiler library to wrap Strata functions.
[x] Create a benchmark directory for Strata in Locket.
[ ] Begin caching aggressively.

bigeasy commented 10 years ago

Version 0.0.2 has been released. If you're still interested in using Locket, try it again. Note that you must find time to call ._merge(callback) every now and again. This will be automated at some point, but it's at user discretion now. Performance should improve considerably, provided there are no show stopping bugs. I'll be putting this into production soon, so I'll have an opportunity to exercise it myself.

I have more time for this project at the moment, so if you do resume adoption, I will be able to respond to your questions in a timely manner.

stefan-guggisberg commented 10 years ago

i tried again with v0.0.2. i noticed that batch writes are ~7x slower now. writing a batch of 10k entries takes now 35s (compared to 5s with 0.0.1). i am calling ._merge after each batch write.

reads are still very slow. reading the entry with the 20 byte all zero buffer key takes 7 seconds... reading entries with random keys takes on average 300ms.

if i switch from locket to leveldown i get consistent reads of <1ms.

bigeasy commented 10 years ago

I'm now using Node.js streams instead of calling fs.write directly and it is that much slower? Ugh.

Can you share the code or data with me?

stefan-guggisberg commented 10 years ago

i am using binary values keyed by a 20-byte key (sha1 of value).

i am testing with product data (200k entires) which i am unfortunately not allowed to share. i will have to create mock data which might take some time.

i cannot share the code but i can make up some dummy code which performs random reads.

bigeasy commented 10 years ago

Create a benchmark. It is two orders of magnitude slower than LevelDOWN.

 % time node load.js
count 2568
node load.js  20.71s user 1.09s system 102% cpu 21.193 total
 % time node load.js --leveldown
count 2568
node load.js --leveldown  0.21s user 0.05s system 103% cpu 0.246 total

Dreadful.

Need to determine how much slower Cadence makes an operation. I short circuited some Cadence in a hot spot to see if it help. The binary search allows for lazy loading of leaf page entries, so for leaf pages it needs to be asynchronous. Branch pages are always loaded at once, so if when searching a branch page Strata uses a while loop. With Locket's staging trees, the records are always in memory, at least in the current implementation. I tried short circuiting the call to the function that reads the record, checking to see if the record is in memory already. That short circuit did nothing for these coarse timings. Thus, while I plan on benchmarking Cadence against alternative asynchronous library implementations and synchronous constructs, it is best to look elsewhere for performance increases first.

Moving to streams was a mistake. I'm queuing up a Buffer for each record in a write stream. Before I was calling write myself, so at least there was far less abstraction between Strata and the file system.

However, the problem isn't the layers between the write and the file system, it is going to be the frequent tiny writes. I need to be working with buffers. In fact, I need to let go of two notions that still plague my thinking.

Locket and Strata are written in and for Node.js, not JavaScript.
Strata manipulates Objects and Buffers not files or Streams.

Thus, Strata needs to be stripped of anything that is not Node.js or a shim to allow it work in places other than Node.js. Once you go down an optimization path, focusing on a specific platform is the first, obvious optimization.

Splitting out the file system operations should be done only for isolation and testing, not so that they can be replaced for non-Node.js alternatives. It's not that there are a lot of shims, but there's still a nattering voice in my thinking that is telling me that these things could be further abstracted. You can see it in a bit of the initialization code that there are options that ought not to be optional. These are remnants, but they exist, they are echoes of past voices, and I am highly suggestible.

A database does not store streams. That's what a file system does. I'm not implementing a Node.js file system, I'm implementing a Node.js database.

However, this does not mean a rewrite. Staccato still makes sense, converting stream writes to error-first callbacks, but it shouldn't be called for each record. I don't want to block on fs.read when I don't have to, so Staccato stays.

I'm adding Buffer building logic. Building large Buffers and writing them out. I'm also going to remove the pre-mature space optimizations and slurp pages so that when a page is visited, all of the records are in memory. My concern was that because pages are essentially logs, they might get very long and we could start from the end of the log and load records as needed, the binary search might only need a handful of records. But, the b-tree is already supposed to be a solution for this. But, this is probably a pre-mature dis-optimization.

How do you isolate what is slow?

I tried using v8's profiler and profviz, but all I can see is that you call Cadence a lot. I know that. When operations are broken up into many small functions, and when an operation does not complete until a callback is invoked, you can't see timings in the call stack based reports generated by most profilers. The particular steps in an asynchronous flow do not stand out as slow. The slowness is between the steps and not seen in a profile.

Because I've stuck with error-first callbacks, it is easy to isolate the duration of an operation by simply wrapping a function that takes an error-first callback. I'll create something that does just that and wrap the functions in Strata to get timings with names excluding the control flow library overhead.

A task list...

Implement a simplified record structure so that the record writing does not look so terrible and checksums are pluggable.
Move the balance operations out of Balancer and into Sheaf in Strata.
Create JSPerf style benchmarks for Cadence, compare to other control flow libraries to see if it much slower.
Strip Strata of anything that is a shim.
Create a profiler library to wrap Strata functions.
Create a benchmark directory for Strata in Locket.
Begin caching aggressively.

Note that Strata is where the slow is, but I'll be exercising Strata through benchmarks maintained in Locket. I'd end up re-creating all the different ways in which Locket invokes Strata in a Strata benchmark.

bigeasy commented 9 years ago

Performance has improved dramatically. Reads are still slowish, but not dreadful. Writes, for your basic case are fast. Using just Strata, performance is an order of magnitude away from LevelDB on write, and three times slower on read. This is down from two orders of magnitude.

The -d switch below means to use LevelDB.

 % node locket/benchmark/load.js  -d
insert: 46ms, gather: 66ms
 % node locket/benchmark/load.js
insert: 510ms, gather: 1149ms
 % node strata/benchmark/load.js
insert: 261ms, gather: 133ms

There is still room for improvement in both Locket and it's basis, Strata. There is a lot of room for improvement in Locket read times.

bigeasy commented 9 years ago

And so, I'm going to close this when I get the gather benchmark closers to 200ms.

bigeasy / locket

Poor read performance. #138