bmatsuo / lmdb-go

Bindings for the LMDB C library
BSD 3-Clause "New" or "Revised" License

lmdb: Return dedicated ErrNotFound from Txn.Get and Cursor.Get #65

Closed: lmb closed this 8 years ago

lmb commented 8 years ago

This avoids allocating a new OpError every time we look up a key that does not exist, saving two allocations per lookup.

I have a use case where most lookups are for keys that don't exist in the database, which is where this comes in handy. I considered changing operrno at first, but unfortunately mdb_dbi_open returns MDB_NOTFOUND as well, and returning ErrNotFound in that case would have been too confusing, I think.
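For illustration, the pattern boils down to something like this standalone sketch (not the PR's actual code, and the error text here is made up):

package main

import (
	"errors"
	"fmt"
)

// A single sentinel allocated once at package init. Returning it on
// every miss costs nothing per lookup, whereas wrapping the errno in
// a freshly allocated struct would cost an allocation per call.
var ErrNotFound = errors.New("mdb_get: MDB_NOTFOUND: no matching key/data pair found")

// get stands in for Txn.Get; in this sketch every key is a miss.
func get(key []byte) ([]byte, error) {
	return nil, ErrNotFound
}

func main() {
	if _, err := get([]byte("missing")); err == ErrNotFound {
		fmt.Println("key not found, no per-call allocation")
	}
}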

Not sure I got all the documentation changes right; lmk what needs fixing.

bmatsuo commented 8 years ago

This is an interesting proposal, though I feel compelled to reject it as it stands. It is not technically backwards compatible: someone could potentially be counting on the value being wrapped, and I don't really want to wager on that chance.

Wrapping errno values with OpError was also a very deliberate choice. Before, when errno values were used directly, I found that passing errors up the call stack made tracing down a random MDB_NOTFOUND or EINVAL extremely painful. While MDB_NOTFOUND is special in the case of Txn.Get/Cursor.Get (though not really in mdb_dbi_open, as you point out), I'm still skeptical of the cost versus the benefit of this change.

However, I am more open to introducing a new API that returns an Errno value instead of error (an interface, for which avoiding an allocation is actually impossible).

func (txn *Txn) GetErrno(dbi DBI, key []byte) ([]byte, []byte, Errno)
func (cur *Cursor) GetErrno(key []byte, val []byte, op uint) ([]byte, []byte, Errno)
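Hypothetical usage, with stub types standing in for the proposed API (-30798 is MDB_NOTFOUND's actual value in lmdb.h; everything else here is a stand-in):

package main

import "fmt"

// Errno mimics a concrete (non-interface) error code type.
type Errno int

const NotFound Errno = -30798 // MDB_NOTFOUND

// getErrno mirrors the proposed Txn.GetErrno shape; every lookup
// misses in this sketch.
func getErrno(key []byte) ([]byte, []byte, Errno) {
	return nil, nil, NotFound
}

func main() {
	// Because Errno is concrete, no interface boxing (and hence no
	// heap allocation) happens on the error path.
	if _, _, errno := getErrno([]byte("missing")); errno == NotFound {
		fmt.Println("not found")
	}
}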

This would be compatible, and I think the explicit nature of it is kind of nice. What do you think? I probably won't move on this today, so don't worry about updating ASAP.

It would also be helpful to have some benchmark numbers showing what kind of slowdown you actually see in an application with and without this change. If you could come up with those numbers, I would appreciate it.

bmatsuo commented 8 years ago

Oh yeah. Did you actually see the number of allocations go down in that benchmark you wrote? Because, as I mentioned, I think a value type (non-pointer) will actually get allocated on the heap when you turn it into an interface value.

Sorry for not putting that together sooner.

lmb commented 8 years ago

Hmm, I see your point.

The allocations come from allocating a new OpError every time we get MDB_NOTFOUND, so yes, allocations drop. Runtime isn't my main concern; I need to avoid churning the GC. I'm not sure I follow what you wrote about value types being allocated on the heap.

Tbh, returning MDB_NOTFOUND is a hack that is only necessary because C does not have multiple return values. The natural interface in Go would be for Get to return nil, nil if the key does not exist.

How about this: I turn ErrNotFound into an OpError with the appropriate Errno, but a generic op?

bmatsuo commented 8 years ago

Here is what I mean about non-pointers being used as interface values:

First, just think about it: an interface has to contain a pointer to the contained value, because the contained value can have any size, which is not known at compile time. So if I return a 100-byte struct that implements error from a function, where does the interface's pointer point?

Now see this example: http://play.golang.org/p/9mtA2PxU4r

And this benchmark: http://play.golang.org/p/HtebvZSuKX

Running the benchmark shows that there is one alloc/op:

BenchmarkFailThingy-4   30000000            40.4 ns/op         8 B/op          1 allocs/op
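For readers who can't open the links, the snippets are roughly along these lines (my reconstruction, not the original playground code):

package main

import "fmt"

// thingy is a non-pointer type that implements error.
type thingy struct{ code int }

func (t thingy) Error() string { return fmt.Sprintf("thingy %d", t.code) }

// fail returns its result through the error interface. Boxing the
// struct into the interface value is what can force a heap
// allocation on every call.
func fail(i int) error {
	t := thingy{code: i}
	if i < 0 { // a branch like this keeps the compiler from proving the value dead
		return nil
	}
	return t
}

func main() {
	fmt.Println(fail(1))
}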

And I would ask you again: are you sure that allocations dropped? It may be that the compiler is smart enough to know how much space is needed on the stack. With some versions of my example that seemed to be the case (I'm sure you noticed the weird if statement -- it's there for a reason).

How about this: I turn ErrNotFound into an OpError with the appropriate Errno, but a generic op?

That seems more technically compatible. Again, there is some (smaller) possibility of people checking Op. It also seems like a fairly unsightly hack to me; it's the kind of thing that real evidence is needed for.

And I'm sorry if I gave you the impression that I was talking about the start-to-end runtime of a benchmark. I am talking about any valid measurement of performance in a practical application. That may be total throughput, wall-clock time, or response time. Any such demonstration would be very helpful in deciding to do something less than fully compatible, though honestly the chances are not great in any case.

I don't like the bloat from additional functions (especially with the prospect of additional functions for querying using interface values -- #11). But I do know people who are seriously using this project in a commercial space, and I don't want to do anything that would adversely affect those users.

Granted, I am using semver for this project. But I was not intending to break the API in the near future.

bmatsuo commented 8 years ago

I ran the benchmarks and the allocs/op do drop. FWIW, they drop in all the Get benchmarks, as they all seem to encounter MDB_NOTFOUND. That is a little suspicious, actually, but I don't remember whether they are supposed to do that and I don't have time for a full investigation today.

Optimizations regarding non-pointer interfaces seem more sophisticated than I believed.

It might be OK to have transactions/cursors reuse an error stored in their structs; then repeated misses might not be so bad. I don't know.
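A rough standalone sketch of that reuse idea, with hypothetical types and fields rather than the library's actual layout:

package main

import "fmt"

type Errno int

func (e Errno) Error() string { return fmt.Sprintf("errno %d", int(e)) }

const NotFound Errno = -30798 // MDB_NOTFOUND

// OpError mimics the shape of an op-plus-errno error wrapper.
type OpError struct {
	Op    string
	Errno error
}

func (e *OpError) Error() string { return e.Op + ": " + e.Errno.Error() }

// Txn holds one preallocated OpError and hands it out on every miss,
// so repeated misses allocate nothing. Callers must not mutate or
// retain the returned error.
type Txn struct {
	errNotFound OpError
}

func NewTxn() *Txn {
	return &Txn{errNotFound: OpError{Op: "mdb_get", Errno: NotFound}}
}

func (txn *Txn) Get(key []byte) ([]byte, error) {
	return nil, &txn.errNotFound
}

func main() {
	txn := NewTxn()
	_, err := txn.Get([]byte("missing"))
	fmt.Println(err)
}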

This whole situation deserves more thought, and we shouldn't proceed too quickly. Please keep thinking about alternative APIs that are idiomatic and make sense, or otherwise a way to mask things that is more fully compatible.

I do appreciate the work you have put in. Thank you.

lmb commented 8 years ago

I reworked the PR to split out the benchmark to make before / after comparisons easier.

Some background on my workload: clients query LMDB through the memcached protocol, and most queries return a negative answer (think whitelisting/blacklisting). I took some real-life traffic and replayed it against an empty LMDB (obviously this is not 100% accurate).

Without this PR: [callgraph image: without-pr]

With this PR: [callgraph image: with-pr]

With this PR and the char* work: [callgraph image: with-pr-and-charstar]

On the synthetic benchmarks:

$ benchcmp baseline.txt new.txt
benchmark                             old ns/op     new ns/op     delta
BenchmarkTxn_Get_missing_raw-4        507           408           -19.53%
BenchmarkCursor_Get_missing_raw-4     546           441           -19.23%

benchmark                             old allocs     new allocs     delta
BenchmarkTxn_Get_missing_raw-4        4              2              -50.00%
BenchmarkCursor_Get_missing_raw-4     4              2              -50.00%

benchmark                             old bytes     new bytes     delta
BenchmarkTxn_Get_missing_raw-4        84            48            -42.86%
BenchmarkCursor_Get_missing_raw-4     84            48            -42.86%

bmatsuo commented 8 years ago

I'm a little confused about why I don't see anything related to TCP in those callgraphs. Is it actually using the network stack? Also, it seems like the largest benefit comes from the char* change. Is that a mischaracterization of your results?

And in general, while these numbers are more like what I was asking for, they don't quite demonstrate the real benefit. Because you are writing a networked server, the meaningful metrics are response time or requests per second (using the real network stack), so that is what I asked for. No client cares how many allocations you made if you return the results fast enough.

Regarding the new version:

As I mentioned earlier, I think this is indeed closer. But the "generic" op bugs me.

Instead of ErrNotFound, you could split it up per op and not export the variables:

var errNotFoundDel = &OpError{"mdb_del", NotFound}
var errNotFoundGet = &OpError{"mdb_get", NotFound}
var errNotFoundCursorGet = &OpError{"mdb_cursor_get", NotFound}

(edit: added missing variable for "mdb_del")

This seems pretty close to compatible.

Are you against using the IsNotFound() helper function? Is there another need to export the error vars?

I guess there would need to be documentation saying that users must not alter OpError values because they will be reused.
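For reference, a standalone sketch of how a helper like IsNotFound() can keep working when the error values are unexported and reused (a hypothetical reimplementation, not lmdb-go's actual source):

package main

import "fmt"

type Errno int

func (e Errno) Error() string { return fmt.Sprintf("errno %d", int(e)) }

const NotFound Errno = -30798 // MDB_NOTFOUND

type OpError struct {
	Op    string
	Errno error
}

func (e *OpError) Error() string { return e.Op + ": " + e.Errno.Error() }

// Unexported, reused error values, one per op, as suggested above.
var errNotFoundGet = &OpError{Op: "mdb_get", Errno: NotFound}

// IsNotFound reports whether err represents MDB_NOTFOUND regardless
// of which op produced it, so callers never touch the vars directly.
func IsNotFound(err error) bool {
	if op, ok := err.(*OpError); ok {
		return op.Errno == NotFound
	}
	return err == NotFound
}

func main() {
	fmt.Println(IsNotFound(errNotFoundGet)) // true
}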

bmatsuo commented 8 years ago

@lmb This is off-topic, but regarding your use case: have you considered inserting a Bloom filter in front of your Get ops? It could reduce your misses by avoiding the database entirely. If the overhead is reasonable, then this change would be even less significant for you.
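A minimal sketch of the idea, with a tiny hand-rolled Bloom filter guarding the lookups (illustrative only; a real deployment would size the bit array and pick the hash count from the expected key volume and acceptable false-positive rate):

package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size Bloom filter probing two bit positions per key.
type bloom struct {
	bits []uint64
	m    uint64 // number of bits
}

func newBloom(m uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m}
}

// positions derives two probe positions from a single 64-bit FNV hash.
func (b *bloom) positions(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	sum := h.Sum64()
	return sum % b.m, (sum >> 33) % b.m
}

func (b *bloom) add(key []byte) {
	p1, p2 := b.positions(key)
	b.bits[p1/64] |= 1 << (p1 % 64)
	b.bits[p2/64] |= 1 << (p2 % 64)
}

// mayContain returns false only if the key was definitely never added.
func (b *bloom) mayContain(key []byte) bool {
	p1, p2 := b.positions(key)
	return b.bits[p1/64]&(1<<(p1%64)) != 0 && b.bits[p2/64]&(1<<(p2%64)) != 0
}

func main() {
	f := newBloom(1 << 16)
	f.add([]byte("present"))
	for _, k := range [][]byte{[]byte("present"), []byte("missing")} {
		if !f.mayContain(k) {
			fmt.Printf("%s: definite miss, skip the database entirely\n", k)
			continue
		}
		fmt.Printf("%s: possible hit, fall through to Txn.Get\n", k)
	}
}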

lmb commented 8 years ago

Sorry for not responding sooner, I was out of the country.

Yes, the char* fix gets rid of the largest number of allocations, in all cases (not just NACKs). TCP does not show up because we are tracking allocations: command lines are read into a bufio.Reader, which pre-allocates and therefore does not appear. So the graphs I posted are indeed from the production software, with TCP overhead, etc.

I'll try to get some better numbers to show you. Maybe you are right and the benefits are small. The problem is that the benefit is hard to quantify directly: cause and effect are separated in time. I'll see whether I can show some improvement with gcvis or similar.

Re the IsNotFound() helper: I'm happy to use it to keep compatibility, but overall I think it is a kludge that stems (again) from wrapping a C API. Comparing err to a global is both simpler and more idiomatic, in my opinion.

Neat idea to use a Bloom filter, but that would require rebuilding the filter on every delete/expiry of a key (which can happen). I'm not sure I want that complexity, especially since it is better implemented in the application, IMO.

bmatsuo commented 8 years ago

I see. And I suppose the benchmarks reuse a limited number of TCP connections, so you don't see allocation overhead there. Is that right?

I did merge the char* change. It is a little hard to tell what the remaining benefit of this change is because of the order in which the changes were applied in your original SVGs (this PR was applied before the char* change).

Measuring the effect (in time) is substantially more difficult than measuring the cause. I'm not immediately sure how you would get the most meaningful/relevant insight out of gcvis. Again, clients of your service don't care specifically about GC; they care how long it takes to get the requested data (which is only indirectly related to GC).

I was thinking you could use something like go-metrics to compute quantiles of response times during a realistic benchmark. If the reduced allocations cause a significant drop in the ~95th-percentile response time, that indicates the additional GC work is truly detrimental to your service's responsiveness (in the worst case, when a GC pause happens). If the change in the percentile is not significant, then response time is likely dominated by other factors, like the network.
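For instance, something like the following, using only the standard library instead of go-metrics; doRequest is a placeholder for a real round trip against the server:

package main

import (
	"fmt"
	"sort"
	"time"
)

// doRequest stands in for one request/response against the real service.
func doRequest() {
	time.Sleep(time.Millisecond) // placeholder work
}

// percentile returns the p-quantile (0 < p <= 1) of sorted durations.
func percentile(sorted []time.Duration, p float64) time.Duration {
	idx := int(p*float64(len(sorted))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	const n = 1000
	samples := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		doRequest()
		samples = append(samples, time.Since(start))
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	// Run once against each build and compare the tails.
	fmt.Println("p50:", percentile(samples, 0.50))
	fmt.Println("p95:", percentile(samples, 0.95))
	fmt.Println("p99:", percentile(samples, 0.99))
}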

Dealing with evictions in a Bloom filter is hard. There are counting Bloom filters, which can handle a limited number of collisions and deletes, but space requirements can balloon in order to deal with that, so it does depend on the application requirements.

lmb commented 8 years ago

I did an end-to-end benchmark, and the numbers did not budge significantly. Good call!