ccorcos / tuple-database

406 stars 19 forks source link

avoiding key encoding/decoding costs #30

Open theSherwood opened 7 months ago

theSherwood commented 7 months ago

Seems like a non-negligible amount of time is spent in encoding and decoding the keys. Sometimes, in the case of a scan, I don't want the keys at all. Other times, it would be handy to get the encoded key string and do manipulations directly on that (and run queries with it) without creating a bunch of short-lived arrays. Is this something you'd be willing to support? I could make a PR at some point (post-rewrite), if you're not opposed.

ccorcos commented 7 months ago

Yeah you're right about the performance hit. This was a good improvement a little while back: https://github.com/ccorcos/tuple-database/pull/24

Are there any particular places you have in mind?

Encoding and decoding has to happen somewhere. Its really just a matter of how you want to do it. I plan on moving this library over to use LexiCodec so that you have more control over it. Then you can simply work with strings if you want and only encode/decode at your pleasure.

theSherwood commented 7 months ago

Are there any particular places you have in mind?

Do you mean instances where I don't want to encode/decode? If so, I've got just 3 cases in mind:

  1. scans where I don't read the keys
    • This is probably the most costly use-case
  2. long tuples in which I only need to change the last segment
    • It would be handy to just slice to the last segment and concatenate on a new segment instead of doing the full decode/encode
  3. I also have use-cases where I'd like to store a reference to another tuple but it's just the encoded string
    • I'd like to make these kinds of dereference chains maximally fast

Then you can simply work with strings if you want and only encode/decode at your pleasure.

That sounds ideal. If I could only encode/decode when I need it, that would be awesome. Do you have any ideas about what the API would look like to support that sort of thing?

ccorcos commented 7 months ago

You can check out what I'm doing here: https://github.com/ccorcos/database-experiments/blob/35c21accc274b948248b750d7d2ea8bffeda9967/src/tuple-okv.ts

But from what it sounds like, you should just use lexicodec on top of an ordered key value store like LevelDb. Then you have all the control that you want.

On a separate note though, I would caution you against premature optimization. Is this actually a performance bottleneck for you? Are you worried about the CPU cost? O(n) decode where n is the length of each key is pretty efficient... Or are you worred about the Memory cost, allocating arrays and objects instead of dealing with raw strings... In my experience, if either of those are concerns for you, you're probably either (1) focusing on something that isn't actually a problem or (2) shouldn't be using a language with a garbage collector like JavaScript. There's a big difference between theoretical performance and actual performance... I would urge you to come up with an example that you feel like is too slow so we have something concrete to iterate a solution on.

theSherwood commented 7 months ago

You can check out what I'm doing here: https://github.com/ccorcos/database-experiments/blob/35c21accc274b948248b750d7d2ea8bffeda9967/src/tuple-okv.ts

Thanks. This is a great reference

(1) focusing on something that isn't actually a problem

I haven't done any very rigorous performance tests, yet. But from the times I ran the chrome profiler, the encode/decode utilities took up a larger percentage of the overall time than I had anticipated.

(2) shouldn't be using a language with a garbage collector like JavaScript

I'd love to write the core of my app for compilation to webassembly, but I want it to be usable from javascript. And I think the overhead of passing values back and forth across the interface might be more costly than just running the core in javascript. I don't have enough experience with wasm to know for sure and don't have any idea of a fast path to getting more insight that isn't just building the app twice and running benchmarks. So I'm playing it safe with javascript at the moment.

I would urge you to come up with an example that you feel like is too slow so we have something concrete to iterate a solution on.

That's fair enough. Feel free to close this issue until I find something more concrete

ccorcos commented 7 months ago

Encoding / decoding across process boundaries is very commonly the most expensive thing in any application... If you use the in-memory tuple database, there's no encoding/decoding going on. But serializing to and from a database is always going to have that cost.