Closed: theSherwood closed this issue 1 year ago
I'll try to give you a thorough explanation here.
Suppose I have a bunch of query subscriptions of the form `{gt: X, lt: Y}` and someone writes `Z` to the database. How do I determine which subscriptions need to be updated?
The simple O(n) solution is just to iterate through each subscription and check the bounds. Coming up with a sublinear solution, O(log n), is a little bit trickier. At first glance, you might think you can use a standard btree. In tuple-database, we can start with an index ordered by `[X, Y]`. But you might realize that this doesn't exactly save you anything. If `Z = 100`, then we need to consider both of these valid solutions: `[1, 100000]` and `[99, 101]`. Meanwhile, we're not going to skip over ranges like `[1, 10]` which are out of range. Yes, we can skip all ranges where `X > 100` so we do cut out some solutions, but this is still O(n).
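To make the O(n) baseline concrete, here's a minimal sketch (the `Subscription` type and `matching` helper are hypothetical, not tuple-database APIs):

```typescript
// A subscription over the open interval (gt, lt).
type Subscription = { gt: number; lt: number };

// O(n) baseline: scan every subscription and check the bounds.
function matching(subs: Subscription[], z: number): Subscription[] {
  return subs.filter((s) => z > s.gt && z < s.lt);
}

const subs: Subscription[] = [
  { gt: 1, lt: 100000 },
  { gt: 99, lt: 101 },
  { gt: 1, lt: 10 },
];

// Z = 100 matches the first two subscriptions but not (1, 10).
console.log(matching(subs, 100).length); // 2
```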
This kind of problem is important for reactive updates, but it's also a common issue with any kind of application that deals with calendars. When you want to render a week view, you actually want to find all overlapping ranges -- this is a slightly harder problem because X is a range rather than just a single value.
Long story short, a btree simply doesn't work for this. There are weird solutions like z-curves for hacking around it, but a proper solution involves a range tree.
I have no immediate plans to implement this until I pick up a project that needs this -- which is likely to happen at some point. If you wanted to write it, I'd be happy to review your code and accept your contribution to tuple-database 😄
My plan is more-or-less to consider the storage layer as key-value storage that isn't necessarily ordered. Then I'll build a durable tree on top of that abstraction. I've done some similar things before: https://github.com/ccorcos/js-avl-tree That project is different though because the tree is persistent as well (an immutable data structure), inspired by Datalog. But that seems a bit overkill and the performance overhead doesn't seem worth it to me.
In terms of abstractions, the API will definitely be different. I'm sure I would start with a basic implementation with an entirely separate API. But it would be pretty amazing if you could merge those at some point so you can have an index that's an amalgamation of both btrees and rtrees. I'm not exactly sure that's possible though...
Just some ideas for an api:
```js
db.overlaps({0: {gt: X, lte: Y}, 1: {gt: A, lt: Z}})
db.intersects(...)
tx.set({0: X, 1: Y}) // a 2d point
tx.set({0: [X, Y], 1: [A, B]}) // a 2d range
```
So the dimensions are numbered... I'm sure this has some drawbacks with object allocations though. People often just encode this stuff into an array...
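As a very rough sketch of what those signatures could look like (everything here is hypothetical and in-memory, not an existing tuple-database API):

```typescript
// Hypothetical per-dimension bound, mirroring tuple-database scan bounds.
type Bound = { gt?: number; gte?: number; lt?: number; lte?: number };
// Hypothetical query and value shapes: one entry per numbered dimension.
type RangeQuery = { [dim: number]: Bound };
type Box = { [dim: number]: [number, number] };

// A box matches when, in every queried dimension, its [min, max]
// interval overlaps the queried bound.
function overlaps(boxes: Box[], query: RangeQuery): Box[] {
  return boxes.filter((box) =>
    Object.entries(query).every(([dim, b]) => {
      const [min, max] = box[Number(dim)];
      if (b.gt !== undefined && max <= b.gt) return false;
      if (b.gte !== undefined && max < b.gte) return false;
      if (b.lt !== undefined && min >= b.lt) return false;
      if (b.lte !== undefined && min > b.lte) return false;
      return true;
    })
  );
}

// e.g. an overlaps({0: {gt: X, lte: Y}}) query over 2D ranges:
const boxes: Box[] = [
  { 0: [1, 5], 1: [1, 5] },
  { 0: [10, 20], 1: [10, 20] },
];
console.log(overlaps(boxes, { 0: { gt: 0, lte: 6 } }).length); // 1
```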
I'm just spitballing at this point. It's hard for me to think of this stuff without having a direct problem to work on...
I hope that helps!
Thanks for that write-up. I'm still trying to wrap my head around it. I've been working on a project that is using tuple-db and I'm using an interval tree for linear ranges, but the interval tree isn't durable, and there's not yet support for higher dimensions. So this problem is definitely of interest to me.
Do you have any intuitions for the performance costs of queries on a durable rtree vs a btree? It seems like an rtree traversal would involve a lot of reads from disk. But I'm not really familiar with databases in general. So I don't have great intuitions for this sort of thing.
The performance costs for these algorithms are well-studied. Are you familiar with big-O notation? For k dimensions and n rows, range trees are O(log(n)^k). Meanwhile for a binary tree, it's just O(log(n)). There's also space performance and write performance as well, but I'm just reporting the query time performance.
Something else to know and read about is Generalized Search Trees: https://gist.cs.berkeley.edu This is what Postgres uses for implementing rtrees and stuff under the hood.
If you're working with very high dimensional data (like AI embedding vectors), then r-trees aren't going to be great because of that dimensional exponent -- most people are using Hierarchical Navigable Small World for vector databases, I believe.
Can you share more about what kind of thing you're building?
> The performance costs for these algorithms are well-studied. Are you familiar with big-O notation? For k dimensions and n rows, range trees are O(log(n)^k). Meanwhile for a binary tree, it's just O(log(n)). There's also space performance and write performance as well, but I'm just reporting the query time performance.
I'm familiar with big-O notation. It's specifically the performance of repeated disk accesses that I don't have intuitions for. Most of the durable data I deal with is in databases, but I don't know db internals. So I don't have a good sense for how many disk accesses would be required for a query on a durable rtree. Would each comparison that is made on an rtree require a disk access to get the node? I know that there are b+trees in which you can pack more information per node so as to be able to do more comparisons with fewer reads (as compared to btrees). Is there a similar trick for rtrees?
> Something else to know and read about is Generalized Search Trees: https://gist.cs.berkeley.edu/ This is what Postgres uses for implementing rtrees and stuff under the hood.
Thanks for the Generalized Search Tree link!
> If you're working with very high dimensional data (like AI embedding vectors), then r-trees aren't going to be great because of that dimensional exponent -- most people are using Hierarchical Navigable Small World for vector databases, I believe.
That makes sense. I have never implemented an rtree, but if it has to do a btree-like search per dimension, I can guess that would break down pretty quickly at anything but low dimensions.
> Can you share more about what kind of thing you're building?
It's another Notion-like thing but with a big focus on third-party extensibility. The editor supports overlapping text annotations, so fast range searches are a pretty big concern. At the moment, the ranges I am dealing with are linear (text), which I support with an in-memory interval tree. But I want to generalize the range search interface to tables, charts, canvas, etc., which are higher-dimensional. I also want to be able to store that positional information in the db (rather than in-memory) so that queries on that stuff can work on the backend or the frontend.
> It's specifically the performance of repeated disk accesses that I don't have intuitions for.
The big-O performance is essentially the same: big-O time = big-O number of disk accesses. Now, that's big-O, which leaves a lot of room for improvement on the margin. For example, with b+trees, you can have larger fanout (not a binary but an n-ary tree), and you can have large leaf node sizes that require linear scans. This is all a trade-off between retrieval speed and in-memory computation speed.
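To put rough numbers on that fanout trade-off (illustrative only), the height of an n-ary tree approximates the worst-case node reads per lookup:

```typescript
// Tree height ≈ worst-case node reads per lookup, ignoring caching.
function height(keys: number, fanout: number): number {
  return Math.ceil(Math.log(keys) / Math.log(fanout));
}

console.log(height(1_000_000_000, 2));   // binary tree: 30 reads
console.log(height(1_000_000_000, 256)); // b+tree-style fanout: 4 reads
```

With a fanout of 256, a billion keys fit in a four-level tree, which is why b+trees accept a linear scan inside each node in exchange for far fewer disk accesses.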
Yes, you can do this exact same kind of thing with r-trees. However, something to consider is that most computers now have solid state drives. Disk drives, back in the day, could read sequential memory much faster than two values that are not next to each other. Solid state drives, however, are both super fast and can read two distant pieces of memory just as fast as two adjacent ones. So that eliminates one of the big gains. There's still operating system page sizes to consider if you're storing things in files... This is where my knowledge gets a bit stretched -- I didn't go to school for computer science so I never took an operating systems class or a databases class haha. But I believe SQLite, for example, bypasses all of that and just mallocs a big memory block and stores the database in a single binary file that it mutates directly in memory. It beats me how that all works though!
All I know is that leveldb/rocksdb are super fast, much faster than SQLite even. So if you want to create your own database kinds of things, like I am doing, you should feel confident that so long as you have good algorithmic performance (aka sub-linear big-O time), then you can rely on someone else to write the low-level storage piece and deal with all those details. And later on, if it's important, you can tweak the branching factor pretty easily and experimentally figure out what branching factor is most performant for your setup.
> That makes sense. I have never implemented an rtree, but if it has to do a btree-like search per dimension, I can guess that would break down pretty quickly at anything but low dimensions.
An r-tree is a natural extension of an interval tree. If you read more about GiST (and use ChatGPT to help you make sense of it ;), then you start to recognize the generality of all these things and how they relate. Basically, a node has a value and some children, and when a node gets too big, you have to split it. So for a btree, you keep track of a value; if it's n-ary, then when a node gets too big, you split it into two nodes which take separate values, etc. For an interval tree, you push all the actual values to the leaf nodes, and the values of the inner nodes simply specify the range of values in their children. To split, you can take half of the values ordered by start of range... For r-trees, it's the same thing but in 2 dimensions: each inner node keeps track of a rectangle that contains all of its children's values...
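To make that node-and-split description concrete, here's a minimal sketch in TypeScript (the shapes and names are illustrative, not tuple-database code):

```typescript
// A 2D axis-aligned rectangle; for an interval tree this would be 1D.
type Rect = { minX: number; minY: number; maxX: number; maxY: number };

// Leaves hold the actual values; inner nodes only hold the bounding
// rectangle that contains all of their children's values.
type RNode =
  | { leaf: true; rects: Rect[] }
  | { leaf: false; children: { bounds: Rect; node: RNode }[] };

// The minimum bounding rectangle covering a set of rects.
function mbr(rects: Rect[]): Rect {
  return {
    minX: Math.min(...rects.map((r) => r.minX)),
    minY: Math.min(...rects.map((r) => r.minY)),
    maxX: Math.max(...rects.map((r) => r.maxX)),
    maxY: Math.max(...rects.map((r) => r.maxY)),
  };
}

// Naive overflow split: take half of the values ordered by start of range.
function splitLeaf(rects: Rect[]): [Rect[], Rect[]] {
  const sorted = [...rects].sort((a, b) => a.minX - b.minX);
  const mid = Math.ceil(sorted.length / 2);
  return [sorted.slice(0, mid), sorted.slice(mid)];
}
```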
For high dimensions, I think the problem is basically that the space is very sparse. Based on the r-tree algorithm, each dimension adds a log(n) factor because you need to split each dimension in half. You can exploit the sparsity of data in higher dimensions, though, to get better performance. This might be a good analogy for 2D -- maybe you have a bunch of locations on a map, but all the locations are in 12 different cities. Now suppose that some of those cities share a longitude or a latitude, so that when your r-tree splits a node, each child node contains values from both cities and thus covers a large range. However, if you can split nodes more intelligently so that each node only contains values within one city, then the bounding range of each node is much narrower and thus more performant to query...
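One way to see that city analogy in code is to score candidate splits by the total area of the resulting bounding boxes (a sketch only; real r-tree variants use more elaborate heuristics):

```typescript
// Score candidate node splits by covering area, assuming 2D
// axis-aligned rectangles (the Rect shape here is illustrative).
type Rect = { minX: number; minY: number; maxX: number; maxY: number };

const mbr = (rs: Rect[]): Rect => ({
  minX: Math.min(...rs.map((r) => r.minX)),
  minY: Math.min(...rs.map((r) => r.minY)),
  maxX: Math.max(...rs.map((r) => r.maxX)),
  maxY: Math.max(...rs.map((r) => r.maxY)),
});

const area = (r: Rect): number => (r.maxX - r.minX) * (r.maxY - r.minY);

// Smaller total covering area ≈ tighter child nodes ≈ fewer subtrees
// a query has to descend into.
const splitCost = (a: Rect[], b: Rect[]): number => area(mbr(a)) + area(mbr(b));

// Two "cities": a cluster near the origin and a cluster near (100, 100).
const cityA: Rect[] = [
  { minX: 0, minY: 0, maxX: 1, maxY: 1 },
  { minX: 1, minY: 1, maxX: 2, maxY: 2 },
];
const cityB: Rect[] = [
  { minX: 100, minY: 100, maxX: 101, maxY: 101 },
  { minX: 101, minY: 101, maxX: 102, maxY: 102 },
];

// Splitting along city lines is far cheaper than mixing the clusters.
const good = splitCost(cityA, cityB);
const bad = splitCost([cityA[0], cityB[0]], [cityA[1], cityB[1]]);
console.log(good < bad); // true
```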
That was pretty fun to think about actually -- I had some idea about that, but I hadn't actually thought rigorously about it. Thanks for asking 😄
> It's another Notion-like thing but with a big focus on third-party extensibility.
You have a demo anywhere? Kind of hard for me to really visualize... Notion has overlapping annotations, doesn't it?
Thanks for that background info. I have a much better sense for the context now. When I try to extend my range support to 2D types, I might have a crack at getting an rtree PR put together.
> You have a demo anywhere?
Not yet, unfortunately. But when I do, I'll let you know!
> Notion has overlapping annotations, doesn't it?
Sort of. It has overlapping basic formatting annotations (bold, italic, etc.). But the overlapping of the comment annotations appears to be broken, particularly when the annotations cross blocks. I don't use Notion daily, so I don't know how long that has been the case. But there's nothing like overlapping links or overlapping user-defined range types that I can tell. But I could certainly be missing something.
Notion definitely has overlapping comment ranges, even across blocks. I know because I built it hehe 😆
Nice. The cross-block editing is pretty slick, too.
@ccorcos If I were to make a PR for a basic rtree implementation, would you be receptive to that or would you prefer not to bother unless it's built on a GiST?
Go for it! No guarantees I accept it but I will consider it and give you feedback!
I'm interested in what benefits an rtree-based index might bring as you spoke of in the readme:
Is adding an rtree still on the roadmap? I think I've got some use-cases for that but I've also got questions as to how an rtree would work:
Thanks for all your work on tuple-db!