Closed quolpr closed 2 years ago
Ohh, sorry, you mentioned many-to-many. I see.
But when searching for the people that I follow, it will use an index. Then, to get all the posts, you will still need to load all of them into memory, no matter what DB you use 🤔 Or do you mean the case where you also want to LIMIT the returned result?
So you're right that a simple many-to-many relationship query is indexed but it fails once that relationship is faceted.
For example, here's a basic query for a social media follower feed:
CREATE TABLE "users" ("id" integer NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "id_idx" ON "users" ("id");
CREATE TABLE "posts" ("id" integer NOT NULL, "data" text NOT NULL, "user_id" INTEGER NOT NULL, "created_at" INTEGER NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "user_id_idx" ON "posts" ("user_id");
CREATE UNIQUE INDEX "created_at_idx" ON "posts" ("created_at");
CREATE UNIQUE INDEX "user_id_created_at_idx" ON "posts" ("user_id", "created_at");
CREATE TABLE "follows" ("from_id" integer NOT NULL, "to_id" integer NOT NULL, PRIMARY KEY (from_id, to_id));
EXPLAIN QUERY PLAN SELECT posts.id FROM posts JOIN follows ON posts.user_id = follows.to_id WHERE follows.from_id = 1 ORDER BY posts.created_at LIMIT 10;
-- QUERY PLAN
-- |--SEARCH follows USING COVERING INDEX sqlite_autoindex_follows_1 (from_id=?)
-- |--SEARCH posts USING INDEX user_id_idx (user_id=?)
-- `--USE TEMP B-TREE FOR ORDER BY
This query plan will fetch all the follows, then fetch every post from every followed user in turn, and then order them all in memory...
There's no way around this using SQL without manually denormalizing data. (And I think it's one of the main reasons we don't see more competition in social media apps -- this is a challenging problem!)
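As a rough illustration of what "manually denormalizing" could look like for the feed, here is a minimal sketch using Python's built-in sqlite3 module. The `feed` table and the `posts_fan_out` trigger are hypothetical names made up for this example, not part of the schema above, and the sketch assumes follows exist before posts are written (it does not backfill on follow or clean up on unfollow).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
  id INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL,
  created_at INTEGER NOT NULL
);
CREATE TABLE follows (
  from_id INTEGER NOT NULL,
  to_id INTEGER NOT NULL,
  PRIMARY KEY (from_id, to_id)
);
-- Hypothetical denormalized table: one row per (follower, post).
CREATE TABLE feed (
  owner_id INTEGER NOT NULL,
  created_at INTEGER NOT NULL,
  post_id INTEGER NOT NULL,
  PRIMARY KEY (owner_id, created_at, post_id)
);
-- Hypothetical trigger: copy each new post into every follower's feed.
CREATE TRIGGER posts_fan_out AFTER INSERT ON posts
BEGIN
  INSERT INTO feed (owner_id, created_at, post_id)
  SELECT from_id, NEW.created_at, NEW.id FROM follows WHERE to_id = NEW.user_id;
END;
""")

# User 1 follows users 2 and 3; user 4 is not followed.
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
# 30 posts, authors cycling through users 2, 3, 4; created_at equals the post id.
conn.executemany(
    "INSERT INTO posts (id, user_id, created_at) VALUES (?, ?, ?)",
    [(i, 2 + i % 3, i) for i in range(1, 31)],
)

feed_sql = "SELECT post_id FROM feed WHERE owner_id = 1 ORDER BY created_at LIMIT 10"
join_sql = """
SELECT posts.id FROM posts JOIN follows ON posts.user_id = follows.to_id
WHERE follows.from_id = 1 ORDER BY posts.created_at LIMIT 10
"""
feed_ids = [row[0] for row in conn.execute(feed_sql)]
join_ids = [row[0] for row in conn.execute(join_sql)]
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
feed_plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + feed_sql)]
print(feed_ids == join_ids)
print(feed_plan)
```

The feed read becomes a single ordered range read on the primary key, at the cost of extra writes on every post.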
Trying to come up with a more minimal example with users and tags.
CREATE TABLE "users" ("id" integer NOT NULL, "age" integer NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "id_idx" ON "users" ("id");
CREATE UNIQUE INDEX "age_idx" ON "users" ("age");
CREATE TABLE "tags" ("user_id" integer NOT NULL, "tag" text NOT NULL, PRIMARY KEY ("user_id", "tag"));
CREATE UNIQUE INDEX "tag_idx" ON "tags" ("tag", "user_id");
EXPLAIN QUERY PLAN SELECT users.id FROM users JOIN tags ON tags.user_id = users.id WHERE tags.tag = 'engineer' ORDER BY users.age LIMIT 10;
-- QUERY PLAN
-- |--SEARCH tags USING COVERING INDEX tag_idx (tag=?)
-- |--SEARCH users USING INTEGER PRIMARY KEY (rowid=?)
-- `--USE TEMP B-TREE FOR ORDER BY
This demonstrates the point that adding the ORDER BY users.age LIMIT 10 forces you to fetch EVERY user for a given tag and sort them in memory before returning the first 10 results... This results in O(n) kinds of performance, not O(log n).
@ccorcos yep, it looks fair. But I can't get why, if I add a sorting index to age:
CREATE UNIQUE INDEX "age_idx" ON "users" ("age" ASC);
it is still using TEMP B-TREE 🤔 I thought that a sorting index would do the job. Do you know why it doesn't use the sorting index?
And I am curious: will it actually load all the rows into memory, or will it load only the age + rowid fields, sort them by age, and then do a full load of the other records for the top 10 rowids? In that case the actual memory consumption should not be much (but indexes are still better, of course).
Just to be clear: my goal is not to doubt your theory; I want to get a clearer understanding for myself.
I'll demonstrate with a concrete example:
A user's age is going to be random.
id | age |
---|---|
1 | 12 |
2 | 22 |
3 | 4 |
4 | 41 |
... | ... |
99 | 99 |
Every even user is an engineer.
user_id | tag |
---|---|
1 | manager |
2 | engineer |
3 | manager |
4 | engineer |
... | ... |
99 | manager |
tag_idx looks like this:
user_id | tag |
---|---|
2 | engineer |
4 | engineer |
... | ... |
98 | engineer |
age_idx looks like this:
age | id |
---|---|
4 | 3 |
8 | 28 |
12 | 1 |
13 | 37 |
... | ... |
99 | 99 |
So here's how it works:
1. SEARCH tags USING COVERING INDEX tag_idx (tag=?)
   We can get a list of all the engineers quickly using tag_idx → [2, 4, 6, ..., 98]. But I want the 10 youngest engineers, and this list is not sorted by age in any way.
2. SEARCH users USING INTEGER PRIMARY KEY (rowid=?)
   We need to look up every one of those users by primary key to find their age.
3. USE TEMP B-TREE FOR ORDER BY
   Finally, we need to sort this whole list so we can grab the first 10 items.

You're right that SQLite can be efficient about what it fetches, but in terms of high-level algorithmic complexity this is awful: the work grows at least linearly, O(n), in the number of engineers. It's loading EVERY engineer and sorting them by age on every query, just to get the first 10.
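To make the cost of the plan walked through above concrete, here is a small pure-Python sketch that mimics it on the 99-user example. The ages are generated with a fixed seed purely for illustration, and the `(tag, age, id)` index is built by sorting here only to keep the sketch short; a real index would be kept in order on every write.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
# users table: id -> age (arbitrary ages, as in the example above)
users = {user_id: random.randint(1, 99) for user_id in range(1, 100)}
# tag_idx entries for 'engineer': every even user id, in id order
engineer_ids = [user_id for user_id in users if user_id % 2 == 0]

# The plan: collect every engineer, look up each age, sort, then cut to 10.
rows_touched = 0
candidates = []
for user_id in engineer_ids:
    rows_touched += 1  # one primary-key lookup per engineer
    candidates.append((users[user_id], user_id))
candidates.sort()  # stands in for the TEMP B-TREE step
youngest_via_sort = [user_id for _, user_id in candidates[:10]]

# With a hypothetical (tag, age, id) index the entries are already ordered,
# so reading the first 10 entries under the 'engineer' prefix is enough.
tag_age_id_index = sorted(("engineer", users[u], u) for u in engineer_ids)
youngest_via_index = [user_id for _, _, user_id in tag_age_id_index[:10]]

print(rows_touched, youngest_via_sort == youngest_via_index)
# prints: 49 True
```

All 49 engineers are touched to return 10 rows, and that number grows with the table while the ordered read stays at 10.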
What you need is an index like (tag, age, id), but there are two challenges: tag and age are in different tables, and a user may have many tags... You can build this index yourself, but you'll have to manage all insertions and deletions yourself... (you can probably use a SQL trigger, actually)
CREATE TABLE "tag_by_age" (
"user_id" integer NOT NULL,
"tag" text NOT NULL,
"age" integer NOT NULL,
PRIMARY KEY ( "tag", "age", "user_id")
);
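For what it's worth, the trigger idea could be sketched roughly like this with Python's built-in sqlite3 module. The trigger names and the extra `tag_by_age_user_idx` index are assumptions made for this example, the unique index on age from the earlier schema is left out, and deleting a user is not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER NOT NULL);
CREATE TABLE tags (
  user_id INTEGER NOT NULL,
  tag TEXT NOT NULL,
  PRIMARY KEY (user_id, tag)
);
CREATE TABLE tag_by_age (
  user_id INTEGER NOT NULL,
  tag TEXT NOT NULL,
  age INTEGER NOT NULL,
  PRIMARY KEY (tag, age, user_id)
);
-- Hypothetical helper index so the age trigger can find a user's rows.
CREATE INDEX tag_by_age_user_idx ON tag_by_age (user_id);
-- Hypothetical triggers that mirror tags and users.age into tag_by_age.
CREATE TRIGGER tags_ai AFTER INSERT ON tags
BEGIN
  INSERT INTO tag_by_age (user_id, tag, age)
  SELECT NEW.user_id, NEW.tag, age FROM users WHERE id = NEW.user_id;
END;
CREATE TRIGGER tags_ad AFTER DELETE ON tags
BEGIN
  DELETE FROM tag_by_age WHERE tag = OLD.tag AND user_id = OLD.user_id;
END;
CREATE TRIGGER users_age_au AFTER UPDATE OF age ON users
BEGIN
  UPDATE tag_by_age SET age = NEW.age WHERE user_id = NEW.id;
END;
""")

# 99 users with distinct made-up ages; every even user is an engineer.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(i, (i * 37) % 100) for i in range(1, 100)]
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [(i, "engineer" if i % 2 == 0 else "manager") for i in range(1, 100)],
)
conn.execute("UPDATE users SET age = 0 WHERE id = 98")  # exercises the age trigger
conn.execute("DELETE FROM tags WHERE user_id = 2")  # exercises the delete trigger

fast_sql = "SELECT user_id FROM tag_by_age WHERE tag = 'engineer' ORDER BY age LIMIT 10"
join_sql = """
SELECT users.id FROM users JOIN tags ON tags.user_id = users.id
WHERE tags.tag = 'engineer' ORDER BY users.age LIMIT 10
"""
fast_ids = [row[0] for row in conn.execute(fast_sql)]
join_ids = [row[0] for row in conn.execute(join_sql)]
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
fast_plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + fast_sql)]
print(fast_ids == join_ids)
print(fast_plan)
```

Both queries return the same 10 ids, but the one against tag_by_age can read them straight off the primary key in age order.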
In any case, the tuple-database makes all of this much easier:
tx.set([tag, age, id], null)
tx.scan({prefix: ["engineer"], limit: 10})
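To show why a prefix scan over ordered tuples is enough here, below is a toy ordered key store in Python. It is only a stand-in for the idea and is NOT the tuple-database API: `TupleStore`, its `set` and `scan` methods, and user 6 with age 19 are all made up for this sketch (the other ids and ages come from the table above).

```python
import bisect


class TupleStore:
    """Toy ordered key store; not the tuple-database API, just the same idea."""

    def __init__(self):
        self._keys = []  # tuples kept in sorted order at all times

    def set(self, key):
        # Insert the key at its sorted position (a real store would use a B-tree).
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def scan(self, prefix, limit):
        # Binary search to the first key >= prefix, then read forward.
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:start + limit]:
            if key[:len(prefix)] != prefix:
                break  # left the prefix range, nothing more to read
            out.append(key)
        return out


store = TupleStore()
rows = [(1, 12, "manager"), (2, 22, "engineer"), (3, 4, "manager"),
        (4, 41, "engineer"), (6, 19, "engineer")]
for user_id, age, tag in rows:
    store.set((tag, age, user_id))

print(store.scan(("engineer",), limit=2))
# prints: [('engineer', 19, 6), ('engineer', 22, 2)]
```

Because the keys are ordered by (tag, age, id), the youngest engineers are simply the first entries under the "engineer" prefix, so the scan reads only `limit` keys.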
@ccorcos ooops, I see your point! Thank you for the descriptive example. I also played around with indexes, and yeah, I was wrong. I had some misunderstanding of how indexes work.
No problem :)
I wrote this article a while back about filing cabinets as a metaphor for databases. And I think it might help you understand these concepts a bit deeper.
https://ccorcos.github.io/filing-cabinets/
I had always thought of databases as these magical black boxes, but they're actually quite simple when you open them up. No different than managing a bunch of filing cabinets.
I was reading your doc, and noticed this thing:
But if you put an index on the fields that you are joining, then the SQL DB will not load the records from the DB, only the indexes. Am I wrong? 🤔 Or maybe I missed your point.
I made this DB for SQLite:
And then ran EXPLAIN on this query:
So you can see that it will SCAN through the index, which is very fast and doesn't require loading the whole data from the DB. I hope I understood your points correctly 🤔