Maybe some misunderstanding of how SQL DB indexes work? 🤔

quolpr commented 2 years ago

I was reading your doc, and noticed this thing:

Yet SQL does not provide a way of indexing queries that involve a JOIN. Social apps that want to query "what are all the posts of all the people I follow ordered in time" must design their own systems because SQL cannot index that query (SQL will need to load all of the posts of all the people you follow over all time, and sort them in memory!).

But, if you will put index on those fields that you are joining, then SQL DB will not load the records to the db. Only indexes. Am I wrong? 🤔 Or maybe I missed your point.

I made such DB for SQLite:

CREATE TABLE "users" ("id" integer NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "id_idx" ON "users" ("id");

CREATE TABLE "comments" ("id" integer NOT NULL,"data" text NOT NULL, "user_id" INTEGER NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "com_id_idx" ON "comments" ("id");
CREATE INDEX "user_id_idx" ON "comments" ("user_id");

And then explain of this query:

EXPLAIN QUERY PLAN SELECT comments.id, users.id FROM comments JOIN users ON users.id = comments.user_id;

-->

SCAN comments USING COVERING INDEX user_id_idx;
SEARCH users USING INTEGER PRIMARY KEY (rowid=?);

So you can see, that it will be making SCAN through the index, which is very fast and doesn't require loading the whole DATA from db. I hope I understood your points correctly 🤔

quolpr commented 2 years ago

Ohh, sorry, you mentioned many-to-many. I see.

But for searching the people that I follow, it will use index. Then, to get all posts, you still will need to load all posts into memory. No matter what DB you use 🤔 Or do you mean the case, when you want also LIMIT the returning result?

Also some offtopic

I am working on https://github.com/trong-orm/trong-orm , and I already added a lof native integrations with other SQLite DBs. So if you will need help to integrate more storages to support platforms wider, I will glad to 🙂

ccorcos commented 2 years ago

So you're right that a simple many-to-many relationship query is indexed but it fails once that relationship is faceted.

For example, here's a basic query for a social media follower feed:

CREATE TABLE "users" ("id" integer NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "id_idx" ON "users" ("id");

CREATE TABLE "posts" ("id" integer NOT NULL, "data" text NOT NULL, "user_id" INTEGER NOT NULL, "created_at" INTEGER NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "user_id_idx" ON "posts" ("user_id");
CREATE UNIQUE INDEX "created_at_idx" ON "posts" ("created_at");
CREATE UNIQUE INDEX "user_id_created_at_idx" ON "posts" ("user_id", "created_at");

CREATE TABLE "follows" ("from_id" integer NOT NULL, "to_id" integer NOT NULL, PRIMARY KEY (from_id, to_id));

EXPLAIN QUERY PLAN SELECT posts.id FROM posts JOIN follows ON posts.user_id = follows.to_id WHERE follows.from_id = 1 ORDER BY posts.created_at LIMIT 10;

-- QUERY PLAN
-- |--SEARCH follows USING COVERING INDEX sqlite_autoindex_follows_1 (from_id=?)
-- |--SEARCH posts USING INDEX user_id_idx (user_id=?)
-- `--USE TEMP B-TREE FOR ORDER BY

This query plan will fetch all the follows, recursively fetch every post from every user, and then order them in memory...

There's no way around this using SQL without manually denormalizing data. (And I think it's one of the main reasons we don't see more competition in social media apps -- this is a challenging problem!)

ccorcos commented 2 years ago

Trying to come up with a more minimal example with users and tags.


CREATE TABLE "users" ("id" integer NOT NULL, "age" integer NOT NULL, PRIMARY KEY (id));
CREATE UNIQUE INDEX "id_idx" ON "users" ("id");
CREATE UNIQUE INDEX "age_idx" ON "users" ("age");

CREATE TABLE "tags" ("user_id" integer NOT NULL, "tag" text NOT NULL, PRIMARY KEY ("user_id", "tag"));
CREATE UNIQUE INDEX "tag_idx" ON "tags" ("tag", "user_id");

EXPLAIN QUERY PLAN SELECT users.id FROM users JOIN tags ON tags.user_id = users.id WHERE tags.tag = 'engineer' ORDER BY users.age LIMIT 10;

-- QUERY PLAN
-- |--SEARCH tags USING COVERING INDEX tag_idx (tag=?)
-- |--SEARCH users USING INTEGER PRIMARY KEY (rowid=?)
-- `--USE TEMP B-TREE FOR ORDER BY

This demonstrates the point that adding the ORDER BY user.age LIMIT 10 forces you to fetch EVERY user for a given tag and sort it in memory before returning the first 10 results... This is results in O(n) kinds of performance, not O(log n).

quolpr commented 2 years ago

@ccorcos yep, it looks fair. But I can't get why if I will add sorting index to age:

CREATE UNIQUE INDEX "age_idx" ON "users" ("age" ASC);

It is still using TEMP B-TREE 🤔 I thought that sorting index will do the job. Do you know why it doesn't use sorting index?

And I am curious, will it actually load all the rows into memory, or it will load only age + rowId fields, sort them by age, and then make full load of other records by top 10 rowid. So actual memory consumption should not be match (but indexes are still better, of course).

Just to be clear: my goal is not to doubt your theory, I want to make more clear understanding for myself 🙂

ccorcos commented 2 years ago

I'll demonstrate with a concrete example:

A user's age is going to be random.

id	age
1	12
2	22
3	4
4	41
...	...
99	99

Every even user is an engineer.

user_id	tag
1	manager
2	engineer
3	manager
4	engineer
...	...
99	manager

tag_idx looks like this:

user_id	tag
2	engineer
4	engineer
...	...
98	engineer

age_idx looks like this:

age	id
3	4
28	8
1	12
37	13
...	...
99	99

So here's how it works:

SEARCH tags USING COVERING INDEX tag_idx (tag=?) We can get a list of all the engineers quickly using tag_idx → [2, 4, 6, ..., 98]. But now I want the 10 youngest engineers but this list is not in any kind of sorted order regarding age.
SEARCH users USING INTEGER PRIMARY KEY (rowid=?) We need to recursively fetch the age of every user to determine their age.
USE TEMP B-TREE FOR ORDER BY Finally, we need to sort this whole list so we can grab the first 10 items.

You're right about how SQLite can be efficient with what it fetches, but from a high-level algorithmic complexity, this performance is awful because its performance is O(n) in the number of engineers. It's loading EVERY engineer and sorting them by age on every query, just to get the first 10.

What you need is an index like (tag, age, id), but there are two challenges: tag and age are in different tables, and a user may have many tags... You can build this index yourself but you'll have to manage all insertions and deletions yourself... (you can probably use a SQL trigger actually)

CREATE TABLE "tag_by_age" (
  "user_id" integer NOT NULL, 
  "tag" text NOT NULL, 
  "age" integer NOT NULL,
  PRIMARY KEY ( "tag", "age", "user_id")
);

In any case, the tuple-database makes all of this much easier:

Write to the index: tx.set([tag, age, id], null)
Read the index: tx.scan({prefix: ["engineer"], limit: 10})

quolpr commented 2 years ago

@ccorcos ooops, I see your point! Thank you for the descriptive example. I also played around with indexes, and yeah, I was wrong. I have some misunderstanding how indexes works.

ccorcos commented 2 years ago

No problem :)

I wrote this article a while back about filing cabinets as a metaphor for databases. And I think it might help you understand these concepts a bit deeper.

https://ccorcos.github.io/filing-cabinets/

I has always thought of databases as these magical black boxes, but they’re actually quite simple when you open them up. No different than managing a bunch of filing cabinets. 😄

On Sun, Jul 31, 2022 at 11:32 Sergey @.***> wrote:

@ccorcos https://github.com/ccorcos ooops, I see your point! Thank you for the descriptive example. I also played around with indexes, and yeah, I was wrong. I have some misunderstanding how indexes works.

— Reply to this email directly, view it on GitHub https://github.com/ccorcos/tuple-database/issues/11#issuecomment-1200476473, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANWDX5J3DFEAXJDM3YYNL3VW3BE5ANCNFSM533GMQXA . You are receiving this because you were mentioned.Message ID: @.***>

ccorcos / tuple-database

Maybe some misunderstanding of how SQL DB indexes work? 🤔 #11