ctapobep / blog

My personal blog on IT topics in the form of GitHub issues.
6 stars 0 forks source link

Clustered, non-clustered, covering indexes in databases #9

Open ctapobep opened 2 years ago

ctapobep commented 2 years ago

Related article: Should we sort by ID when writing SQL?

Heap files and Indexes

One of the ways of storing table records (aka rows or tuples) is to just create a big file, split it into logical pages and then fill those pages as new data is inserted. So when we need to find all records that satisfy some condition (select * from dogs where age > 10) we need to do a sequential table scan (go over each row one by one). But we don't like this approach as going through all the rows is inefficient. Therefore we create indexes.

Index is a key->value storage which keeps the search keys sorted e.g. in a B-Tree. Since the keys (like age) are sorted, it's easy to find all of those that satisfy the criterion. But what about values: do we store the actual rows in the B-Tree or do we reference some place on a disk where row is located? Well that is precisely the difference between Clustered and Non-Clustered indexes.

Clustered Indexes

In many articles when defining the term Clustered Indexes the wording is very confusing. E.g. in this post on SO:

With a clustered index the rows are stored physically on the disk in the same order as the index.

Which implies that we store data in a heap file, just sort it according to some index. But this isn't true - what's really happening is that we embed our table in the clustered index (or we can say we embed an index in the table). Which automatically will sort table data according to the B-Tree index.

For instance in MySQL InnoDB:

Each InnoDB table has a special index called the clustered index that stores row data.

MSSQL Server docs, albeit confusing as well, also state that clustered index and the table is a single structure:

The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. When a table has a clustered index, the table is called a clustered table. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.

So other synonyms for Clustered Index are: Clustered Table (MS SQL Server), Index-Organized Table or IOT (Oracle).

Non-Clustered Index

There are databases that make Clustered Indexes compulsory (MySQL InnoDB), those that allow to use them if we want (MSSQL Server) and those that don't support them at all (PostgreSQL). Depending on whether a table is clustered or not we can have a difference in behaviour:

  1. If Clustered Index/Clustered Table exists, then all other indexes on that table will reference that key. So DB will have to first look up keys in Non-Clustered index, retrieve the list of clustered keys, then search in Clustered Index for the records.
  2. If Clustered Index isn't used, then the data is stored in a Heap File. This means that all the indexes have to point to a physical location. So the value of such index is: Page Number+Record Within Page (aka rowid in Postgres).

On the one hand option#2 is faster when searching for data. On the other hand option#1 is faster when updating data. When updating an existing row we may have to move it to another physical location (e.g. it grew in size and didn't fit old space or it's a copy-on-write algorithm like in Postgres):

Why use Clustered Indexes?

Apart from the fact that they may speed up updates (as mentioned above), they can also speed up queries. If we SELECT many records at a time, then it's beneficial if those records are located close to each other. This way DB will simply go over the neighboring records and read them from the disk sequentially (it will read a page at a time - and that page may contain many records that we're interested in). As opposed to jumping across different places on disk. This has a huge boost in performance.

So if our ID column has a Clustered Index and we often need objects with IDs close to each other (which isn't a rare case), then we're in a good place. But if our search patterns don't go hand-in-hand with Clustered Index, then we're better off referencing a Heap File.

Clustering in Postgres

Postgres doesn't support Clustered Index, but it still supports clustering. So while we can't store table rows right in the index, we can ask Postgres to re-sort the data in the Heap File. But when new rows are inserted (or old ones are updated/deleted) they won't be inserted according to the sorting, so at some point we'll get back to the unsorted data.

Again, this will be beneficial only if we request rows close to each other according to the cluster. This is probably a very rare feature to use because Postgres will lock the whole table for this operation. So it will be useful only for a mostly read-only tables.

Primary Index

Primary Index is a weird notion as oftentimes it doesn't differ from any other unique secondary indexes. But in databases that support Clustered Indexes, the one that we mark as Primary Index will often end up to be Clustered as well. This isn't guaranteed though - if you already had a Clustered Index, making another index Primary will not change the clustering.

Covering Index for Index-Only Scan

Queries work fastest when then don't have to access table at all. Meaning the index contains all the necessary data to fully execute query. E.g. if email column is indexed, query select 1 from users where email = ? is covered by that index. But what if we want to select some column, say, first_name, how do we go about it? Relational DBs allow to include more columns in the index storage even if we don't search by those fields. Making the query select first_name from users where email =? covered by that index.

The benefits of this is that if we SELECT some data and all the necessary values are already in the index, then we don't have to go to secondary location for the data. The drawbacks are the usual ones - if we update the column that's covered by the index, then the index will have to updated as well.

When indexes don't work

MVCC & Visibility Maps

Indexes don't always work because of MVCC (multi-version concurrency control) - each time we update a record a new version of that record is created (the old one isn't removed right away). So indexes keep references to all versions - including the old ones. To find out whether the record is outdated the query may have to access table pages anyway. Though these things are often optimized (just as an example: PostgreSQL Visibility Maps).

Too many pages to load

Say you're loading 1K records from a 5K-tall table. Does it make sense for the database to use the index? Unlikely. Loading indexes, then finding which records they reference, then go and load most of the table anyway - that'll work much slower than just going through the whole table sequentially.

Here's a less trivial example. Say you want to load 50 records out of 5K tall table. Should we use index? At first it seems like a good idea, but... what if the whole table is split into 80 physical pages, and each of the 50 records is located on a different page? Then we're back to loading most of the table from disk.

Databases can figure these things out by gathering stats. For instance in Postgres if you look at select * from pg_stats where tablename='your_table', you'll see a column correlation (aka Clustering Factor). It keeps track of how much the order of records in the table correlates with their natural order. If the correlation is close to 0, then the probability that the records are on different pages is high. Otherwise they're probably grouped together on the same page. Databases keep such stats to make a better judgement of whether to use the index or not (and how to use it).

Here's a great article on how Postgres uses indexes and why they may not always work: Queries in PostgreSQL: 4. Index scan.