Postgres vs MySQL: differences and challenges of the architecture #24

Both Postgres and MySQL are relational databases. Both are transactional. Both are designed for OLTP workloads. So they must be similar inside, right? Not at all! In fact they are so different that it's hard to believe they solve similar problems.

In this article we'll go over how they store and look up rows. We'll first discuss the general approaches used in this kind of database, and then we'll dive into which approaches these two databases chose and what problems they now face because of it.

NB: MySQL's storage engine is actually pluggable - it's possible to choose a different one with different properties. We'll be discussing the most popular OLTP storage engine - InnoDB.

# How tables and indexes are structured inside

Conventional databases store tables in 2 possible forms: either a Heap-organized table or an Index-organized table. Each has its own pros and cons.

## Heap-organized table

A Heap is just a file split into logical pages (let's say 8kB per page), and records are added to those pages as we INSERT into the table. Since multiple records can reside on the same page, the full address of any record becomes `(page_no, record_no)`, where `record_no` is the record number *within the page*.

There's no particular sorting to those records. While originally we fill the table sequentially and the order coincides with the order of insertion, at some point the old (deleted) records can be removed. This leaves the pages half-full, which makes them ready to accept new records.

Finding some specific row in a Heap would be very slow - we'd have to inspect every record to see whether it matches our criteria. So real-world databases need *indexes* too. These are separate data structures, residing in separate (logical or physical) files. Indexes are *sorted* by some *key* (by a column in a simple case), and finding something in a sorted structure is fast. Very fast. As long as we're searching by that key. So for a query like `select * from users where name = 'A'`, if we had an index that uses `name` as a key, finding that record is lightning fast.

To borrow analogies from different programming languages:

* An unsorted array or list of objects would represent a Heap-organized table
* A key-value structure (different languages use different nomenclature: `Map`, `Dict`/`Dictionary`, `Associative Array`) would represent an index

Of course, we often need to search by different keys - e.g. our app may need to search users by name in one use case and by their ID in a different use case. So we have to have 2 separate indexes (2 different Maps/Dicts), each sorted by its own key (column).

Maps/Dicts have a key, but they also have a value. They need to reference the actual tuples that reside in the table. If you remember, every record in a Heap has its address in the form `(page_no, record_no)`. So indexes can reference the tuples using this address. When searching for a user by name, the database will first look the name up in the respective index, find the tuple's physical location - and only then will it be able to return the full information about the user.

Postgres uses Heap-organized tables. Some other databases (Oracle, MSSQL) are able to use Heaps too, but it's not the default.

## Index-organized table aka Clustered Index

The alternative is to keep the records sorted in the first place. Basically, our table file becomes an index, but that index also contains the actual values. We can have only one such index per table. These indexes are special. We call them Clustered Indexes, but probably a better name would be Cluster*ing* Indexes - because they sort the data and basically keep the rows with similar keys together (the rows are clustered together).

Which of the columns should become the key in such an index? Usually it defaults to the Primary Key of our table. Of course, the column marked as the Primary Key is special - it's basically a *unique* index, and the values can never be null.

What if we create other indexes? Those are called Secondary Indexes. And instead of referencing a physical location, they keep the Primary Keys of the records. So if we're looking up a user by name, the database has to first go to the Secondary Index, find the Primary Keys of the matching records, and then make another lookup - this time in the Clustered Index.
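To make the two layouts concrete, here's a toy sketch in Python of the Map/Dict analogy from above. All names, values and page contents are made up for illustration - real databases keep pages on disk inside B-trees, not in Python lists and dicts.

```python
# Toy models of the two table layouts. Purely illustrative.

# --- Heap-organized table (the Postgres way) ---
heap_pages = [
    [{"id": 10, "name": "Bob"},   {"id": 7,  "name": "Alice"}],  # page 0
    [{"id": 3,  "name": "Carol"}, {"id": 42, "name": "Alice"}],  # page 1
]
# A secondary index maps a key to physical addresses (page_no, record_no).
name_idx_heap = {"Alice": [(0, 1), (1, 1)], "Bob": [(0, 0)], "Carol": [(1, 0)]}

def heap_lookup(name):
    # index lookup, then fetch each tuple straight from the heap by its address
    return [heap_pages[p][r] for (p, r) in name_idx_heap.get(name, [])]

# --- Index-organized table (the InnoDB way) ---
# The clustered index keeps the whole rows, ordered by the primary key.
clustered = {3: {"name": "Carol"}, 7: {"name": "Alice"},
             10: {"name": "Bob"}, 42: {"name": "Alice"}}
# A secondary index maps a key to primary keys, not to physical addresses.
name_idx_clustered = {"Alice": [7, 42], "Bob": [10], "Carol": [3]}

def clustered_lookup(name):
    # two lookups: secondary index -> primary keys -> clustered index
    return [{"id": pk, **clustered[pk]} for pk in name_idx_clustered.get(name, [])]

print(heap_lookup("Alice"))
print(clustered_lookup("Alice"))
```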
BTW, in the case when we were using *Heaps* there was no distinction between types of indexes. All of them were Secondary Indexes. Such storage doesn't give any special meaning to the Primary Key - it's just yet another Secondary Index, although it's unique and non-null without saying so explicitly.

MySQL uses Clustered Indexes. Other databases (Oracle, MSSQL) operate in this mode by default, but it's not the only choice for them.

# Multiversioning (MVCC) implementations

Databases need to handle multiple connections and transactions simultaneously. So what happens if one tx modifies a row while another one reads it? There are 2 choices:

1. The writing tx (Writer) locks the record, and the reading tx (Reader) must wait. Once the Writer ends, the Reader can proceed reading the new value. While this approach is supported by all the conventional SQL databases with `select ... for share` (we'll talk about it in Part II), it's not the default because it causes a lot of locks and thus reduces parallelism.
2. Instead of just overwriting the record, the Writer can create a **new version** of it. So Readers can read the old versions until the Writer commits. Since this allows multiple versions of the record to coexist, this is called Multiversioning or Multi-version concurrency control (MVCC). And this is the default in most databases.

The fact that most SQL databases implement MVCC doesn't mean they do it in the same way or even similarly. On the contrary! There are 2 general approaches:

1. *Copy-on-write* (COW) - during UPDATE create a *full copy* of the row with the new values. The Readers see the old version. This is the Postgres way.
2. *Undo storage* - we write the new value right into the table, but we copy the old values into a special storage outside the table. The parallel Readers can reconstruct the version that they want by using the current value in the table plus the data from the Undo storage. Most (if not all) other databases use this approach.

But how do Readers know which version of the tuple to use? Simple: each version must store which tx created it. While the algorithm and the exact fields may differ, the general idea is this:

1. Assign each tx a monotonically increasing ID (let's call it TX_ID). It could be an integer that increments with each new tx.
2. When a Writer creates a tuple, it writes its TX_ID into the header of that tuple. If the Writer *updates* an existing tuple, then the old version either stays in the heap (*COW*) marked as `deleted_by=TX_ID`, or the old attributes are written into the *Undo* storage with their previous TX_ID.
3. When a tx is started, the database captures which other transactions are currently in-progress/aborted/committed. This info is stored within every tx, and it's different for every tx. This state of transactions is called a Snapshot (PG) or a Read View (MySQL).
4. When running into a tuple, the Reader knows that it must not see versions created by in-progress or aborted transactions - so it chooses the version that it can actually see. With *COW* we have to iterate over all the old/new versions and find the one we're interested in. With the *Undo* storage we first find the latest version in the table, then go through the Undo to find which attributes changed, and reconstruct the version we need.
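Here's a minimal sketch of that visibility check, assuming a simplified tuple header with `created_by`/`deleted_by` fields and a snapshot that only records which TX_IDs were in progress or aborted when the Reader started. The real Postgres/InnoDB visibility rules have many more cases (own changes, subtransactions, hint bits, etc.), so treat this as the general idea only; all names and numbers are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TupleVersion:
    created_by: int            # TX_ID that inserted this version
    deleted_by: Optional[int]  # TX_ID that deleted/overwrote it, if any
    data: dict

@dataclass
class Snapshot:
    in_progress: set  # TX_IDs that were still running when our tx started
    aborted: set      # TX_IDs known to have rolled back

    def sees(self, tx_id: Optional[int]) -> bool:
        """A change is visible if it was made by a committed tx
        that finished before our snapshot was taken."""
        if tx_id is None:
            return False
        return tx_id not in self.in_progress and tx_id not in self.aborted

def is_visible(version: TupleVersion, snapshot: Snapshot) -> bool:
    # The insert must be visible to us, and the delete (if any) must NOT be.
    return snapshot.sees(version.created_by) and not snapshot.sees(version.deleted_by)

# A row updated by tx 105 while our snapshot still considers 105 in progress:
old = TupleVersion(created_by=90, deleted_by=105, data={"name": "Alice"})
new = TupleVersion(created_by=105, deleted_by=None, data={"name": "Alicia"})
snap = Snapshot(in_progress={105}, aborted=set())
print(is_visible(old, snap), is_visible(new, snap))  # True False - we still see the old name
```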
This allows Writers not to block Readers, which is good! But there's a lot of complexity and many problems hidden in this generic approach - and this is where the implementations differ (keep reading).

# Postgres vs MySQL

Finally! The core of our discussion! We have 2 popular databases with completely different approaches:

* Postgres uses Heaps and Copy-on-write MVCC
* MySQL uses Clustered Indexes and MVCC with the Undo storage

Let's see how these approaches compare and what complexity they bring in.

## Deleting old versions

So we have all these old/aborted versions of tuples. We probably want to delete them at some point? For that we need to track transactions that end, and after that we can discard old versions of tuples. After all, those old versions aren't reachable by any tx anymore.

With the Undo storage it's quite efficient - all the obsolete tuples are grouped in a single place. We still have to figure out which versions are in fact unreachable (some transactions are still in progress and may need them), but at least all the candidates are grouped together.

But with Copy-on-write, each version is a first-class citizen of the table. So we have to traverse the whole table to find the obsolete records! To solve this problem PG has a special garbage collecting mechanism - VACUUM. In systems under heavy load this process can cause problems: it can lag behind the UPDATE/DELETE operations, and it can take considerable CPU power. Both MySQL and PG start a separate thread/process for this clean-up procedure, but for PG it's certainly less efficient.

## Reading old versions

Suppose there's some old Reader that needs an old version of the tuple. What complexity goes into actually reading it?

In the case of Postgres and its COW it's trivial - old versions don't differ in any way from the latest version of the tuple. They are all in the Heap, they are all equal. You can read about the exact details [in a separate article](https://github.com/qala-io/db-course/blob/master/docs/mvcc.md).

In MySQL and its Undo storage it's more complicated:

1. We find the record in the table, and we see that this (latest) version shouldn't be visible to us. This version (v5) points to the location of the previous version (v4) in the Undo storage.
2. So we jump there, we see which attributes changed between v4 and v5 (let's say the user name changed, but the age didn't), and we merge these versions to get the state at v4. But v4 may also not be sufficiently old. So this record points to an even older v3 in the Undo storage.
3. Another jump, and we finally find the right version (v3). Now we can merge v4 with v3 (basically removing the changes brought by v4).

Phew! As you can see, in MySQL this process is more complicated and less efficient.
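Here's a toy model of that chain walk, assuming each table row keeps only the latest version plus a pointer into a list of undo records, where every undo record stores the attributes it overwrote. This is a conceptual sketch, not InnoDB's actual on-disk format; the field names and TX_IDs are invented.

```python
# Conceptual sketch of reconstructing an old version from an undo chain.

# The latest version (v5) sits in the table (clustered index):
row_v5 = {"trx_id": 105, "name": "Eve", "age": 31, "undo_ptr": 1}

# Undo records store the *previous* values of the attributes each update changed,
# plus the trx_id of that previous version and a pointer further back the chain.
undo_log = {
    1: {"prev_trx_id": 104, "old_values": {"name": "Dana"}, "undo_ptr": 0},    # v5 -> v4
    0: {"prev_trx_id": 101, "old_values": {"age": 30},      "undo_ptr": None}, # v4 -> v3
}

def reconstruct(row, visible_to_us):
    """Walk back through the undo chain until we hit a version our tx may see."""
    version = dict(row)
    while not visible_to_us(version["trx_id"]):
        undo = undo_log[version["undo_ptr"]]
        version.update(undo["old_values"])   # revert the attributes this update changed
        version["trx_id"] = undo["prev_trx_id"]
        version["undo_ptr"] = undo["undo_ptr"]
    return {k: v for k, v in version.items() if k not in ("trx_id", "undo_ptr")}

# Our snapshot can only see transactions up to 101, so we must rebuild v3:
print(reconstruct(row_v5, visible_to_us=lambda trx_id: trx_id <= 101))
# -> {'name': 'Dana', 'age': 30}
```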
## Aborting transactions and resurrecting old versions

Suppose we issued a lot of UPDATE/DELETE statements, but in the end the tx aborted. How do we return the old versions back?

In Postgres it's trivial - you actually don't do anything with the records. Just mark the tx as aborted. Other transactions will see the status, and they will know to skip those versions of the tuples. VACUUM will then have to clean up the aborted versions. So in Postgres a tx abort is basically a free operation.

But with the Undo storage of MySQL this operation is much more complicated, as it needs to go over the Undo storage and write the old versions back into the table. If we update a million rows, we have to restore a million rows! This means that the abort command may take considerable time in MySQL.

## Updating indexes

Okay, we change an attribute in the table. What happens to the indexes if the attribute is not indexed? What happens if it *is* indexed? This one is tough on both databases.

### Updating indexes in Postgres

In Postgres, as you remember, indexes reference the physical location. Updating a row will create a new record with a new physical location. So if we have 5 indexes for this table, we need to create index entries for the new version in addition to keeping the entries of the old version. Even if the indexed keys didn't change!

Okay, okay, at some point PG introduced an optimization called Heap-Only Tuples (HOT). If the new version stays within the same page as the old version, then we can keep referencing the old version (and thus make no changes in the indexes). When we find the old version, it points to the new version within the same page - and this allows us to quickly find the right version. For this to work the pages need to have spare room for extra versions - and PG has a fillfactor setting that defines how full the pages are filled initially. Again, this works only if we're lucky and the new version fits on the same page; if not - we still need to update all the indexes.

If an attribute that participates in an index changes, then we unconditionally need to change that index - add a new key that references the new version. But this part has to happen in any possible implementation - otherwise some transactions wouldn't be able to find the records.

Okay, now what about searching? There's a Reader that found all these entries by some key - how do we know which version to go for? Or even if it's just one version - how do we know this tuple is accessible to the current tx? Unfortunately, we don't. We have to actually go through the tuples, because the information about which tx created/deleted them is stored only in the Heap. This is very frustrating, because even if our index contains all the attributes needed to answer the query without jumping into the Heap (we say the index *covers* the query), we still need to access the Heap because of MVCC! It's a huge performance hit!

To alleviate the pain (partially), Postgres has another optimization. If all the tuples on a page were created by old transactions and are visible to every in-progress tx, then we can mark the whole page as "visible for all". PG keeps this information in a special bitmap called the [Visibility Map](https://www.postgresql.org/docs/current/storage-vm.html). So after finding an entry in the index, we can quickly learn whether it's on a visible-for-all page, and only if not - we jump to the Heap and check the tuple itself.

And the last thing - VACUUMing is actually more complicated than we thought:

* Before deleting obsolete tuples we need to first delete the obsolete index entries.
* And it's VACUUM that decides which pages go into the Visibility Map.
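A sketch of what an index lookup has to do because of all this, assuming a per-page "all visible" bitmap like the Visibility Map. It's a simplification of what Postgres really does - page numbers, TX_IDs and the visibility rule are invented for illustration.

```python
# Sketch: why even a "covering" index in Postgres may still touch the Heap.
# The visibility info lives in the heap tuples; the Visibility Map only
# says "every tuple on this page is visible to everyone".

index_entries = {  # key -> list of heap addresses (page_no, record_no)
    "Alice": [(0, 1), (3, 0)],
}
all_visible_pages = {0}   # pages marked in the Visibility Map
heap = {                  # (page_no, record_no) -> tuple with its MVCC header
    (0, 1): {"created_by": 50,  "deleted_by": None, "name": "Alice"},
    (3, 0): {"created_by": 105, "deleted_by": None, "name": "Alice"},
}

def index_scan(key, tuple_is_visible):
    results, heap_fetches = [], 0
    for addr in index_entries.get(key, []):
        page_no, _ = addr
        if page_no in all_visible_pages:
            results.append(addr)   # no heap visit needed
            continue
        heap_fetches += 1          # have to check the heap tuple itself
        if tuple_is_visible(heap[addr]):
            results.append(addr)
    return results, heap_fetches

# Pretend our snapshot can't see tx 105 yet:
visible = lambda t: t["created_by"] < 100 and t["deleted_by"] is None
print(index_scan("Alice", visible))  # ([(0, 1)], 1) - one heap fetch was unavoidable
```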
### Updating indexes in MySQL

MySQL faces similar problems. But instead of keeping a separate Visibility Map (which gets updated less frequently than one might hope), MySQL keeps the transaction-that-last-updated right in the page. So if there are 10 records on an index page (yes, indexes also consist of pages), and we just updated one of them - we have to update the page with the current TX_ID. This TX_ID doesn't describe each individual entry; its granularity is the whole page.

When searching in the index we can now check whether that TX_ID was committed. If yes, and if we know that all transactions before this one are guaranteed to be committed (both MySQL and PG track this), then we know for sure that this entry is accessible (in fact, all the entries on the page are accessible). If not, then we're forced to jump to the Clustered Index and check the version of that particular record. And if we are looking for a different version, then jump through the Undo and start the whole reconstruction business.

We won't go into the details here in Part I, but if 2 transactions are updating the same row, they have to wait on one another. And to do that the database needs to create Locks. In PG it's easy - we just lock the heap tuple itself, it's our single source of truth. But MySQL locks index records (both Clustered and Secondary). So when one tx issues an `update ... where name = 'a'` and another tx runs `update ... where age = 18`, MySQL somehow needs to know that both transactions arrived at the same record. For that reason MySQL either needs to put locks in all the indexes during the UPDATE, or have a way to dynamically check the locks in unrelated indexes during the SELECT. In fact MySQL does both, depending on which approach it deems more optimal in any particular situation.

And the last problem (that I know of) is that indexes are represented by structures called trees (specifically, B-trees in most cases). After the tree is modified it may require rebalancing to keep good performance. But when creating Lock objects, MySQL references the physical address of the index entry `(space_id, page_no, rec_no)`, not the *value* of the key (the key could be changed by a query, so we can't lock on the key). So imagine a situation:

1. tx1 is writing to idx_entry1, located in some portion of the tree.
2. tx2 modifies a completely unrelated record, but unfortunately this causes the tree to be rebalanced, including the portion where idx_entry1 is located.
3. During rebalancing the addresses will change, but we hold a lock on that entry.

So one tx has to wait on another tx not because they update the same records, but because the index has to be rebalanced.
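To illustrate just the idea of address-based locks, here's a toy model in Python: locks keyed by `(space_id, page_no, rec_no)` have to be migrated whenever a page reorganization moves records around. InnoDB's real lock structures and page splits are far more involved; the addresses and the migration function here are invented.

```python
# Toy model of locks keyed by the *location* of an index entry,
# not by its key value. Purely illustrative.

locks = {}  # (space_id, page_no, rec_no) -> owning tx

def lock_record(addr, tx):
    owner = locks.get(addr)
    if owner is not None and owner != tx:
        return f"tx{tx} must wait for tx{owner}"  # conflict on this address
    locks[addr] = tx
    return f"tx{tx} locked {addr}"

def rebalance_page(space_id, old_page, new_page):
    """Splitting/merging a page moves records to new addresses, so every lock
    held on the old page has to be migrated - and while that happens, other
    transactions touching this part of the tree end up waiting."""
    for (sid, page, rec), tx in list(locks.items()):
        if sid == space_id and page == old_page:
            del locks[(sid, page, rec)]
            locks[(sid, new_page, rec)] = tx

print(lock_record((0, 7, 2), tx=1))        # tx1 locks an entry on page 7
rebalance_page(0, old_page=7, new_page=9)  # an unrelated change splits page 7
print(lock_record((0, 9, 2), tx=2))        # tx2 now collides with the migrated lock
```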
# Summary

As you can see, Postgres and MySQL InnoDB are very different, even though they solve similar problems. Clearly, they chose different paths in the beginning, which made their lives complicated in different ways.

If you're interested in debugging these databases yourself, here are [some instructions](https://github.com/qala-io/db-course/blob/master/debug-databases.md). But I must warn you - MySQL code (C++) is *very* convoluted and poorly documented. Postgres source code (C), on the other hand, is well structured, relatively simple, and half of it (if not more) is documentation. The reason MySQL is more complicated is partially because the Undo storage and Clustered Indexes are more complicated architectural solutions with more problems. Another part is that they seem to spend a lot of time optimizing every possible use case, while the PG team keeps it simple. But ultimately I think it's the dev culture of the teams that had the biggest impact - after all, no matter how complicated the solution is, it's no excuse for leaving MySQL source code poorly documented.

Without running actual benchmarks it's not obvious which solution is faster. Most likely, it will depend on the use case. From the architecture itself we can tell that *probably*:

* Postgres is more optimal for append-only workloads, where UPDATEs don't happen or happen rarely. We should be especially careful updating old data, as this will make the Visibility Map useless.
* If, however, UPDATEs happen often and are spread all over the tables, my hunch is that MySQL might be faster. However, we haven't yet considered the locking mechanisms. MySQL is _very_ generous with locks, which reduces the possibility for concurrency.

In Part II we'll discuss the differences in the locking mechanisms and transaction isolation. We'll see some really weird shit there. Stay tuned!