cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0

Concurrency control (MVTO) discussion #1420

Open mbutrovich opened 6 years ago

mbutrovich commented 6 years ago

In trying to improve the performance of the CC system, I've identified several design decisions that I'd like to discuss here and possibly change in the future.

  1. We maintain and rely on a doubly-linked version chain. This is at odds with both the MVCC paper:

Since it is not possible to maintain a latch-free doubly linked list, the version chain only points in one direction.

...and our own wiki entry on our CC protocol:

The last meta-data field is the 64-bit pointer that stores the address of the neighboring (previous or next) version (if any) of a particular version.

  2. We add READs to a TransactionContext's RWSet. One of #1402's changes tries to optimize around this, but it shouldn't happen in the first place. I understand the need to add READ_OWNs, since we need to release the write lock on a tuple at transaction end, but a normal READ shouldn't be there. Addressed in #1425.

  3. Our tuple header is "full" at 64 bytes. We could always make the reserved field larger if we needed to, but spilling such a heavily-accessed structure beyond a single cache line sounds bad to me. Removing the doubly-linked version chain would buy us some space, as would exploring atomic operations on timestamps to remove the SpinLatch. I'm not convinced the latter is possible, but I am experimenting with 128-bit CAS implementations.

  4. Reusing owned tuple slots for updates and deletes seems like a premature optimization that sacrifices correctness. As currently implemented, if we repeatedly update a tuple within a single transaction and then abort, we do not have enough information to reconstruct the old versions that need to be pruned from the indexes. If we stick with reusing tuple slots, then it seems like we need more info added to the TransactionContext, like index write sets (or wait for logging).

  5. Also related to reusing owned tuple slots, I can generate an assertion failure on the master branch (debug mode) with the following SQL. Addressed in #1429.

    CREATE TABLE test(a INT PRIMARY KEY, b INT);
    BEGIN;
    INSERT INTO test VALUES (3, 30);
    UPDATE test SET a=5, b=40;

    Assertion failed: ((GetLastReaderCommitId(tile_group_header, old_location.offset) == current_txn->GetCommitId())), function PerformDelete, file .../peloton/src/concurrency/timestamp_ordering_transaction_manager.cpp, line 525.

This is similar to #1336 and #1337.

  6. We don't set the last-reader timestamp of a new version to the writer's timestamp; instead it is 0. This goes against all of the MVTO references I've found, and may be a contributor to item 5 above. Addressed in #1429.

  7. Hybrid index scan is fiddling with CC internals. This strikes me as legacy code from before we had a proper CC system, since it dates to 2016. Based on Coveralls we don't seem to exercise this codepath anyway, but anyone outside the CC system changing tuple headers is concerning.

I'd like to have a discussion on these topics and if necessary come to a conclusion on how to address them.

@yingjunwu @gandeevan

yingjunwu commented 6 years ago

Hi @mbutrovich, nice thoughts! Most of the problems you mention have been discussed before. You can find my answers below.

  1. We do use a doubly-linked version chain. Ideally, a single-direction version chain, as described in the paper, would be enough. However, during implementation we found that we had to add the reverse-direction pointers to guarantee the correctness of our latch-free implementation. You may find the reason in my code comments. (I also need to recall why I did this :-))

  2. Yes, we only need to track write sets when using MVTO. See my comments in #1402.

  3. I remember trying to replace the SpinLatch with a single CAS operation by compacting the to-be-updated data fields, but the performance didn't change much. You could try to optimize it, but I am not sure whether it really helps. I am also not sure how you want to implement the 128-bit CAS. As a main-memory DBMS, I think it would be more interesting to think about saving as much memory as possible.

  4. We never delete older tuple versions when updating a tuple, so I don't think there's any issue reconstructing the old versions when a transaction aborts. A tuple is not directly removed from an index if the deletion is performed within a transaction. But I do remember there was an issue with the index; please check my code comments or previous issues. One interesting question about multiple updates to a single tuple is how to keep track of the updated fields. You may want to talk to the logging team about this, as they want to do delta logging.

  5. I thought this was a solved problem, and I am not sure why it is happening again. We may need to double-check.

  6. Same as 5.

  7. The hybrid_index_scan code is not well maintained -- I haven't touched it in a long time, and I don't really know when it is used. This code should also be replaced by the new LLVM code.

I am open to discussing any of these issues; please let me know if there's anything I can help with. Hope my answers help :-)