apavlo opened this issue 12 years ago
I'm hooking the Markov model stuff from the VLDB paper back into the system (it got disabled when we refactored everything in the winter for the SIGMOD paper). We are currently "losing" txns during the PREPARE part of 2PC.
One problem that I didn't foresee is that we moved the transaction handle initialization into the PartitionExecutor's thread. This means that when the SpecExecScheduler looks for txns to speculatively execute, it doesn't know whether a request is single-partition or not because the InitializeTxnMessage just contains the raw bytes.
So I think that because the ClientInterface code is multi-threaded, we should still be able to scale up the system if we move the call to TransactionInitializer.initInvocation() back into HStoreSite.invocationProcess().
Moving the call to TransactionInitializer.initInvocation() back into HStoreSite was easy, but now we have another problem. The PartitionExecutor uses a single ExecutionState object to reuse a bunch of parameters that each txn needs when it runs. The system assumes that only one txn is ever in flight at a time, so therefore there is only one ExecutionState per partition. This wasn't a problem before, because we would only speculatively execute a txn after the current dtxn had finished (and therefore released the ExecutionState). So now we are trying to spec exec a txn while the dtxn is still running, which means that it will start overwriting the ExecutionState.
The way to get around this is to use a non-blocking queue of ExecutionStates per partition so that we can just grab a new one each time we spec exec a txn. We also need to check that the PartitionExecutor code uses the current txn's state object rather than the "global" one.
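Here's a rough sketch of what that per-partition pool could look like. Everything below is illustrative (the real ExecutionState has much more to it, and `clear()` is an assumed reset method), but it shows the borrow/return pattern:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Stand-in for h-store's real ExecutionState; only what the sketch needs.
class ExecutionState {
    void clear() { /* reset per-txn scratch fields here */ }
}

/** Hypothetical per-partition pool of reusable ExecutionState objects. */
class ExecutionStatePool {
    private final Queue<ExecutionState> pool = new ConcurrentLinkedQueue<>();

    /** Borrow a state for a new txn, allocating one if the pool is empty. */
    ExecutionState borrow() {
        ExecutionState s = pool.poll();
        return (s != null) ? s : new ExecutionState();
    }

    /** Reset and return a state once its txn finishes. */
    void giveBack(ExecutionState s) {
        s.clear();
        pool.offer(s);
    }
}
```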
Ok, we're all set with multiple, reusable ExecutionState objects in the PartitionExecutor. The next step is to expand on our read/write sets to figure out what single-partition txn to queue up next in the SpecExecScheduler whenever PartitionExecutor.utilityWork() is invoked.
We added the read/write conflict sets for each procedure in the catalog and the ability to keep track of which tables each txn reads/writes, but I think that it is too coarse-grained. What we really need is some way to keep track of which tables cannot be written to at a partition while the current dtxn is running without violating ACID guarantees.
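For reference, here's a minimal sketch of the table-level overlap check this implies, assuming each txn's tables are tracked as a BitSet of dense table ids (the class and method names below are made up, not the actual h-store API):

```java
import java.util.BitSet;

/** Hypothetical table-level conflict check against the running dtxn. */
class TableConflictChecker {
    /**
     * A candidate is safe to speculate only if there is no read-write or
     * write-write overlap with the dtxn's table sets at this partition.
     */
    static boolean isSafe(BitSet dtxnReads, BitSet dtxnWrites,
                          BitSet candReads, BitSet candWrites) {
        return !candWrites.intersects(dtxnWrites)
            && !candWrites.intersects(dtxnReads)
            && !candReads.intersects(dtxnWrites);
    }
}
```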
We don't need a full-fledged conflict graph to keep track of things at each partition because the distributed txn should be the one that always wins. Here's the way it should work: when a speculatively executed txn finishes, we can send its ClientResponse back right away if it's read-only, but if it modified the database we currently can't commit just its changes because of our coarse-grained undo tokens.

We need to be more intelligent about how we choose txns to speculatively execute. We should use the Markov models to avoid txns that are likely to abort (since we have to stop specexec if that txn changed something so that we can roll it back).
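A tiny sketch of that filter, assuming we can pull a per-procedure abort-probability estimate out of the Markov model (the estimator interface below is invented for illustration):

```java
/** Hypothetical filter: skip candidates the model expects to abort. */
class AbortRiskFilter {
    /** Invented stand-in for querying the Markov model. */
    interface AbortEstimator {
        double abortProbability(int procId);
    }

    private final AbortEstimator model;
    private final double maxRisk; // e.g., 0.10 = skip if >10% likely to abort

    AbortRiskFilter(AbortEstimator model, double maxRisk) {
        this.model = model;
        this.maxRisk = maxRisk;
    }

    boolean worthSpeculating(int procId) {
        return model.abortProbability(procId) <= maxRisk;
    }
}
```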
We can also try to get crafty and schedule txns based on semantic similarities. Since we have the schema tree and we know the ancestors through foreign keys, we can choose txns that access the same "group" in the tree. For example, in the TPC-C benchmark, we can have the SpecExecScheduler prefer txns that access the same district/warehouse.
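As a sketch of that affinity heuristic (all names below are hypothetical; the real SpecExecScheduler interface is different), the scheduler could scan the queue for a txn whose routing parameter matches the dtxn's and fall back to the oldest one otherwise:

```java
import java.util.Iterator;
import java.util.Queue;

/** Minimal stand-in for a queued request with its extracted parameters. */
class QueuedTxn {
    final int warehouseId; // e.g., the TPC-C W_ID routing parameter
    QueuedTxn(int warehouseId) { this.warehouseId = warehouseId; }
}

/** Hypothetical "semantic affinity" pick: prefer txns on the same warehouse. */
class AffinityPicker {
    static QueuedTxn pickNext(Queue<QueuedTxn> queue, int dtxnWarehouseId) {
        QueuedTxn fallback = null;
        for (Iterator<QueuedTxn> it = queue.iterator(); it.hasNext(); ) {
            QueuedTxn t = it.next();
            if (t.warehouseId == dtxnWarehouseId) {
                it.remove();              // best match: same warehouse group
                return t;
            }
            if (fallback == null) fallback = t;
        }
        if (fallback != null) queue.remove(fallback); // otherwise take the oldest
        return fallback;                               // null if queue is empty
    }
}
```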
Here are the next steps:
Ok here we go! Right now the only time that we will speculatively execute a transaction is when we get the 2PC:PREPARE message for a distributed transaction. We are now going to implement more aggressive scheduling for speculative transactions.
The different places where we want to try to schedule speculative transactions are:
These are essentially any place where we currently invoke PartitionExecutor.utilityWork(). Now the tricky part is that we are going to need to precompute which procedures are non-conflicting with all other procedures, so that at runtime we can peek in our queue and quickly grab the ones that we can use to fill in the gap.

In the first implementation, we are going to use the catalog to compute coarse-grained conflict sets. This might prove to be enough. We'll have to look at our workload to see how many interesting examples we have based on that. We can refine our sets using our Markov models from the 2011 VLDB paper to figure out the estimated time for each of the txns and how much time we have until the partition will no longer be idle, and then try to schedule in the most txns that way.
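One way to precompute this is a boolean matrix indexed by procedure id, built offline from the catalog's per-procedure read/write table sets, so the runtime check is a single array lookup. A minimal sketch (the ids and the construction inputs are assumptions, not the actual catalog API):

```java
import java.util.BitSet;

/**
 * Hypothetical precomputed conflict matrix: conflicts[a][b] is true if
 * procedure a may conflict with procedure b, derived offline from each
 * procedure's read/write table sets in the catalog.
 */
class ConflictMatrix {
    private final boolean[][] conflicts;

    ConflictMatrix(BitSet[] reads, BitSet[] writes) {
        int n = reads.length;
        conflicts = new boolean[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                // Conflict if either procedure writes a table the other touches.
                conflicts[a][b] = writes[a].intersects(writes[b])
                               || writes[a].intersects(reads[b])
                               || reads[a].intersects(writes[b]);
            }
        }
    }

    /** O(1) runtime check used when peeking into the queue. */
    boolean canSpeculate(int dtxnProcId, int candidateProcId) {
        return !conflicts[dtxnProcId][candidateProcId];
    }
}
```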