apavlo opened this issue 12 years ago
I'm hooking the Markov model stuff from the VLDB paper back into the system (it got disabled when we refactored everything in the winter for the SIGMOD paper). We are currently "losing" txns during the PREPARE part of 2PC.
One problem that I didn't foresee is that we moved the transaction handle initialization into the PartitionExecutor's thread. This means that when the SpecExecScheduler looks for txns to speculatively execute, it doesn't know whether a request is single-partition or not because the InitializeTxnMessage just contains the raw bytes.
So I think that because the ClientInterface code is multi-threaded, we should still be able to scale up the system if we move the call to TransactionInitializer.initInvocation() back into HStoreSite.invocationProcess().
Moving the call to TransactionInitializer.initInvocation() back into HStoreSite was easy, but now we have another problem. The PartitionExecutor uses a single ExecutionState object to reuse a bunch of parameters that each txn needs when it runs. The system assumes that only one txn is ever in flight at a time, so therefore there is only one ExecutionState per partition. This wasn't a problem before, because we would only speculatively execute a txn after the current dtxn had finished (and therefore released the ExecutionState). So now we are trying to spec exec a txn while the dtxn is still running, which means that it will start overwriting the ExecutionState.
The way to get around this is to use a non-blocking queue of ExecutionStates per partition so that we can just grab a new one each time we spec exec a txn. We also need to check that the PartitionExecutor code uses the current txn's state object rather than the "global" one.
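Here's a rough sketch of what that per-partition pool could look like. Everything below is illustrative (the real ExecutionState has much more to it, and `clear()` is an assumed reset method), but it shows the borrow/return pattern:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Stand-in for h-store's real ExecutionState; only what the sketch needs.
class ExecutionState {
    void clear() { /* reset per-txn scratch fields here */ }
}

/** Hypothetical per-partition pool of reusable ExecutionState objects. */
class ExecutionStatePool {
    private final Queue<ExecutionState> pool = new ConcurrentLinkedQueue<>();

    /** Borrow a state for a new txn, allocating one if the pool is empty. */
    ExecutionState borrow() {
        ExecutionState s = pool.poll();
        return (s != null) ? s : new ExecutionState();
    }

    /** Reset and return a state once its txn finishes. */
    void giveBack(ExecutionState s) {
        s.clear();
        pool.offer(s);
    }
}
```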
Ok, we're all set with multiple, reusable ExecutionState objects in the PartitionExecutor. The next step is to expand on our read/write sets to figure out what single-partition txn to queue up next in the SpecExecScheduler whenever PartitionExecutor.utilityWork() is invoked.
We added the read/write conflict sets for each procedure in the catalog and the ability to keep track of which tables each txn reads/writes, but I think that it is too coarse-grained. What we really need is some way to keep track of which tables cannot be written to at a partition while the current dtxn is running without violating ACID guarantees.
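For reference, here's a minimal sketch of the table-level overlap check this implies, assuming each txn's tables are tracked as a BitSet of dense table ids (the class and method names below are made up, not the actual h-store API):

```java
import java.util.BitSet;

/** Hypothetical table-level conflict check against the running dtxn. */
class TableConflictChecker {
    /**
     * A candidate is safe to speculate only if there is no read-write or
     * write-write overlap with the dtxn's table sets at this partition.
     */
    static boolean isSafe(BitSet dtxnReads, BitSet dtxnWrites,
                          BitSet candReads, BitSet candWrites) {
        return !candWrites.intersects(dtxnWrites)
            && !candWrites.intersects(dtxnReads)
            && !candReads.intersects(dtxnWrites);
    }
}
```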
We don't need a full-fledged conflict graph to keep track of things at each partition because the distributed txn should be the one that always wins. Here's the way it should work: when a speculatively executed txn finishes, we can send its ClientResponse back right away if it's read-only, but if it modified the database we currently can't commit just its changes because of our coarse-grained undo tokens.

We need to be more intelligent about how we choose txns to speculatively execute. We should use the Markov models to avoid txns that are likely to abort (since we have to stop specexec if that txn changed something so that we can roll it back).
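A tiny sketch of that filter, assuming we can pull a per-procedure abort-probability estimate out of the Markov model (the estimator interface below is invented for illustration):

```java
/** Hypothetical filter: skip candidates the model expects to abort. */
class AbortRiskFilter {
    /** Invented stand-in for querying the Markov model. */
    interface AbortEstimator {
        double abortProbability(int procId);
    }

    private final AbortEstimator model;
    private final double maxRisk; // e.g., 0.10 = skip if >10% likely to abort

    AbortRiskFilter(AbortEstimator model, double maxRisk) {
        this.model = model;
        this.maxRisk = maxRisk;
    }

    boolean worthSpeculating(int procId) {
        return model.abortProbability(procId) <= maxRisk;
    }
}
```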
We can also try to get crafty and schedule txns based on semantic similarities. Since we have the schema tree and we know the ancestors through foreign keys, we can choose txns that access the same "group" in the tree. For example, in the TPC-C benchmark, we can have the SpecExecScheduler prefer txns that access the same district/warehouse.
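As a sketch of that affinity heuristic (all names below are hypothetical; the real SpecExecScheduler interface is different), the scheduler could scan the queue for a txn whose routing parameter matches the dtxn's and fall back to the oldest one otherwise:

```java
import java.util.Iterator;
import java.util.Queue;

/** Minimal stand-in for a queued request with its extracted parameters. */
class QueuedTxn {
    final int warehouseId; // e.g., the TPC-C W_ID routing parameter
    QueuedTxn(int warehouseId) { this.warehouseId = warehouseId; }
}

/** Hypothetical "semantic affinity" pick: prefer txns on the same warehouse. */
class AffinityPicker {
    static QueuedTxn pickNext(Queue<QueuedTxn> queue, int dtxnWarehouseId) {
        QueuedTxn fallback = null;
        for (Iterator<QueuedTxn> it = queue.iterator(); it.hasNext(); ) {
            QueuedTxn t = it.next();
            if (t.warehouseId == dtxnWarehouseId) {
                it.remove();              // best match: same warehouse group
                return t;
            }
            if (fallback == null) fallback = t;
        }
        if (fallback != null) queue.remove(fallback); // otherwise take the oldest
        return fallback;                               // null if queue is empty
    }
}
```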
Here are the next steps:
Ok here we go! Right now the only time that we will speculatively execute a transaction is when we get the 2PC:PREPARE message for a distributed transaction. We are now going to implement more aggressive scheduling for speculative transactions.
The different places where we want to try to schedule speculative transactions are:
These are essentially any place where we currently invoke PartitionExecutor.utilityWork(). Now the tricky part is that we are going to need to precompute which procedures are non-conflicting with all other procedures, so that at runtime we can peek in our queue and quickly grab the ones that we can use to fill in the gap.

In the first implementation, we are going to use the catalog to compute coarse-grained conflict sets. This might prove to be enough. We'll have to look at our workload to see how many interesting examples we have based on that. We can refine our sets using our Markov models from the 2011 VLDB paper to figure out the estimated time for each of the txns and how much time we have until the partition will no longer be idle, and then try to schedule in the most txns that way.
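One way to precompute this is a boolean matrix indexed by procedure id, built offline from the catalog's per-procedure read/write table sets, so the runtime check is a single array lookup. A minimal sketch (the ids and the construction inputs are assumptions, not the actual catalog API):

```java
import java.util.BitSet;

/**
 * Hypothetical precomputed conflict matrix: conflicts[a][b] is true if
 * procedure a may conflict with procedure b, derived offline from each
 * procedure's read/write table sets in the catalog.
 */
class ConflictMatrix {
    private final boolean[][] conflicts;

    ConflictMatrix(BitSet[] reads, BitSet[] writes) {
        int n = reads.length;
        conflicts = new boolean[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                // Conflict if either procedure writes a table the other touches.
                conflicts[a][b] = writes[a].intersects(writes[b])
                               || writes[a].intersects(reads[b])
                               || reads[a].intersects(writes[b]);
            }
        }
    }

    /** O(1) runtime check used when peeking into the queue. */
    boolean canSpeculate(int dtxnProcId, int candidateProcId) {
        return !conflicts[dtxnProcId][candidateProcId];
    }
}
```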