biocore / LabControl

lab manager for plate maps and sequence flows
BSD 3-Clause "New" or "Revised" License

Are transactions rolled back during failure? #472

Open wasade opened 5 years ago

wasade commented 5 years ago

While observing #470 and #471, it became apparent that transactions may not be honored as expected. In the attached .gif, two plates are added for prepare shotgun libraries. The first plate is added with the correct primer plates and the name "FOO", while the second one has the primer plates juggled and uses the name "BAR". An exception occurs, as described in #471 because the primer plates for "BAR" are incorrect. The .gif then shows an attempt to redo prepare shotgun libraries but this time only with the first plate, using the name "FOO", with the correct primers. We observe the duplicate name violation in #470. Despite the failure, the name "FOO" and possibly other state has been committed to the database.

(attached .gif: shotgun_library_index_transactions)

AmandaBirmingham commented 5 years ago

So-o-o-o ... I tracked this down, and either this is a BIG problem in the way the code is architected or it is a SMALL problem in the way the process is described to the user via the interface; let's hope the latter.

This issue stems from the fact that although the interface allows the user to select multiple plates to prep from the same screen, the preparation of each plate is in fact tracked as part of a SEPARATE LibraryPrepShotgunProcess:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/gui/handlers/process_handlers/library_prep_shotgun_process.py#L62-L68

Each of those processes does its create within a transaction, but the transaction is not shared between the two. Thus, in this case, the FOO library prep plate (and all its attendant linked records, process, etc.) was successfully created in the db; the BAR library prep plate creation (and attendant blah blah blah) failed, but that doesn't actually affect FOO because, from the perspective of the back-end code, it is as though the user came to the library prep page, said they wanted to make the FOO prep plate, made it successfully, went away, then came back to the library prep page a second time, said they wanted to make the BAR prep plate, and failed. There is no constitutive reason why the two need to be created at the same time, or why the failure of one should affect the other--the gui just lets users make two on one page for convenience's sake.
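A minimal sketch of the failure mode described above (everything here is an illustrative stand-in, not the real handler code: `db` is an in-memory set, and `create_prep_plate()` plays the role of one process's `create()`, which runs and commits in its own transaction):

```python
# Illustrative stand-ins only; no shared transaction spans the loop.

class ProcessError(Exception):
    pass

db = set()  # stands in for committed database state

def create_prep_plate(name, primers_ok=True):
    """Stand-in for one LibraryPrepShotgunProcess.create(): each call commits."""
    if name in db:
        raise ProcessError("duplicate plate name: " + name)
    if not primers_ok:
        raise ProcessError("bad primer plates for: " + name)
    db.add(name)  # "committed" immediately; a later failure cannot undo this

def handle_post(plates):
    # The handler loops over all plates from one GUI screen, so plates
    # created before a failure stay committed.
    for name, primers_ok in plates:
        create_prep_plate(name, primers_ok)

# First request: FOO succeeds and commits, then BAR fails.
try:
    handle_post([("FOO", True), ("BAR", False)])
except ProcessError as e:
    print(e)  # bad primer plates for: BAR

# Second request: FOO alone now hits the duplicate-name failure.
try:
    handle_post([("FOO", True)])
except ProcessError as e:
    print(e)  # duplicate plate name: FOO
```

This reproduces the reported sequence: FOO survives the first failed request, so retrying it triggers the duplicate-name violation.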

We can either change the code to make it so that if one plate fails all the others fail also (even though there is no constitutive reason they need to) or we can try to make the gui communicate the situation in a more meaningful way.

AmandaBirmingham commented 5 years ago

Btw, the reason I say that if this is an architecture problem, it is a big one is because this same pattern (looping over the creates) is used in a BUNCH of the process handlers:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/gui/handlers/process_handlers/gdna_extraction_process.py#L76-L89

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/gui/handlers/process_handlers/library_prep_16s_process.py#L81-L89

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/gui/handlers/process_handlers/normalization_process.py#L54-L62

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/gui/handlers/process_handlers/pooling_process.py#L367-L396

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/gui/handlers/process_handlers/quantification_process.py#L91-L96

AmandaBirmingham commented 5 years ago

Finally, for institutional memory, here is background from my email archives on the "separate processes" architecture choice:

From: Jose Navas
Date: Tuesday, January 23, 2018 at 3:59 PM
To: "Birmingham, Amanda"
Subject: Re: Concern about change to the Labman db schema

That is actually a valid representation of the data; I did not think about it.

The way I thought about the process is that the way they extract is plate A goes to KingFisher 1, plate B goes to KingFisher 2, and then plate A and B go to EpMotion. But I think I misinterpreted the notion of process, and creating one process per plate may be a more accurate representation of the data.

I can re-work the DB to go back to the previous representation and adapt the objects and handlers to that way of representing the data. Given the amount of changes, I don't think I can have a PR until tonight though.

Does that make sense?

2018-01-23 15:22 GMT-08:00 Birmingham, Amanda : I would have envisioned those two situations (plate A and plate B were extracted during the same extraction process but plate A went to KingFisher 1 and plate B went to KingFisher 2) as two different processes. What is the impetus for treating them as the same process? Best, Amanda

From: Jose Navas
Date: Tuesday, January 23, 2018 at 2:44 PM
To: "Birmingham, Amanda"
Subject: Re: Concern about change to the Labman db schema

Hi Amanda,

One of the main issues that I had with the previous representation is that there is a 1-N relationship between the processes and the compositions. Those wells can be stored in multiple plates, and each of those plates had gone through different robots. For example, plate A and plate B were extracted during the same extraction process but plate A went to KingFisher 1 and plate B went to KingFisher 2. In that case, if you follow the composition path to the gdna_extraction_process_data you still don't know which of the 2 rows for the given extraction process was used for that sample. To know which one, you have to follow the FK connections from composition to well, then plate, and then to gdna_extraction_process_data.

Does that make sense?

If you think that was a mistake, the alternative is to replace plate with composition id, but you will still have 2 paths from process to composition, so you will also need to keep them in sync.

Do you have any other idea/alternative?

Thanks!

2018-01-23 13:31 GMT-08:00 Birmingham, Amanda : Hi, Jose, I am examining the changes made to the Labman database schema on Jan 17, and I have a concern about the addition of the gdna_extraction_process_data and library_prep_16s_process_data tables, which include plate_id fields. As mentioned in the earlier write-up, the original schema intentionally tracks process history at the well level rather than the plate level, because tracking at the plate level precludes support for cherry-picking and interleaving of multiple plates. Even if the gdna extraction and 16s library prep processes will never make use of either of these techniques, attaching plate_ids to a process means that provenance info about a process is now stored in two independent ways—both via attaching plate_id(s) to processes and also via setting upstream_process_ids on (well-level) compositions—and there is no way to know which to believe if they do not lead to the same conclusions. Could you help me understand the reasoning behind this change in strategy for the process history tracking? Thanks, Amanda

-- Jose Navas


charles-cowart commented 5 years ago

IMHO, I wouldn't call it an architectural issue so much as it's a bad pattern that's just repeated many times. I would propose just informing the user of all the failures, or removing the ability to do multiple plates on a single screen.

josenavas commented 5 years ago

Just to provide an alternative, in case the decision is to make all the process handlers work as a transaction: it is easily achievable using a decorator, as is done in Qiita:

https://github.com/biocore/qiita/blob/master/qiita_core/util.py#L115-L122
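Roughly, such a decorator just wraps the decorated function in a single TRN context (a sketch with a placeholder TRN, not the real Qiita code):

```python
from functools import wraps

class _FakeTRN:
    """Placeholder for the real TRN singleton (illustration only)."""
    def __enter__(self):
        return self
    def __exit__(self, *exc_info):
        return False  # never swallow exceptions

TRN = _FakeTRN()

def execute_as_transaction(func):
    """Sketch of the linked decorator: run the whole function body
    inside one TRN context, so all its DB work shares one transaction
    (commit on success, rollback on exception)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        with TRN:
            return func(*args, **kwargs)
    return wrapper

@execute_as_transaction
def post_handler():
    # All plate creations done here would share one transaction.
    return "created"
```

Applied to a handler's post method, this would make every plate in a single request succeed or fail together.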

charles-cowart commented 5 years ago

@josenavas Thanks! I'm surprised we aren't doing that already. I should take another look at the db layer. However, that would still mean that potentially 2-3-4 good plates are going to fail if another plate fails. It doesn't seem like we're sure yet whether we want this to happen.

AmandaBirmingham commented 5 years ago

@charles-cowart I think you've put your finger on it: a key problem here is that I'm not sure what the most desirable functionality actually IS.

AmandaBirmingham commented 5 years ago

@josenavas Thank you! Since you're here :) , I would really appreciate your help understanding how the TRN singleton from sql_connection.py acts when it is used in nested calls.

Specifically, I see that in process.py, the specific processes' create methods do all their work within a TRN context manager, like this:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/process.py#L1434-L1442

(etc) but also that the first thing they do is call process_id = cls._common_creation_steps(user), which uses a TRN context manager of its own:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/process.py#L97-L102

So my question is: does the TRN context manager in _common_creation_steps run its own independent transaction, or does it share the transaction in (for example) LibraryPrep16SProcess?

Like, say the _common_creation_steps function completed successfully, making the process record, but then the LibraryPrep16SProcess steps within that create function's transaction failed and were rolled back ... would the sql executed in _common_creation_steps ALSO be rolled back?
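For readers following along, the shape of the nesting in question is roughly this (class and method names follow the linked snippets; TRN and the method bodies are placeholders):

```python
class _FakeTRN:
    """Placeholder standing in for the real TRN singleton."""
    def __enter__(self):
        return self
    def __exit__(self, *exc_info):
        return False

TRN = _FakeTRN()

class Process:
    @classmethod
    def _common_creation_steps(cls, user):
        with TRN:  # inner TRN context
            # ... INSERT INTO process ... (elided)
            return 42  # placeholder process_id

class LibraryPrep16SProcess(Process):
    @classmethod
    def create(cls, user):
        with TRN:  # outer TRN context, already open when the inner one is entered
            process_id = cls._common_creation_steps(user)
            # ... further INSERTs that might fail ... (elided)
            return process_id
```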

charles-cowart commented 5 years ago

@AmandaBirmingham the TRN object is a singleton: https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L839

And I believe the singleton keeps a queue where all SQL statements are stored before being executed within a transaction: https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L412

Hence, the two TRN references are pointing to the same object and would share the transaction.

AmandaBirmingham commented 5 years ago

@charles-cowart :

the TRN object is a singleton

Yup, I got that :) The question I have is about the action of the TRN's context manager functionality, which I think is defined here:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L228-L230

As far as I can tell, the last step in this context manager method is to execute the queries AND COMMIT THEM by running self._connection.commit() (scroll snippet below):

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L258-L270

My naive expectation would thus be that when one exits the context manager, as is done at the end of Process._common_creation_steps() below, then the query results are committed:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/process.py#L101-L114

But clearly that would be undesirable behavior since the _common_creation_steps() method is basically always called inside ANOTHER TRN context manager (see example below), and you wouldn't want the inner work to be committed before the outer work in case you needed to roll back both.

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/process.py#L1434-L1442

Therefore I assume that I don't understand how the custom _sql_executor context manager stacks commits, and I'm hoping @josenavas (or you @charles-cowart ) will educate me :)

josenavas commented 5 years ago

@charles-cowart that's the correct intuition. Note, however, that Transaction is custom code that we wrote, since writing transactions with raw PostgreSQL was getting complex and not friendly with our OOP approach.

TRN is a context manager and it keeps track of the number of contexts entered: https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L487-L490

Note that just keeping track of the queries is not enough since you may need to execute the queries to keep doing your work, and you don't want to commit all the operations until all of them are successful.

When the last context is left (self._contexts_entered==1 in the code below), the entire transaction is cleaned up: https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L509-L520

Cleanup can mean multiple things: if an exception occurred, all the queries done in all the contexts entered are rolled back. If there are queries that have not yet been executed, they are executed. If no errors occur in all this, then the changes are committed to the DB:

https://github.com/jdereus/labman/blob/645c798e0094e9962973c2d111c4860b1e8e5b23/labcontrol/db/sql_connection.py#L492-L507

Thus, although the execute_XX methods are called multiple times, those operations are not committed until the last context is exited. This makes writing transactions super easy: if you want a set of operations to be part of a single transaction, simply wrap them in a context and it will take care of it.
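The mechanism described above can be modeled with a toy reentrant context manager (illustrative only; the real Transaction class in sql_connection.py is considerably more involved):

```python
class ReentrantTransaction:
    """Toy model: contexts can nest, work accumulates along the way,
    but commit happens only when the OUTERMOST context exits cleanly;
    an exception discards (rolls back) everything from all contexts."""
    def __init__(self):
        self._contexts_entered = 0
        self._pending = []   # stands in for executed-but-uncommitted work
        self.committed = []  # stands in for durable DB state

    def __enter__(self):
        self._contexts_entered += 1
        return self

    def add(self, sql):
        self._pending.append(sql)  # "executed" within the open transaction

    def __exit__(self, exc_type, exc, tb):
        try:
            if self._contexts_entered == 1:  # leaving the last context
                if exc_type is None:
                    self.committed.extend(self._pending)  # COMMIT
                self._pending.clear()  # on error this acts as ROLLBACK
        finally:
            self._contexts_entered -= 1
        return False  # propagate any exception

TRN = ReentrantTransaction()

def inner():
    with TRN:  # nested context: does NOT commit on exit
        TRN.add("INSERT process")

def outer(fail=False):
    with TRN:  # outermost context: commit/rollback happens here
        inner()
        if fail:
            raise RuntimeError("later step failed")
        TRN.add("INSERT plate")

outer()                   # commits both statements together
try:
    outer(fail=True)      # rolls back inner()'s work too
except RuntimeError:
    pass
```

In this model, as in the description above, the inner context in `_common_creation_steps` would not commit on its own; its work would be rolled back along with everything else if the outer context failed.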

Responding to Amanda's last comment: I agree that the code looks confusing and the self._connection.commit() at the end of the function is misleading. I don't remember the details, and I think that line of code is actually dead code. Since there is a "yield" above, it works as a return and I don't think the commit is ever executed. I would be surprised if that's not the case, since this was a critical part of the transaction functionality and it would prevent it from working as expected, and tests are in place to ensure that this is the desired behavior.

Using the specific example: the SQL queries in _common_creation_steps() are executed but not committed.

The context counting, and committing only when exiting the last context, are the basis for the execute_as_transaction decorator - it simply wraps the entire function in one TRN context, forcing the entire function to be executed in a single transaction.

charles-cowart commented 5 years ago

@AmandaBirmingham :D My apologies - you're right! - that really does look like it's committing once we leave the scope of the first 'with' block.

@josenavas Thanks for your explanation, Jose - it sounds like a good way to verify the behavior is what we want is to review and likely augment the unit tests for the sql_connection code. IMHO, this layer of code seems like what would ordinarily be handled by an ORM or similar module (SQLAlchemy, etc.), but since we have our own, the best thing is to test it and profile it to see what parts of the code are getting touched. (My $0.02. If everyone agrees, I'll make an issue out of it.)

AmandaBirmingham commented 5 years ago

Thanks, @josenavas , that explanation is extremely helpful!