Concurrent server method calls from different clients corrupt the transaction history

mattmccutchen commented 7 years ago

In a real app that uses server-side transactions, different clients may concurrently call server methods that each try to perform a transaction. These calls run in different fibers, which can interleave when they perform blocking operations. But there's only one tx._transaction_id variable on the entire server, so if the second call starts before the first call finishes, it will try to use the (unrelated) transaction from the first call instead of starting a separate one. I did a simple test and the second call continued to add items to the transaction while the first call was committing it, which corrupted the transaction in such a way that it couldn't be undone. My test had tx.requireUser = false, but as far as I can tell, the same would happen even with tx.requireUser = true and different user IDs.

At a minimum, this package should have a separate notion of the "current server-side transaction" for each client, mirroring the behavior of client-side transactions. For apps where the server uses unblock to process multiple method calls from the same client in parallel, it might make sense for these to have independent transactions too, though I'm not sure what the analogous functionality for client-side transactions would be.

The output of my test:

I20161203-10:13:18.977(-5)? client a: tx.start begin
I20161203-10:13:19.030(-5)? Started "last action" with transaction_id: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.031(-5)? client a: tx.start end
I20161203-10:13:19.031(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.031(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.032(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.032(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.032(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.033(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.033(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.033(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.033(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.034(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.034(-5)? client a: tx.commit begin
I20161203-10:13:19.036(-5)? Beginning commit with transaction_id: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.039(-5)? client b: tx.start begin
I20161203-10:13:19.041(-5)? An attempt to start a transaction ("undefined") was made when a transaction was already open. Open transaction_id: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.041(-5)? client b: tx.start end
I20161203-10:13:19.044(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.045(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.045(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.046(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.047(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.048(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.050(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.050(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.051(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.051(-5)? Pushed insert command to stack: eZSBkRwoGQuQaSyRW
I20161203-10:13:19.052(-5)? client b: tx.commit begin
I20161203-10:13:19.052(-5)? client b: tx.commit end
I20161203-10:13:19.228(-5)? Executed insert
I20161203-10:13:19.242(-5)? Executed insert
I20161203-10:13:19.261(-5)? Executed insert
I20161203-10:13:19.272(-5)? Executed insert
I20161203-10:13:19.282(-5)? Executed insert
I20161203-10:13:19.300(-5)? Executed insert
I20161203-10:13:19.309(-5)? Executed insert
I20161203-10:13:19.328(-5)? Executed insert
I20161203-10:13:19.345(-5)? Executed insert
I20161203-10:13:19.356(-5)? Executed insert
I20161203-10:13:19.368(-5)? Executed insert
I20161203-10:13:19.379(-5)? Executed insert
I20161203-10:13:19.389(-5)? Executed insert
I20161203-10:13:19.401(-5)? Executed insert
I20161203-10:13:19.411(-5)? Executed insert
I20161203-10:13:19.436(-5)? Executed insert
I20161203-10:13:19.450(-5)? Executed insert
I20161203-10:13:19.459(-5)? Executed insert
I20161203-10:13:19.480(-5)? Executed insert
I20161203-10:13:19.491(-5)? Executed insert
I20161203-10:13:19.502(-5)? Commit reset transaction manager to clean state
I20161203-10:13:19.507(-5)? client a: tx.commit end
I20161203-10:13:40.402(-5)? Exception while invoking method '_meteorTransactionsUndo' TypeError: Cannot read property 'findOne' of undefined
I20161203-10:13:40.403(-5)?     at packages/babrahams_transactions.js:990:47
I20161203-10:13:40.403(-5)?     at Array.forEach (native)
I20161203-10:13:40.405(-5)?     at Function._.each._.forEach (packages/underscore.js:139:11)
I20161203-10:13:40.405(-5)?     at [object Object].Transact._checkTransactionFields (packages/babrahams_transactions.js:976:5)
I20161203-10:13:40.405(-5)?     at [object Object].Meteor.methods._meteorTransactionsUndo (packages/babrahams_transactions.js:1524:14)
I20161203-10:13:40.406(-5)?     at maybeAuditArgumentChecks (packages/ddp-server/livedata_server.js:1711:12)
I20161203-10:13:40.406(-5)?     at packages/ddp-server/livedata_server.js:711:19
I20161203-10:13:40.406(-5)?     at [object Object]._.extend.withValue (packages/meteor.js:1122:17)
I20161203-10:13:40.406(-5)?     at packages/ddp-server/livedata_server.js:709:40
I20161203-10:13:40.407(-5)?     at [object Object]._.extend.withValue (packages/meteor.js:1122:17)

JackAdams commented 7 years ago

Thanks, Matt. You're absolutely right. Massive design flaw in this package. I've observed this happening in a production app and it's not pretty.

I'm wide open to suggestions about how to fix this, but freely admit I don't have the depth of technical knowledge to implement this properly. (A package like this, dealing with db mutations and race conditions, really shouldn't be written and maintained by a high school math teacher hobby coding in evenings and weekends.)

In reality, a from-the-ground-up re-think and re-write is probably required.

mattmccutchen commented 7 years ago

This issue by itself shouldn't take much re-writing to fix (although I don't know what other problems might lurk that we might run into over time). One would take all the properties that make up the "transaction manager state", which are currently directly on tx, and move them to a separate class, perhaps called TxState. On the client, the tx object would have just one TxState, while on the server, it would have a map from Connection to TxState. tx would have a helper method to get the correct TxState for the currently executing code: on the client, the single one, or on the server, looked up based on Meteor._CurrentInvocation.get().connection (equivalent to this.connection in the Meteor method, but our code won't have access to this of the Meteor method). My team is planning to use this package to add undo to our app, so I'll probably implement this fix at least for our own use, but I'm not sure how much additional work might be needed to meet your standards (e.g., tests).

JackAdams commented 7 years ago

Well, that's interesting. That's exactly where my mind went once I'd thought about the issue for a bit. I was planning to start separating out the code into two classes later today, do the map from connection to TxState instance and everything! However, if you're planning on doing this anyway ... :-) I imagine you'll make better design choices. Like I said, I'm not a full-time developer and you seem to have grokked the workings of the package pretty well already. The only standard I set for the package before each release is that it passes the (fairly limited) set of integration tests set up via the test app included in the package.

JackAdams commented 7 years ago

Hi Matt, Were you planning to go ahead with this refactor? If not, I'll probably start work on it in a week or two's time. If you are going to do it, there's an open pull request I should probably merge before you start. Cheers Brent

mattmccutchen commented 7 years ago

Sorry for the slow response. I'm not going to make the change unless and until I get our app to work with the transactions package in all other respects. I'm running into many issues, and it could be several more days before I know. You can start on the change now if you like. I'll check with you before starting on it myself.

JackAdams commented 7 years ago

Okay. I'll have a crack at it over the next week or two. I'm interested in hearing about the other issues with the package, as I'm using it in production at the moment and these will, no doubt, come back to bite me at some stage.

Honestly, having had a quick look at your spreadsheet app, I don't know if this package has the goods to support what you're implementing. It's pretty limited in its set of capabilities (only supports a handful of mongo operators, no multi: true, no upserts, etc.).

I am very grateful that you've highlighted this issue. It's something that does need immediate attention.

mattmccutchen commented 7 years ago

Honestly, having had a quick look at your spreadsheet app, I don't know if this package has the goods to support what you're implementing. It's pretty limited in its set of capabilities (only supports a handful of mongo operators, no multi: true, no upserts, etc.).

Indeed. My assessment was that I would probably be better off adding the missing features than starting from scratch and getting bogged down in tons of design decisions (this is a weakness of mine). If I could upstream the enhancements and benefit from any future improvements to the package that are relevant to us (I don't know if that's likely), that's a bonus. However, there are a few downsides:

Without fully understanding the original design, it's hard for me to make the new features work in the full range of scenarios the existing code attempts to support. At first I was trying to do that, but it became impractical and I started to hold my nose and write code that only works in our scenarios. That's still a fine solution for my app but means we're farther from being able to upstream the changes.
Performance may be poorer and debugging may be more difficult than in a simpler package that does just what we need.

Unfortunately, it's impossible to know in advance whether I made the right decision. I'll keep you posted.

mattmccutchen commented 7 years ago

OK, I believe I have all the other issues worked out for the purpose of our app (I'll be filing issue reports soon), so I just need the concurrency fix before I'll be ready to merge the babrahams:transactions adoption to master of our app. @JackAdams, have you made any progress yet? If not, I'll start on the concurrency fix myself. I'm not worried about conflicts with #66; it doesn't look like they'll be hard to resolve.

JackAdams commented 7 years ago

Just yesterday I got an internet connection working at the remote place I'm staying. I'll get to work on this package today. Let me have a go at doing the concurrency fix and, if I get stuck, I'll reach out for help.

JackAdams commented 7 years ago

Hi Matt, I've been working on this and can pretty much see my way through, but I'm having trouble with the key part: Meteor._CurrentInvocation.get().connection -- I just can't seem to find it anywhere. I'll take a look at the Meteor source in a minute, but just thought I'd see if you have any pointers about how to get the unique identifier for the current connection when in an arbitrary context on the server. Cheers Brent

JackAdams commented 7 years ago

Sorry, ignore that. Should have just looked at the source without bothering you. It's DDP._CurrentInvocation not Meteor._CurrentInvocation.

JackAdams commented 7 years ago

Closing in on a solution. Just have to make sure transient transaction instances are cleaned up properly so we don't get memory leaks.

JackAdams commented 7 years ago

I hope this (0.8.0) works. It's passing all tests and I've given it a run in a pretty demanding app (in terms of complex transactions). I haven't published to Atmosphere yet, but will soon.

Use tx.showConnection = true; somewhere in common code to see which connection is being used (shown alongside each message being logged to the console from the transient, per-connection transaction objects).

JackAdams commented 7 years ago

Okay. This (0.8.0) is published on atmosphere. Thanks again for bringing this issue up and giving me the necessary push (and idea -- i.e. DDP._CurrentInvocation) to get it fixed. It's of pretty fundamental importance to the package working reliably.

mattmccutchen commented 7 years ago

Belated confirmation that the fix is working in our app. Thanks! Since I got the initial integration working on December 22, I've been working intensely on our app: I built an extra feature ("transactional publishing"; I'll write about it on forums.meteor.com soon) and did major reworking to reduce the performance hit of transactions and make the related code understandable (it had become unmanageably complex). I finally finished today, so I'll file the rest of the issues.

JackAdams commented 7 years ago

Hi Matt, Thanks for the confirmation. I'm really interested in your work -- looking forward to the forum post and the issues you'll file. Is the reduction of the performance hit specific to your app, or is it something that could be generalized and made a part of this package? Cheers Brent

mattmccutchen commented 7 years ago

Here is the promised writeup on transactional publishing.

Is the reduction of the performance hit specific to your app, or is it something that could be generalized and made a part of this package?

In essence, our app was making repeated writes to the same documents during a transaction, which were already slow but became even slower with this package. Now our app works on temporary collections and then sends the diff down to the original collections (implementation). This technique is potentially useful for any app with a notion of a transaction that makes repeated writes to the same documents during a transaction, whether or not the app uses this package for transaction history and recoverability. Of course, if the app uses this package and uses nonlinear history, the diffing may change the semantics of the saved transaction. I'm not sure whether it would make sense to offer the diffing technique in this package. See also #81.

JackAdams / meteor-transactions

Concurrent server method calls from different clients corrupt the transaction history #63