Closed: oliverlockwood closed this issue 3 years ago
I see the very latest version (7.2.0) has this:
Added db.waitForPropagation method
This method helps with setting up databases in a cluster scenario by waiting for a request to succeed on every known coordinator.
Perhaps transactions in a clustered scenario with ROUND_ROBIN would have to take this kind of approach? Or is there a better way?
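For context, here is a minimal sketch of how db.waitForPropagation is called, based on the arangojs 7.2 documentation (the URL and timeout are illustrative):

import { Database } from "arangojs";

async function waitForCluster() {
  const db = new Database({ url: "http://localhost:8529" });
  // retries the given request against every known coordinator until it
  // succeeds on all of them, or until the 30-second timeout elapses
  await db.waitForPropagation({ path: "/_api/version" }, 30000);
}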
@dothebart Any idea what might be causing this? I thought coordinators forward requests to the correct coordinator when dealing with a streaming transaction. I could make the host sticky for transactions but this wouldn't solve the problem when working with third-party load balancers.
Yes, performance-wise, talking to the same coordinator in subsequent requests is the best choice. However, coordinators should forward the requests to the respective hosts that "own" the transaction. I'm completely missing any information about the installation itself: version? OS? etc. See the main arangodb issue template.
@dothebart The cluster is running version 3.7.2 of ArangoDB, downloaded from https://download.arangodb.com/arangodb37/DEBIAN/amd64/arangodb3_2.7.2-1_amd64.deb
We wrap this in our own Docker image, which then runs under Ubuntu 18.04, with separate containers for server, co-ordinator and agents. We run an instance of each of these on a given node, and have 3 nodes in total. Please advise if you require further details.
The coordinators need to be able to talk to each other in order to forward requests. Try that from within the containers.
@dothebart can you clarify, please - are you asking me to check connectivity between the different co-ordinator containers?
Yes, that was what I was asking for. I.e. cursors live on one coordinator. If you hit the wrong one, it has to forward your request. If it's not able to reach the other coordinator (either by not knowing the right name or due to TCP connectivity issues) it won't be able to forward it.
Replying on @oliverlockwood's behalf here:
@dothebart the coordinators can indeed talk to each other - I've verified that all three can reach all three from within the container. I've used IP addresses and ports as listed in the list of coordinators in the UI for this.
Hi, can you upgrade to 3.7.6? We have at least one fix in there regarding the forwarding between coordinators.
We're having this same problem on 3.7.6, but only if we use 'exclusive' collections. We can reliably reproduce this by using ArangoDB Starter in local test mode (./arangodb --starter.local on a local developer machine) and the following script:
import { Database } from 'arangojs';

async function test() {
  // assumed connection setup; the original snippet left `db` undefined
  const db = new Database({ url: 'http://localhost:8529' });

  async function insertData(db: Database) {
    const trx = await db.beginTransaction({
      // swapping 'write' for 'exclusive' works perfectly, but we need exclusive transactions, so...
      // write: ['test']
      exclusive: ['test'],
    });
    await trx.step(() => db.collection('test').save({ enabled: true }));
    await trx.commit();
  }

  // Simulate multiple concurrent requests in a server.
  // This reliably causes 'transaction not found'
  const promises = [] as Promise<any>[];
  for (let i = 0; i < 40; i++) {
    promises.push(insertData(db));
  }
  await Promise.all(promises);
}
Notably, if we exchange exclusive for write, there's no problem.
This is a consistent issue in production, to the point that we've had to fall back to server-side transactions, which work fine with exclusive collections. That's not a tenable approach in the long term, so if there's any other advice, we'd love to hear it.
@dothebart That doesn't sound to me like something arangojs can address. As I understand it, which coordinator owns a transaction should be irrelevant to the client, as requests should be forwarded appropriately even when using exclusive collections.
Yes @pluma, that's exactly the case. With 3.7.6 we added another bugfix for this.
But this has nothing to do with write/write conflicts, and locking the collections is therefore the right way to work with this; however, the server-side transaction isn't a wrong approach either.
So unless you can find a reproducible reason in your environment why coordinators wouldn't forward requests between each other, I don't see any way to continue here other than closing.
Please try using tcpdump / ngrep / wireshark to find out why request forwarding doesn't work and report back. If DNS is in the game, this could also be a source of problems.
To clarify my original comment, I was able to reproduce this locally, on the same machine as the running script, with the client connected via localhost, using ArangoDB starter in local test mode, which should rule out any network-related problems.
[some time passes as I want to confirm something but don't want to lose my train of thought]
However, I now have reason to believe that this is an issue with arangojs itself. If I modify my example so that it only runs one transaction at a time in a process:
for (let i = 0; i < 40; i++) {
  await insertData(db);
  // TEMP: test with only 1 txn at a time
  // promises.push(insertData(db));
}
And spin up multiple different processes all operating in parallel:
# bash script that runs test file
$ ./run-test.sh & ./run-test.sh & ./run-test.sh & ./run-test.sh & ./run-test.sh & ./run-test.sh & ./run-test.sh
All transactions complete successfully.
Given the evidence, is it possible that arangojs is accidentally sending an already-completed transaction ID for some requests? I can imagine the combination of a database connection pool and header-based transaction IDs causing this kind of issue within a single process, as it's possible for a transaction ID header to be swapped in/out at the last second for another request. Or maybe it's something else entirely: I don't know how arangojs handles this internally, and I'm completely unqualified to give recommended solutions!
Addendum: I suspect the reason I noticed this with exclusive collection transactions is that those tend to take longer to complete, which may point to the cause... somehow...
Ok, thanks - one more detail: local cluster or local single server instance? And if you've tried with a cluster - will it happen with a single server too?
The above results were tested using a local cluster.
I've rerun my tests with a local single server instance and those results follow below:
Attempting 10 concurrent transactions in a single process fails with "transaction not found" after ~70 seconds. Only 3 documents were written to the collection.
Attempting 10 sequential transactions within a single process completes successfully and almost immediately: ~2s, and most of that is script start-up time.
Attempting 10 sequential transactions within 4 concurrent processes completes successfully within ~6s.
I can imagine the combination of a database connection pool and header-based transaction IDs causing this kind of issue within a single process, as it's possible for a transaction ID header to be swapped in/out at the last second for another request.
This doesn't seem likely. The header is set and cleared synchronously and JavaScript is single-threaded.
FWIW the header is read when Connection::request is invoked and then stored in the Task that gets pushed onto the connection's queue. This method is generally invoked synchronously (except in a handful of methods that perform multiple requests) so the header can't leak beyond the single request. There's no difference between round-robin and none load balancing as far as this part of arangojs (i.e. the task construction) is concerned.
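To illustrate that argument, here is a conceptual sketch (not arangojs's actual internals; every name except the x-arango-trx-id header is made up):

type Task = {
  headers: Record<string, string>;
  send: () => Promise<void>;
};

const queue: Task[] = [];
let activeTrxId: string | undefined; // set for the duration of a trx.step() call

function request(path: string, send: () => Promise<void>): void {
  // the header value is copied into the task synchronously, before any other
  // code can run; because JavaScript is single-threaded, activeTrxId cannot
  // be swapped out between this read and the push below
  const headers: Record<string, string> = {};
  if (activeTrxId !== undefined) headers["x-arango-trx-id"] = activeTrxId;
  queue.push({ headers, send });
}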
EDIT: Wait, I just re-read your original question. Are you still on 6.x? The above is true for 7.x but I'll have to see how 6.x might differ.
EDIT2: Okay, I don't see any reason this should be a bug in 6.x either but please try reproducing the issue in 7.x first.
I was able to replicate this based on your code (with adjustments):
async function test() {
  async function insertData(db) {
    const started = Date.now();
    const trx = await db.beginTransaction(
      {
        // swapping 'write' for 'exclusive' works perfectly, but we need exclusive transactions, so...
        // write: ['test']
        exclusive: ["test"],
      },
      {}
    );
    console.log(trx.id, "begun");
    await trx.get();
    try {
      await trx.step(() => db.collection("test").save({ enabled: true }));
    } catch (e) {
      console.log(trx.id, "failed to save after", Date.now() - started, "ms:", String(e));
      return;
    }
    try {
      await trx.commit();
    } catch (e) {
      console.log(trx.id, "failed to commit after", Date.now() - started, "ms:", String(e));
      return;
    }
    console.log(trx.id, "committed");
  }
  const db = require(".")({
    url: "http://localhost:8530",
    loadBalancingStrategy: "NONE",
  });
  await db.acquireHostList();
  console.log(db._connection._hosts);
  // Simulate multiple concurrent requests in a server.
  // This reliably causes 'transaction not found'
  const promises = [];
  for (let i = 0; i < 40; i++) {
    promises.push(insertData(db));
  }
  await Promise.all(promises);
}
test();
It doesn't matter if I do "NONE" or "ROUND_ROBIN", the result is the same:
[
[Function: request] { close: [Function: close] },
[Function: request] { close: [Function: close] },
[Function: request] { close: [Function: close] },
[Function: request] { close: [Function: close] }
]
1099543673032 begun
1099543673036 begun
1099543673048 begun
1099543673040 begun
1099543673044 begun
1099543673052 begun
1099543673056 begun
1099543673064 begun
1099543673068 begun
1099543673060 begun
1099543673072 begun
1099543673076 begun
1099543673084 begun
1099543673080 begun
1099543673092 begun
1099543673088 begun
1099543673100 begun
1099543673096 begun
1099543673108 begun
1099543673104 begun
1099543673116 begun
1099543673112 begun
1099543673124 begun
1099543673120 begun
1099543673132 begun
1099543673128 begun
1099543673140 begun
1099543673136 begun
1099543673148 begun
1099543673144 begun
1099543673156 begun
1099543673160 begun
1099543673152 begun
1099543673164 begun
1099543673168 begun
1099543673176 begun
1099543673172 begun
1099543673184 begun
1099543673180 begun
1099543673188 begun
1099543673052 failed to save after 11055 ms: ArangoError: transaction '1099543673052' not found
1099543673044 failed to save after 11057 ms: ArangoError: transaction '1099543673044' not found
1099543673064 failed to save after 11057 ms: ArangoError: transaction '1099543673064' not found
1099543673056 failed to save after 11058 ms: ArangoError: transaction '1099543673056' not found
1099543673060 failed to save after 11059 ms: ArangoError: transaction '1099543673060' not found
1099543673068 failed to save after 11060 ms: ArangoError: transaction '1099543673068' not found
1099543673076 failed to save after 11061 ms: ArangoError: transaction '1099543673076' not found
1099543673072 failed to save after 11062 ms: ArangoError: transaction '1099543673072' not found
1099543673080 failed to save after 11064 ms: ArangoError: transaction '1099543673080' not found
1099543673084 failed to save after 11065 ms: ArangoError: transaction '1099543673084' not found
1099543673088 failed to save after 11065 ms: ArangoError: transaction '1099543673088' not found
1099543673092 failed to save after 11066 ms: ArangoError: transaction '1099543673092' not found
1099543673096 failed to save after 11067 ms: ArangoError: transaction '1099543673096' not found
1099543673100 failed to save after 11068 ms: ArangoError: transaction '1099543673100' not found
1099543673104 failed to save after 11069 ms: ArangoError: transaction '1099543673104' not found
1099543673108 failed to save after 11070 ms: ArangoError: transaction '1099543673108' not found
1099543673112 failed to save after 11071 ms: ArangoError: transaction '1099543673112' not found
1099543673116 failed to save after 11072 ms: ArangoError: transaction '1099543673116' not found
1099543673120 failed to save after 11073 ms: ArangoError: transaction '1099543673120' not found
1099543673132 failed to save after 11073 ms: ArangoError: transaction '1099543673132' not found
1099543673124 failed to save after 11073 ms: ArangoError: transaction '1099543673124' not found
1099543673128 failed to save after 11074 ms: ArangoError: transaction '1099543673128' not found
1099543673140 failed to save after 11075 ms: ArangoError: transaction '1099543673140' not found
1099543673148 failed to save after 11076 ms: ArangoError: transaction '1099543673148' not found
1099543673136 failed to save after 11077 ms: ArangoError: transaction '1099543673136' not found
1099543673156 failed to save after 11078 ms: ArangoError: transaction '1099543673156' not found
1099543673144 failed to save after 11079 ms: ArangoError: transaction '1099543673144' not found
1099543673152 failed to save after 11079 ms: ArangoError: transaction '1099543673152' not found
1099543673160 failed to save after 11080 ms: ArangoError: transaction '1099543673160' not found
1099543673168 failed to save after 11082 ms: ArangoError: transaction '1099543673168' not found
1099543673164 failed to save after 11083 ms: ArangoError: transaction '1099543673164' not found
1099543673172 failed to save after 11084 ms: ArangoError: transaction '1099543673172' not found
1099543673176 failed to save after 11084 ms: ArangoError: transaction '1099543673176' not found
1099543673180 failed to save after 11085 ms: ArangoError: transaction '1099543673180' not found
1099543673184 failed to save after 11086 ms: ArangoError: transaction '1099543673184' not found
1099543673032 failed to commit after 11090 ms: ArangoError: transaction was already aborted
1099543673188 failed to save after 11087 ms: ArangoError: transaction '1099543673188' not found
1099543673040 committed
1099543673048 committed
1099543673036 committed
That the "not found" error messages include the correct transaction numbers indicates that the problem has nothing to do with the headers.
@dothebart I used scripts/startLocalCluster.sh with 3 coords/dbs/agents and created a document collection called test via the web UI. What strikes me as suspicious is that all of the transactions seem to stall for about 10 seconds while executing the insert. The three successful commits only occur after the rest have failed, and one of the failures happens during the commit, suggesting the transaction went away.
Also note: this only happens when I use exclusive. If I change that to write, it returns almost immediately. I'm testing against the 3.7 branch, btw. To me this seems like the exclusive transactions run into a deadlock and time out.
Also note: I can even replicate this with a single coordinator. I strongly suspect this is a locking issue.
EDIT: The above has the first coordinator twice in the connection pool as locally the agency returns [::1] instead of localhost, but that's orthogonal to the problem.
EDIT2: With a standalone server I'm seeing exactly 10 seconds between each "begun" log entry and then timeouts after about 30s (this is with 6 instead of 40 transactions because I didn't want to wait 400 seconds; I've also adjusted the output to include Date()):
Wed Feb 10 2021 13:48:10 GMT+0100 (Central European Standard Time) [ [Function: request] { close: [Function: close] } ]
Wed Feb 10 2021 13:48:10 GMT+0100 (Central European Standard Time) 329 begun
Wed Feb 10 2021 13:48:22 GMT+0100 (Central European Standard Time) 332 begun
Wed Feb 10 2021 13:48:32 GMT+0100 (Central European Standard Time) 331 begun
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 344 begun
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 329 failed to save after 31741 ms: ArangoError: transaction '329' not found
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 332 failed to save after 31741 ms: ArangoError: transaction '332' not found
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 331 failed to save after 31741 ms: ArangoError: transaction '331' not found
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 344 saved
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 344 committed
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 330 begun
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 330 saved
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 330 committed
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 337 begun
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 337 saved
Wed Feb 10 2021 13:48:42 GMT+0100 (Central European Standard Time) 337 committed
Here's the 3x3x3 cluster version with Date() timestamps:
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) [
[Function: request] { close: [Function: close] },
[Function: request] { close: [Function: close] },
[Function: request] { close: [Function: close] },
[Function: request] { close: [Function: close] }
]
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667948 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667952 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667964 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667956 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667960 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667968 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667972 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667980 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667984 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667976 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667988 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667992 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667996 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668000 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668004 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668008 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668012 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668016 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668020 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668024 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668028 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668032 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668036 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668040 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668044 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668048 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668052 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668056 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668060 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668064 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668068 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668072 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668076 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668084 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668080 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668092 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668088 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668100 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668096 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511668104 begun
Wed Feb 10 2021 13:54:18 GMT+0100 (Central European Standard Time) 1099511667948 saved
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667952 saved
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667960 failed to save after 10666 ms: ArangoError: transaction '1099511667960' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667968 failed to save after 10665 ms: ArangoError: transaction '1099511667968' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667972 failed to save after 10666 ms: ArangoError: transaction '1099511667972' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667980 failed to save after 10667 ms: ArangoError: transaction '1099511667980' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667984 failed to save after 10667 ms: ArangoError: transaction '1099511667984' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667976 failed to save after 10669 ms: ArangoError: transaction '1099511667976' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667988 failed to save after 10669 ms: ArangoError: transaction '1099511667988' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667992 failed to save after 10670 ms: ArangoError: transaction '1099511667992' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667996 failed to save after 10671 ms: ArangoError: transaction '1099511667996' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668000 failed to save after 10671 ms: ArangoError: transaction '1099511668000' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668004 failed to save after 10672 ms: ArangoError: transaction '1099511668004' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668008 failed to save after 10673 ms: ArangoError: transaction '1099511668008' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668012 failed to save after 10674 ms: ArangoError: transaction '1099511668012' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668016 failed to save after 10675 ms: ArangoError: transaction '1099511668016' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668020 failed to save after 10676 ms: ArangoError: transaction '1099511668020' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668024 failed to save after 10677 ms: ArangoError: transaction '1099511668024' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668028 failed to save after 10677 ms: ArangoError: transaction '1099511668028' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668032 failed to save after 10678 ms: ArangoError: transaction '1099511668032' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668036 failed to save after 10679 ms: ArangoError: transaction '1099511668036' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668040 failed to save after 10680 ms: ArangoError: transaction '1099511668040' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668044 failed to save after 10681 ms: ArangoError: transaction '1099511668044' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668048 failed to save after 10682 ms: ArangoError: transaction '1099511668048' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668052 failed to save after 10683 ms: ArangoError: transaction '1099511668052' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668056 failed to save after 10684 ms: ArangoError: transaction '1099511668056' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668060 failed to save after 10685 ms: ArangoError: transaction '1099511668060' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668064 failed to save after 10685 ms: ArangoError: transaction '1099511668064' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668068 failed to save after 10686 ms: ArangoError: transaction '1099511668068' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668072 failed to save after 10687 ms: ArangoError: transaction '1099511668072' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668076 failed to save after 10688 ms: ArangoError: transaction '1099511668076' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668084 failed to save after 10690 ms: ArangoError: transaction '1099511668084' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668080 failed to save after 10691 ms: ArangoError: transaction '1099511668080' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668092 failed to save after 10691 ms: ArangoError: transaction '1099511668092' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668088 failed to save after 10692 ms: ArangoError: transaction '1099511668088' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668100 failed to save after 10693 ms: ArangoError: transaction '1099511668100' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668096 failed to save after 10694 ms: ArangoError: transaction '1099511668096' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511668104 failed to save after 10695 ms: ArangoError: transaction '1099511668104' not found
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667948 failed to commit after 10699 ms: ArangoError: transaction was already aborted
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667952 committed
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667956 saved
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667964 saved
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667956 committed
Wed Feb 10 2021 13:54:29 GMT+0100 (Central European Standard Time) 1099511667964 committed
Note that the first save happens within the same second as all the "begins" and the failures all happen within the same second as the second save.
EDIT: This behavior seems to be somewhat stable in this setup for me, even after fixing the duplicate coordinator in the connection pool by replacing "localhost" with [::1] in the initial URL.
EDIT2: Okay, I don't see any reason this should be a bug in 6.x either but please try reproducing the issue in 7.x first.
We're using 7.2 here!
NB: different entity than thread starter, but identical problem. :)
@radicaled I was able to reproduce the problem as above and added a branch with a testcase which @dothebart is investigating.
@radicaled After checking back with the team I was able to confirm that exclusive transactions will block on begin until they can acquire a lock or time out. So what is likely happening is that the first begin acquires the lock and returns, but then the arangojs connection pool immediately fills up with all the parallel requests for beginning transactions that need to acquire the same lock, preventing the subsequent steps of the first transaction from being passed to the connection pool.
The naive solution is to increase the connection pool limit by setting the agentOptions.maxSockets config option to a higher value. The problem with this approach is that it just postpones the problem until you hit the limit again.
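For reference, the naive workaround looks something like this (a sketch only; the URL and limit are illustrative):

import { Database } from "arangojs";

const db = new Database({
  url: "http://localhost:8529",
  agentOptions: {
    // must stay above the peak number of transactions begun in parallel,
    // otherwise begin requests can again occupy every socket
    maxSockets: 64,
  },
});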
The better solution in my mind is to evaluate if you really need to run this many exclusive transactions in parallel.
As exclusive transactions can by their nature only be processed sequentially (i.e. each consecutive begin will block until it times out or the running transaction commits/aborts), there is no benefit to running them in parallel. Even with a larger connection pool you may still run into timeouts on the later transactions, as their wait time will be the sum of the execution and wait times of all prior transactions initiated in parallel with them.
If you don't need the isolation guarantees of exclusive transactions you could switch to regular transactions instead. Otherwise I'd recommend trying to perform them sequentially rather than in parallel, if possible. Depending on your use case you could also switch to using JS transactions, which only require a single request to initiate and execute.
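As an example of the sequential approach, here is a minimal sketch reusing the insertData helper from the reproduction script above (assumed to be in scope):

async function runSequentially(db: Database): Promise<void> {
  for (let i = 0; i < 40; i++) {
    // at most one exclusive transaction is in flight at any time, so no
    // begin request ever has to wait on a lock held by a sibling
    await insertData(db);
  }
}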
@pluma we've seen this issue with pretty low load levels (one or two human users initiating operations which make use of transactions, through a UI). There's unlikely to be a significant number of transactions concurrently in flight, so I'm not convinced this is the problem (or at least, not the only problem in this area).
As I think I mentioned earlier, we've switched to server-side transactions to be able to move forward, but I still strongly believe that the underlying issues should be fixed if the system aims to support the other transactional model.
@pluma
While the reproduction sample I provided is high-concurrency, the problem first occurred in my development environment under very low traffic: there were fewer than 5 users active in the development environment when the problem was first revealed. Afterward I was able to reproduce it sporadically locally myself, as a single user, but a sporadic repro isn't a good enough test case for a GitHub issue. The real-life transaction that first brought this issue to my attention was a fairly simple 1-read, 1-write exclusive transaction, under very low traffic circumstances, with very few possible concurrent transactions pending via arangojs.
Since these transactions are fine when done via server-side transactions -- and also blazing fast -- it feels like this is a problem with streaming transactions in general. Even in the high-concurrency example, I would expect arangojs or ArangoDB to process them sequentially until the queue is drained or some arbitrary timeout hits. And given that the example reproduction is a single collection write, I wouldn't expect any timeouts.
So, I guess I'm asking: what's the end analysis here? Can arangojs only "queue up" 1-2 exclusive streaming transactions at a time, and is it expected that all other transaction attempts submitted will fail with a 30-second timeout?
On the subject of alternative approaches:
We require an exclusive lock on the collection to avoid phantom reads, and this is an operation that happens as a result of a client interfacing with the system via a UI, so creating an entirely new sequential job pipeline to run exclusive transactions is overkill and may not even work for certain client operations without a large change to how we communicate with those clients. We don't foresee any scaling problems given the circumstances in which these exclusive transactions are used.
On the subject of server-side transactions:
The server-side version of our transactions works fine and is incredibly fast, well within tolerance, and as expected given how simple our queries are: mostly reads and usually only 1 write. The pain this issue is causing us is an inability to re-use code: we're living in TypeScript land over here, with a considerable amount of pre-existing code for querying and transforming data. Not only do we lose that with server-side transactions, they're also effectively a second copy of some elements of our business logic, so we have to maintain two separate versions of the same logic that are implicitly tied together.
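For readers following along, the server-side (JS) transaction alternative under discussion looks roughly like this (a sketch; the collection name and logic are illustrative). The action is a string of JavaScript executed inside ArangoDB, which is exactly why existing TypeScript code can't be reused there:

const result = await db.executeTransaction(
  { exclusive: ["test"] },
  `function (params) {
    // this runs inside the database server, duplicating client-side logic
    const { db } = require("@arangodb");
    db.test.save({ enabled: params.enabled });
    return true;
  }`,
  { params: { enabled: true } }
);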
@radicaled There's a quirk in arangojs at the moment whereby the connection pool is actually always 2x the number of sockets. The idea behind this is to increase throughput when using keepalive so sockets don't go idle when there are still requests in the queue. I believe that this may be causing problems when using multiple exclusive transactions side by side, as steps associated with a transaction that has already started may get stuck behind other transactions waiting for that one to complete, creating a deadlock.
I'm going to look into whether disabling this behavior reduces the problem. However, depending on how long your transactions run and how frequently you start new ones, you may still run into deadlocks, e.g.:
Let's say you have 8 sockets and 10 transactions, so 8 of the 10 begin requests hit the connection pool at roughly the same time. The first transaction's begin completes, freeing that socket. But because you still have 2 more begin requests in the queue for the connection pool to process, that socket will immediately be filled by one of them, leaving you unable to send any other requests until one or more of the begin requests time out.
There's no way to completely avoid this problem on your end other than ensuring maxSockets is always greater than the number of concurrently requested transactions in your application.
I have an idea for how arangojs might accommodate this but it would require making some assumptions that will not always hold depending on how arangojs is used, so it would probably have to be opt-in.
@pluma that will be interesting to see, but from a purely theoretical perspective, I still can't see how that explains being able to encounter this issue with a single user performing one or two transactional operations at a time, as both @radicaled and I have separately encountered.
@oliverlockwood It would explain the scenario for two transactional operations at a time if there are other unrelated requests being processed at the same time. I'm unable to reproduce the problem of a streaming transaction simply failing in a round robin setup in isolation (i.e. with no other transaction getting in the way).
@pluma
Thanks for your assessment!
I noticed in the documentation that we could disable keepAlive, and so I did, while also bumping maxSockets to a value of 40. According to the documentation, that should mean there are only 40 available connections in the pool, and they all become occupied by 1 transaction each in the test case, correct? If I understood correctly, this shouldn't trigger the problematic quirk; forgive me if I haven't understood correctly.
I used this initialization:
const db = require("arangojs")({
  url: "http://localhost:8529",
  agentOptions: {
    maxSockets: 40,
    keepAlive: false, // only allow 40 connections for the 40 transactions
    keepAliveMsecs: 1000 // default from docs
  },
});
Output after 5 minutes, which hasn't changed:
19791241340076 begun
19791241340080 begun
19791241340084 begun
19791241340088 begun
19791241340092 begun
19791241340096 begun
19791241340100 begun
19791241340104 begun
19791241340108 begun
19791241340112 begun
19791241340116 begun
19791241340120 begun
19791241340124 begun
19791241340128 begun
19791241340132 begun
19791241340136 begun
19791241340140 begun
19791241340144 begun
19791241340148 begun
19791241340152 begun
19791241340156 begun
19791241340160 begun
19791241340164 begun
19791241340168 begun
19791241340172 begun
19791241340176 begun
19791241340180 begun
19791241340184 begun
19791241340188 begun
19791241340192 begun
19791241340196 begun
19791241340200 begun
19791241340204 begun
19791241340208 begun
19791241340212 begun
19791241340216 begun
19791241340220 begun
19791241340224 begun
19791241340076 committed
19791241340228 begun
19791241340232 begun
NB: while I was writing this reply, the console updated with these lines:
19791241340080 failed to commit after 240128 ms: ArangoError: timeout in cluster operation
20890752967836 failed to commit after 420808 ms: ArangoError: timeout in cluster operation
(As you can tell, I took my time writing up this reply... :) )
Did I misunderstand, or is dealing with the issue something that requires modifications to arangojs internals for now?
FYI: thanks for being attentive to this issue!
@radicaled I was able to solve the deadlock problem via the idea I hinted at. Namely, I've introduced a priority queue for transaction-related requests enqueued while the main queue is at capacity. In order to ensure there's always room for these priority requests, arangojs now reserves at least one socket (unless agentOptions.maxTotalSockets is set to 1) for priority requests. This change is currently in a dedicated branch and not in the main branch.
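Conceptually, the reserved socket plus priority queue works something like this (a sketch of the idea, not the actual arangojs code):

type QueuedTask = { priority: boolean; run: () => Promise<void> };

class Pool {
  private queue: QueuedTask[] = [];
  private active = 0;
  constructor(private maxSockets: number) {}

  enqueue(task: QueuedTask): void {
    // transaction-related requests jump ahead of ordinary requests
    if (task.priority) this.queue.unshift(task);
    else this.queue.push(task);
    this.drain();
  }

  private drain(): void {
    while (this.queue.length > 0) {
      const next = this.queue[0];
      // ordinary requests may only use maxSockets - 1 sockets, so a flood
      // of begin requests can never starve an already-running transaction
      const limit = next.priority ? this.maxSockets : this.maxSockets - 1;
      if (this.active >= limit) break;
      this.queue.shift();
      this.active++;
      next.run().finally(() => {
        this.active--;
        this.drain();
      });
    }
  }
}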
This change results in the test passing in standalone mode as well as when using round-robin. However, in round-robin we were able to observe that every coordinator returns the first begin, i.e. there will always be three conflicting transactions running in parallel. While the tests did initially pass, we were able to provoke a failure by introducing a trivial delay in each transaction, resulting in timeouts and dropped transactions as expected.
The ArangoDB core team is investigating this behavior. I was able to reproduce it for N coordinators (where N > 1) in round-robin with as few as 2N-1 concurrent transactions (e.g. 5 concurrent transactions against 3 coordinators). This does not explain why individual transactions would randomly fail when there are no other concurrent transactions, but it suggests there is a problem with the behavior of exclusive transactions in multi-coordinator scenarios.
Okay, after discussing this issue with the core team it seems the priority queue was a bit of a red herring. Increasing maxSockets makes the problem go away. Also, transactions work differently in a cluster, as documented.
In other words, there seem to be two problems here:
1. The maxSockets handling resulted in begin requests blocking requests related to other transactions; this is a bug in arangojs and will be addressed in the next minor release.
2. agentOptions.maxSockets needs to be higher if you are using arangojs with a cluster in round robin mode and anticipate multiple parallel transactions. This is user-actionable and the problem will be documented in the general troubleshooting section of the arangojs documentation.
I'm leaving this issue open until those two changes have been made.
I've merged the concurrency tests into main and deleted the test branch.
FYI the change also increases the default value for maxSockets when using ROUND_ROBIN with an array of server URLs from 3 to url.length * 3. This should at least solve the footgun of not changing the default in this scenario for low numbers of concurrent transactions. We still recommend setting this value explicitly, though.
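Illustrated with hypothetical coordinator URLs, the new default works out as follows:

import { Database } from "arangojs";

// with three server URLs and ROUND_ROBIN, maxSockets now defaults to
// url.length * 3 = 9, instead of the old flat default of 3
const db = new Database({
  url: [
    "http://coordinator1:8529",
    "http://coordinator2:8529",
    "http://coordinator3:8529",
  ],
  loadBalancingStrategy: "ROUND_ROBIN",
});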
arangojs 7.3.0 has been released, which removes the keepAlive-specific optimization and changes the maxSockets default for round robin load balancing.
Hi, I'm using "arangojs": "^7.3.0" (ArangoDB 3.7.2-1 [win64] 64bit). I have the same problem (ArangoError: transaction '4039091' not found), but only if I put in a delay >= 10 secs. What am I doing wrong? Thank you in advance.
$ node app 10
var args = process.argv.slice(2);
var sec = args && args.length > 0 ? parseInt(args[0]) : 5;
console.log("start with: " + sec + " sec");

var Database = require('arangojs').Database;
const database = new Database({
  url: "http://127.0.0.1:8529"
});
database.useBasicAuth('username', 'password');
var db = database.database('database-name');
var col = db.collection('collection-name'); // was missing in the original snippet
var sleep = require('sleep');

try {
  call_fai(sec)
    .then(a => {
      console.log("end: " + JSON.stringify(a));
    });
}
catch (err) {
  console.log("error: " + err);
}

function call_fai(sec) {
  return new Promise(function (resolve, reject) {
    db.beginTransaction(
      {
        exclusive: ['collection-name']
      }
    )
      .then(async trx => {
        sleep.sleep(sec); // blocks for `sec` seconds before the first step
        await trx.step(() => col.save({ hello: "world" }));
        const result = await trx.abort();
        resolve(result);
      });
  });
}
@fangeli Transactions can and will time out. I don't know the default transaction/lock timeout off the top of my head, but I'm willing to bet that 10 seconds of idle time exceeds it (this was also the duration for which we would see transactions get "stuck" in the above tests when the number of sockets was too low, before the preemptive keepalive pipelining was removed in the recent arangojs release). "Transaction not found" generally means the transaction has already been committed/aborted, that it wasn't started in the first place, or that it has timed out and been disposed (which seems to be the case in your example).
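If a transaction legitimately needs to wait longer for an exclusive lock, the transaction options accept a lockTimeout (in seconds) that can be raised when beginning the transaction. A sketch (note this only extends the lock wait; it does not keep an idle transaction alive):

const trx = await db.beginTransaction(
  { exclusive: ['collection-name'] },
  { lockTimeout: 60 } // seconds to wait for collection locks; illustrative value
);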
On a sidenote: if I understand the cluster team correctly, one noteworthy difference in cluster scenarios is that transaction locks will only be acquired when the transaction actually starts using those collections or documents, not when begin returns. This means timeouts can behave slightly differently in cluster and standalone scenarios, but in either case you will see timeouts if your transactions hang around doing nothing for excessive periods of time.
Currently running with arangojs version 6.14.0, a 3-node cluster, and loadBalancingStrategy: 'ROUND_ROBIN'. (Aside: yes, we are in the process of looking to upgrade to arangojs version 7.x.)
We have a part of our software which makes use of ArangoDB transactions, like:
This works some of the time, but sometimes also throws errors like this:
It seems to me that what is likely happening is:
This seems to me like a fundamental issue with either ArangoDB itself or the ArangoJS library. As far as I can tell, the only options we have right now are:
I'd love to learn of a different way of dealing with this - so I look forward to hearing your thoughts. Many thanks for taking the time to read this issue.