Closed mikestaub closed 2 years ago
Hi, if you make this a PR to arangojs, it will become part of the nightly tests that the CI runs. Thanks for digging deeper into this.
@dothebart what 3 URIs should I use for the Oasis cluster in my PR?
@pluma can you give a hint for this?
@dothebart I've added support for passing multiple URLs with commas via ab866b0. Check the changes to CONTRIBUTING.md in particular.
Note that this will result in acquireHostsList being called, which in my case returns IPv6 URLs which won't be deduplicated if you use an alias like localhost. This also means you can just append a single comma to your TEST_ARANGODB_URL to opt into cluster mode.
Cluster mode always enables round robin.
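As a rough illustration of the comma convention described above, the sketch below splits a URL setting on commas and treats the mere presence of a comma as the opt-in to cluster mode. The name parseTestUrls and the returned shape are hypothetical; this is not the actual arangojs test harness code, only an assumption about how such parsing could look.

```typescript
// Hypothetical helper, NOT taken from the arangojs test suite.
interface TestTarget {
  urls: string[]; // non-empty URLs left after splitting on commas
  clusterMode: boolean; // true as soon as the raw value contains a comma
}

function parseTestUrls(raw: string): TestTarget {
  const clusterMode = raw.includes(",");
  const urls = raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  return { urls, clusterMode };
}

// A single URL with a trailing comma still opts into cluster mode.
const single = parseTestUrls("http://localhost:8529,");
// Three coordinators, using the host names from the docker-compose setup.
const three = parseTestUrls(
  "http://arangodb-coordinator1:8529,http://arangodb-coordinator2:8529,http://arangodb-coordinator3:8529"
);
console.log(single, three);
```

With this shape, a plain URL without any comma would stay in single-server mode, which matches the opt-in behaviour described in the comment above.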
Hi @mikestaub ,
I hope you are doing well. Alan and Willi are working on extending the automatic testing to catch these issues more easily.
What is the current status? Is it blocking you from moving to 3.7 or did you work around it?
best Frank
@fceller this is blocking me from upgrading to 3.7 but it is not urgent as 3.6 is working well.
Hm, simply running the tests with 3 coordinators doesn't reproduce this. @mikestaub can you share a bit more detail on the environment in which you're running into this?
@dothebart here is the docker-compose.yml file I am using:
version: "3"
services:
  nginx:
    image: nginx:1.17.9
    container_name: arangodb-proxy
    depends_on:
      - arangodb-coordinator1
      - arangodb-coordinator2
      - arangodb-coordinator3
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 8529:80
  arangodb-coordinator1:
    restart: on-failure
    container_name: arangodb-coordinator1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator1:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator1:/var/lib/arangodb3
  arangodb-coordinator2:
    restart: on-failure
    container_name: arangodb-coordinator2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator2:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator2:/var/lib/arangodb3
  arangodb-coordinator3:
    restart: on-failure
    container_name: arangodb-coordinator3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator3:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator3:/var/lib/arangodb3
  arangodb-agency1:
    restart: on-failure
    container_name: arangodb-agency1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --foxx.queues false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    volumes:
      - arangodb-agency1:/var/lib/arangodb3
  arangodb-agency2:
    restart: on-failure
    container_name: arangodb-agency2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency2:/var/lib/arangodb3
  arangodb-agency3:
    restart: on-failure
    container_name: arangodb-agency3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency3:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency3:/var/lib/arangodb3
  arangodb-dbserver1:
    restart: on-failure
    container_name: arangodb-dbserver1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver1:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary1
    volumes:
      - arangodb-dbserver1:/var/lib/arangodb3
  arangodb-dbserver2:
    restart: on-failure
    container_name: arangodb-dbserver2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver2:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary2
    volumes:
      - arangodb-dbserver2:/var/lib/arangodb3
  arangodb-dbserver3:
    restart: on-failure
    container_name: arangodb-dbserver3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver3:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary3
    volumes:
      - arangodb-dbserver3:/var/lib/arangodb3
volumes:
  arangodb-agency1:
  arangodb-agency2:
  arangodb-agency3:
  arangodb-dbserver1:
  arangodb-dbserver2:
  arangodb-dbserver3:
  arangodb-coordinator1:
  arangodb-coordinator2:
  arangodb-coordinator3:
this seems to be missing the nginx config file?
this is the nginx.conf
worker_processes 1;
events {
  worker_connections 1024;
}
http {
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }
  server {
    listen 80;
    location / {
      proxy_pass http://arangodb-servers;
    }
  }
}
Ok,
@ajanikow was able to narrow down the actual cause.
nginx in HTTP proxy mode attempts to "fix" empty PUT requests, which are used for cursors.
Since the coordinators later need to forward the request, they fail to do so correctly.
The immediate fix is to configure nginx to use TCP proxying instead of HTTP proxying by swapping the http section for:
stream {
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }
  server {
    listen 80;
    proxy_pass arangodb-servers;
  }
}
We will dig deeper on the real reason later.
@dothebart any updates on the root cause? I am seeing these errors in Oasis on v3.7.5 as I assume envoy uses TCP not HTTP.
Hi, sorry, I have been busy with release QA. Hm, @ajanikow assured me that Oasis shouldn't have these issues?
Hello!
The "cluster internal HTTP connection broken" problem in Oasis may be related to something different. Only TCP forwarding is used at all levels, so an issue caused by an invalid body should not occur.
Can you create an Oasis issue? Then we will be able to look at your deployment (we will check the internal reason).
Best Regards, Adam.
Here is the Oasis issue: https://arangodb.atlassian.net/servicedesk/customer/portal/13/OASIS-418
@ajanikow I think the long-term solution is to provide a way to run the Oasis cluster locally so I can run my integration tests against it and be confident it will work once deployed, either with a docker-compose file or k8s Helm charts.
Ok, the current situation is that ArangoDB will forward all [most] HTTP headers that it gets from one coordinator to the one that owns the cursor.
In your setup this also includes connection: close. This starts an unwanted chain of reactions, which in current devel somewhere further down the road doesn't lead to the actual error we see.
However, not forwarding the connection header in the first place (since the cluster should use connection keep-alive for performance reasons) fixes this problem, without fixing the final point where the error occurs.
This bugfix is going to be part of the upcoming 3.7.6 release.
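To make the idea of "not forwarding the connection header" concrete, here is a minimal sketch of dropping that one header before a request is passed on. The actual change was made inside the ArangoDB server, not in TypeScript; the function name withoutConnectionHeader and the plain-object header representation are assumptions for illustration only.

```typescript
// Hypothetical illustration only; this is not ArangoDB's implementation.
type Headers = Record<string, string>;

function withoutConnectionHeader(incoming: Headers): Headers {
  const forwarded: Headers = {};
  for (const [name, value] of Object.entries(incoming)) {
    // HTTP header names are case-insensitive, so compare in lower case.
    if (name.toLowerCase() === "connection") continue;
    forwarded[name] = value;
  }
  return forwarded;
}

// A request that arrived with "connection: close" (for example via a proxy)
// is forwarded without it, so the internal connection can stay keep-alive.
const forwarded = withoutConnectionHeader({
  Connection: "close",
  "Content-Type": "application/json",
});
console.log(forwarded);
```

The connection header is hop-by-hop by definition, so dropping it when relaying a request between coordinators does not change what the original client asked for.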
@dothebart great, thanks for tracking this down. In the meantime can I manually remove that header from the requests being sent by arangojs? Do you have an ETA for when 3.7.6 will be available on Oasis?
At least in your test case this header is added by the nginx proxy. As @ajanikow pointed out, using it in TCP mode also prevents the situation from arising.
The problem is the PUT request without a body, which makes nginx fall back to HTTP/1.0, which in turn implies connection: close.
As @ajanikow also said, Oasis should not have nginx in HTTP mode either, so if there are more problems, they aren't the same as the docker-compose ones.
I think it might also be a timing issue in the arangojs task queue. After adding this to my Database config, the issue disappeared:
agentOptions: {
  keepAlive: true,
  keepAliveMsecs: 50000,
  maxSockets: 1, // TODO: remove this
},
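For context, this is roughly how those options might sit in a complete connection config. The ConnectionConfig type is a local stand-in so the sketch compiles without arangojs installed, and the URL is an assumption based on the nginx port mapping in the docker-compose file above; in application code the resulting object would be passed to the arangojs Database constructor.

```typescript
// Local stand-in types; assumed to mirror the relevant part of the
// arangojs config, which hands agentOptions to Node's HTTP agent.
interface AgentOptions {
  keepAlive: boolean; // reuse sockets between requests
  keepAliveMsecs: number; // initial delay for TCP keep-alive packets, in ms
  maxSockets: number; // upper bound on concurrent sockets per host
}

interface ConnectionConfig {
  url: string;
  agentOptions: AgentOptions;
}

function buildConfig(url: string): ConnectionConfig {
  return {
    url,
    agentOptions: {
      keepAlive: true,
      keepAliveMsecs: 50000,
      // With a single socket, requests cannot overlap on the wire,
      // which is the workaround described in the comment above.
      maxSockets: 1,
    },
  };
}

// Assumed URL: the nginx proxy published on port 8529 by docker-compose.
const config = buildConfig("http://localhost:8529");
console.log(config);
```

Limiting the agent to one socket trades throughput for strict ordering, which is presumably why the original snippet carries a TODO to remove it.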
Hi, Happy New Year ;) It's still unclear to me how and why you would be hit by this. Can we have some more details on the overall environment? How and from where do you connect to the Oasis cluster? Is this your local workstation, and is there maybe a transparent proxy in the way (company network or telco provider)? What's the TCP traceroute? Does the instance live near you?
If it's all that, is the issue reproducible if you use a cloud VM near your Oasis cluster?
Happy new year!
The error was still appearing in my local env ( not Oasis ), even with TCP routing enabled. You should be able to reproduce it with that docker-compose file. I assume that docker-compose file is a good approximation of an Oasis setup. I saw the same errors when running on lambda connecting to Oasis.
I actually think it may be a bug in arangojs as it might be firing the HTTP requests in the wrong order ( commit transaction before it was created ).
I just confirmed, this issue is still present in 3.7.7
We can probably close this as a duplicate of https://github.com/arangodb/arangojs/issues/702#issuecomment-791723325, since that issue now contains a more precise description and test cases of the actual ongoing behaviour. WDYT?
arangojs 7.3.0 changes behavior related to maxSockets so you may want to try with this version in either case.
I still have to set maxSockets=1 with arangojs@7.5.0 and arangodb@3.7.7
Hi, since all of our changes to improve this haven't finally resolved this situation yet, can you please open a Jira issue with maybe a code sample reproducing this? Please add a reference to this github issue as well.
I'm closing this due to inactivity. Please follow the directions provided above if the problem still persists.
This branch was working on 3.6.3, but is now failing on 3.7.3
https://github.com/mikestaub/arangojs/pull/1/files
Would it be possible to include a default Oasis cluster in the integration tests so these types of regressions could be caught earlier?