arangodb / arangojs

The official ArangoDB JavaScript driver.
https://arangodb.github.io/arangojs
Apache License 2.0

3.7 streaming transactions: ArangoError: cluster internal HTTP connection broken #699

Closed: mikestaub closed this issue 2 years ago

mikestaub commented 3 years ago

This branch was working on 3.6.3, but is now failing on 3.7.3

https://github.com/mikestaub/arangojs/pull/1/files

Would it be possible to include a default Oasis cluster in the integration tests so these types of regressions could be caught earlier?
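
For context, the failing branch exercises streaming transactions. A minimal sketch of that pattern with arangojs follows (the collection name and document are placeholders, not taken from the linked PR):

import { Database } from "arangojs";

const db = new Database({ url: "http://localhost:8529" });

async function main() {
  const users = db.collection("users");
  // Streaming transaction: begin on the server, run steps, commit explicitly.
  const trx = await db.beginTransaction({ write: [users] });
  await trx.step(() => users.save({ name: "example" }));
  await trx.commit();
}

main().catch(console.error);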

dothebart commented 3 years ago

Hi, if you make this a PR to arangojs, it will become part of the nightly tests that the CI runs. Thanks for digging deeper into this.

mikestaub commented 3 years ago

@dothebart what 3 URIs should I use for the Oasis cluster in my PR?

dothebart commented 3 years ago

@pluma can you give a hint for this?

pluma commented 3 years ago

@dothebart I've added support for passing multiple URLs with commas via ab866b0. Check the changes to CONTRIBUTING.md in particular.

Note that this will result in acquireHostsList being called, which in my case returns IPv6 URLs that won't be deduplicated if you use an alias like localhost. This also means you can just append a single comma to your TEST_ARANGODB_URL to opt into cluster mode.

Cluster mode always enables round robin.
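
For illustration, a minimal sketch of the driver-side equivalent: passing multiple coordinator URLs to arangojs and rotating between them (the hostnames are assumptions matching the docker-compose setup below):

import { Database } from "arangojs";

// Connect to all three coordinators; ROUND_ROBIN rotates requests across them.
const db = new Database({
  url: [
    "http://arangodb-coordinator1:8529",
    "http://arangodb-coordinator2:8529",
    "http://arangodb-coordinator3:8529",
  ],
  loadBalancingStrategy: "ROUND_ROBIN",
});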

fceller commented 3 years ago

Hi @mikestaub ,

I hope you are doing well. Alan and Willi are working on extending the automatic testing to catch these issues more easily.

What is the current status? Is it blocking you from moving to 3.7 or did you work around it?

best Frank

mikestaub commented 3 years ago

@fceller this is blocking me from upgrading to 3.7 but it is not urgent as 3.6 is working well.

dothebart commented 3 years ago

Hm, running the tests with 3 coordinators just doesn't reproduce this. @mikestaub, can you share a bit more detail on the environment in which you're running into this?

mikestaub commented 3 years ago

@dothebart here is the docker-compose.yml file I am using:

version: "3"
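# Topology: nginx (host port 8529) load-balances three coordinators; three agents and three DB servers sit behind them.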

services:

  nginx:
    image: nginx:1.17.9
    container_name: arangodb-proxy
    depends_on:
      - arangodb-coordinator1
      - arangodb-coordinator2
      - arangodb-coordinator3
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 8529:80

  arangodb-coordinator1:
    restart: on-failure
    container_name: arangodb-coordinator1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator1:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator1:/var/lib/arangodb3

  arangodb-coordinator2:
    restart: on-failure
    container_name: arangodb-coordinator2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator2:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator2:/var/lib/arangodb3

  arangodb-coordinator3:
    restart: on-failure
    container_name: arangodb-coordinator3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator3:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator3:/var/lib/arangodb3

  arangodb-agency1:
    restart: on-failure
    container_name: arangodb-agency1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --foxx.queues false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    volumes:
      - arangodb-agency1:/var/lib/arangodb3

  arangodb-agency2:
    restart: on-failure
    container_name: arangodb-agency2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency2:/var/lib/arangodb3

  arangodb-agency3:
    restart: on-failure
    container_name: arangodb-agency3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency3:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency3:/var/lib/arangodb3

  arangodb-dbserver1:
    restart: on-failure
    container_name: arangodb-dbserver1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver1:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary1
    volumes:
      - arangodb-dbserver1:/var/lib/arangodb3

  arangodb-dbserver2:
    restart: on-failure
    container_name: arangodb-dbserver2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver2:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary2
    volumes:
      - arangodb-dbserver2:/var/lib/arangodb3

  arangodb-dbserver3:
    restart: on-failure
    container_name: arangodb-dbserver3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver3:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary3
    volumes:
      - arangodb-dbserver3:/var/lib/arangodb3

volumes:
  arangodb-agency1:
  arangodb-agency2:
  arangodb-agency3:
  arangodb-dbserver1:
  arangodb-dbserver2:
  arangodb-dbserver3:
  arangodb-coordinator1:
  arangodb-coordinator2:
  arangodb-coordinator3:

dothebart commented 3 years ago

this seems to be missing the nginx config file?

mikestaub commented 3 years ago

This is the nginx.conf:

worker_processes 1;

events {
  worker_connections 1024;
}

http {
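  # HTTP (layer 7) mode: nginx parses each request and may rewrite it before forwarding.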
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }

  server {
    listen 80;
    location / {
      proxy_pass http://arangodb-servers;
    }
  }
}

dothebart commented 3 years ago

OK, @ajanikow was able to narrow down the actual cause: nginx in HTTP proxy mode attempts to "fix" empty PUT requests, which are used for cursors. When the coordinators later need to forward such a request, they fail to do so correctly.
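
For context, the request shape in question: fetching the next batch of a cursor is a PUT with an empty body. A hedged illustration of that raw request (the cursor id is hypothetical):

// ArangoDB's cursor API reads the next batch via a body-less PUT:
//   PUT /_api/cursor/<cursor-id>
// In HTTP mode, nginx rewrites such requests, and the receiving
// coordinator then fails when forwarding them inside the cluster.
const res = await fetch("http://localhost:8529/_api/cursor/12345", {
  method: "PUT", // note: no request body
});
console.log(res.status);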

The immediately working fix is to configure nginx as a TCP proxy instead of an HTTP proxy by swapping the http section for:

stream {
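  # TCP (layer 4) mode: nginx forwards raw bytes and never rewrites requests.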
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }
  server {
    listen 80;
    proxy_pass arangodb-servers;
  }
}

We will dig deeper into the real reason later.

mikestaub commented 3 years ago

@dothebart any updates on the root cause? I am seeing these errors in Oasis on v3.7.5, even though I assume Envoy uses TCP, not HTTP.

dothebart commented 3 years ago

Hi, sorry, I have been busy with release QA. Hm, @ajanikow assured me that Oasis shouldn't have these issues?

ajanikow commented 3 years ago

Hello!

The "cluster internal HTTP connection broken" problem in Oasis can be related to something different. Only TCP forwarding is used at all levels, so an issue caused by an invalid body should not occur.

Can you create an Oasis issue? Then we will be able to look at your deployment (we will check the internal reason).

Best Regards, Adam.

mikestaub commented 3 years ago

Here is the Oasis issue: https://arangodb.atlassian.net/servicedesk/customer/portal/13/OASIS-418

mikestaub commented 3 years ago

@ajanikow I think the long-term solution is to provide a way to run the Oasis cluster locally, so I can run my integration tests against it and be confident it will work once deployed. Either with a docker-compose file or k8s Helm charts.

dothebart commented 3 years ago

OK, the current situation is that ArangoDB forwards all [most] HTTP headers it receives on one coordinator to the coordinator that owns the cursor. In your setup this includes connection: close. This starts an unwanted chain of reactions which, in the current devel branch, somewhere further down the road no longer leads to the actual error we see.

However, not forwarding the connection header in the first place (since the cluster should use connection: keep-alive internally for performance reasons) fixes this problem, without fixing the ultimate point where the error occurs.

This bugfix is going to be part of the upcoming 3.7.6 release.

mikestaub commented 3 years ago

@dothebart great, thanks for tracking this down. In the meantime, can I manually remove that header from the requests being sent by arangojs? Do you have an ETA for when 3.7.6 will be available on Oasis?

dothebart commented 3 years ago

At least in your test case this header is added by the nginx proxy. As @ajanikow pointed out, running it in TCP mode also prevents the situation from appearing. The problem is the PUT request without a body, which makes nginx fall back to HTTP/1.0, and HTTP/1.0 implies connection: close. As @ajanikow also said, Oasis should have no nginx in HTTP mode, so if there are more problems, they aren't similar to the docker-compose ones.

mikestaub commented 3 years ago

I think it might also be a timing issue in the arangojs task queue. After adding this to my Database config, the issue disappeared:

agentOptions: {
  keepAlive: true,
  keepAliveMsecs: 50000,
  maxSockets: 1, // TODO: remove this
},
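
For context, a sketch of how that workaround slots into a full arangojs Database config (the URL and surrounding values are assumptions, not from the original report):

import { Database } from "arangojs";

const db = new Database({
  url: "http://localhost:8529", // assumed: the nginx proxy endpoint
  agentOptions: {
    keepAlive: true,
    keepAliveMsecs: 50000,
    // A single socket serializes all requests, so a transaction commit
    // cannot overtake the begin-transaction request it depends on.
    maxSockets: 1,
  },
});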

dothebart commented 3 years ago

Hi, Happy New Year ;) It's still unclear to me how and why you should be hit by this. Can we have some more details on the overall environment? How and from where do you connect to the Oasis cluster? Is this your local workstation, and is there maybe a transparent proxy in the way (company network or telco provider)? What's the TCP traceroute? Does the instance live near you?

If it's all that, is the issue reproducible if you use a cloud VM near your Oasis cluster?

mikestaub commented 3 years ago

Happy new year!

The error was still appearing in my local environment (not Oasis), even with TCP routing enabled. You should be able to reproduce it with that docker-compose file; I assume it is a good approximation of an Oasis setup. I saw the same errors when running on Lambda connecting to Oasis.

I actually think it may be a bug in arangojs, as it might be firing the HTTP requests in the wrong order (committing the transaction before it was created).

mikestaub commented 3 years ago

I just confirmed that this issue is still present in 3.7.7.

dothebart commented 3 years ago

We can probably close this as a duplicate of https://github.com/arangodb/arangojs/issues/702#issuecomment-791723325, since that issue meanwhile contains a more precise description and test cases of the actual behaviour. WDYT?

pluma commented 3 years ago

arangojs 7.3.0 changes behavior related to maxSockets, so you may want to try this version in either case.

mikestaub commented 3 years ago

I still have to set maxSockets=1 with arangojs@7.5.0 and arangodb@3.7.7

dothebart commented 3 years ago

Hi, since all of our changes to improve this haven't fully resolved the situation yet, can you please open a Jira issue, ideally with a code sample reproducing this? Please add a reference to this GitHub issue as well.

pluma commented 2 years ago

I'm closing this due to inactivity. Please follow the directions provided above if the problem still persists.