arangodb-helper / arangodb

ArangoDB Starter - starts ArangoDB clusters & single servers with ease.
Apache License 2.0

How to remove a node from the cluster? #76

Open NeillCain-zz opened 7 years ago

NeillCain-zz commented 7 years ago

Hi, we've been trialing this and are currently on version 0.5.0. We had an issue with the underlying data drive on one of the nodes that resulted in us having to take that node down. We had replication on all shards set to 2, which we thought in theory would allow us to provision a new node and re-balance the shards without data loss. We have not had success with the re-balance and assume this is because the node is still present on the web UI overview page, albeit in an errored state; both the DbServer and the coordinator for that node are listed as SHUTDOWN. Another problem is that when we provisioned the new node on the cluster, a new agent was not provisioned, presumably because the old one was still 'registered'?

Our question is, how do we remove the erroneous node to allow us to register an agent etc from our new one and re-balance our shard replica across to the new node?

Here is a dump of the arangod.log file from the agent on the lead node:

2017-08-08T11:15:27Z [2872] ERROR {cluster} cannot create connection to server '' at endpoint 'tcp://10.16.1.6:4001'
2017-08-08T11:15:28Z [2872] INFO {supervision} Precondition failed for starting job 12520001

Note that this is continually retrying and has CPU pegged at ~80%

Hoping for some guidance here should this issue arise in the future. We intend to start from scratch on a new cluster setup.

Thanks

ewoutp commented 7 years ago

Currently the starter does not play well when it comes to removing a node. We're currently working on (and expect to release soon) an update that will enable you to remove a node completely as long as that node does not contain an agent. (Arangod itself currently does not support removing agents, but this is also a known issue that we want to solve asap).

This update will add a REST API for removing a node. Once called, the starters will work together to drain and remove the dbserver and to remove the coordinator from the cluster.

dashdeep13 commented 7 years ago

I am also facing the same issues using the starter 0.9.0 with Arango 3.2. A fix/resolution strategy for this would be great!

ewoutp commented 7 years ago

The latest starter version already has support for removing a node that does not have an agent on it.

When using the Shutdown method (see https://github.com/arangodb-helper/arangodb/blob/master/client/api.go#L43) with goodbye=true, the arangod servers will be removed from the cluster in a controlled manner. This means that a dbserver is first drained and then removed, and a coordinator is likewise removed from the cluster cleanly.

I'm leaving this issue open, because you cannot remove a node (in a controlled manner) that has an agent on it. The reason is that the API to make an agent leave the agency in a controlled manner does not yet exist in arangod.
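
A minimal sketch of what that controlled removal looks like over the starter's HTTP API. The default port 8528 and the mode=goodbye query parameter are assumptions based on client/api.go and may differ between starter versions; the request is only printed here, not executed:

```shell
# Sketch: ask a starter to shut down and say goodbye, i.e. remove its
# servers from the cluster in a controlled way (drain dbserver, remove
# coordinator). Address and query parameter are assumptions.
STARTER="http://localhost:8528"   # hypothetical address of the starter to remove
URL="${STARTER}/shutdown?mode=goodbye"

echo "POST ${URL}"
# Uncomment to actually perform the controlled shutdown and removal:
# curl -X POST "${URL}"
```

Note that, per the comment above, this only works when the node does not run an agent.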

dashdeep13 commented 7 years ago

Hi @ewoutp Thank you for your reply. I understand that it is possible to remove a node when it is functioning properly. The issue arises when the node is dead: the agents keep pinging it and we get unnecessary logs. Even after removing the "dead" nodes from the UI, the initial cluster node keeps pinging the dead one. After restarting the cluster (having removed the node from the UI), the agency endpoint that returns the coordinators has an '' entry (as a replacement for the dead coordinator node) and the UI shows an empty DB server (in my case the dead node was both a coordinator and a dbserver, so I am mentioning both issues). Is there a way/playbook to go through to remove dead nodes from the cluster?

ewoutp commented 7 years ago

@dashdeep13 you're correct, currently the starter can help you to remove a node when it is still alive and well.

What you're talking about is a cleanup scenario after a node is dead and beyond repair. In this case we'll likely need a force flag on the goodbye request so it ignores failures when trying to talk to the servers on the broken node. I'll discuss this and get back on it.

dashdeep13 commented 7 years ago

@ewoutp Thanks! I feel that if the node is down, it would be difficult to serve a request on that box (even with a force flag :O ). Please let me know regarding this. I am kind of blocked on the dead node, especially the agent one. I am also curious that in a cluster, since node failures are expected, the agent breaking would be a P1 and would require immediate attention :O . Hopefully we have a fix for this soon :)

ewoutp commented 7 years ago

@dashdeep13 You can use the following HTTP API on one of the coordinators to remove a dead server.

curl http[s]://coordinator-ip/_admin/cluster/removeServer -d '"SERVER_ID"'

You can find the SERVER_ID using the GET /_admin/cluster/health API.
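
To illustrate finding a candidate SERVER_ID, here is a sketch that parses a hand-made sample of the health response offline (the server IDs and the exact response shape are assumptions); against a live cluster you would feed it the actual output of curl on /_admin/cluster/health:

```shell
# Hand-made sample of a /_admin/cluster/health response (shape assumed);
# on a real cluster, replace the variable with:
#   HEALTH=$(curl -s "$COORDINATOR/_admin/cluster/health")
HEALTH='{"Health":{"PRMR-1234":{"Status":"GOOD"},"PRMR-5678":{"Status":"FAILED"}}}'

# Pick the IDs of servers that are not reported as GOOD.
FAILED=$(echo "$HEALTH" | python3 -c '
import json, sys
health = json.load(sys.stdin)["Health"]
print("\n".join(k for k, v in health.items() if v.get("Status") != "GOOD"))
')
echo "candidates for removeServer: $FAILED"

# Then, for each candidate (note the JSON-encoded string body):
# curl -X POST "$COORDINATOR/_admin/cluster/removeServer" -d "\"$FAILED\""
```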

dashdeep13 commented 7 years ago

Thanks! Will try this and get back to you. I had managed to dig out other endpoints from Slack:

curl -XDELETE <host>:<port>/_admin/shutdown?remove_from_cluster=1
curl <host>:<port>/_admin/cluster/cleanOutServer -d '{"server":"DBServer000X"}'

Unfortunately the above ones did not have any impact on the cluster :( Will try the one that you mentioned :)

seansilvestri commented 7 years ago

@ewoutp I just ran the 'removeServer' endpoint for a failed coordinator and dbserver node. I'm still seeing exactly the same behavior that I saw when I deleted the failed nodes via the UI. The coordinator removes cleanly, but the dbserver node tab now shows an additional empty entry in the node list (after a re-boot!). Additionally, and quite annoyingly, when I select a new server to move my shards to, it shows an entry with the ID of the missing dbserver. Is there a way to cleanly remove a failed dbserver node? I'm assuming that behind the scenes this node is still being polled and filling up my logs...

I'm also curious why the setup.json does not get updated when the node is removed. It seems to me that leaving a reference to a node that no longer exists would cause it to still be pinged...

dashdeep13 commented 7 years ago

@ewoutp That did not work for me either :(. Here is what I did:

  1. Brought down a coordinator+dbserver node.
  2. Started seeing node health errors for the dbserver and coordinator.
  3. Removed the nodes via the UI delete button (also tried removing them through the endpoint you mentioned; I get a value of true in response, but the nodes are still there).
  4. Still getting "coordinator not reachable" logs in the initial coordinator.
  5. Stopped the starter, removed setup.json, started the starter.
  6. The coordinator error logs disappear, but in the UI an empty DBServer is seen. When selecting shards I still see the id of the dead dbserver.
  7. The endpoint to retrieve all coordinator endpoints has an entry '' for one of the coordinators.
  8. Ran the _admin/cluster/health endpoint and see an entry for the dead DBServer; it says Status: GOOD, canBeDeleted: false, LastHeartbeatStatus: "".

dashdeep13 commented 6 years ago

Here are my findings. There are two issues which I face:

a) Coordinator: whether I remove the "dead" coordinator from the UI or via the removeServer endpoint, it gets removed from the UI and from the output of /_admin/cluster/health. However, we still see the node in the output of /_api/cluster/endpoints, and the coordinator on the initial arangodb starter node still keeps getting "Cannot connect" logs for the dead coordinator.

b) DbServer: whether I remove the "dead" dbserver from the UI or via the removeServer endpoint, it gets removed from the UI and from the output of /_admin/cluster/health.

When I restart the arangodb starter after removing its setup.json (in an attempt to stop the polling of the dead coordinator), the coordinator logs stop, but /_api/cluster/endpoints returns an entry with an empty endpoint '' and the Shards view still offers the dead dbserver as a target.

Is there a backoff time after which the dead coordinator will no longer be contacted? And does removing setup.json and restarting put the starters into a corrupted state, i.e. showing empty endpoints?
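
The stale empty-endpoint entry described in this thread can at least be filtered out client-side. A sketch against a hand-made sample of the /_api/cluster/endpoints response (the response shape and the address are assumptions):

```shell
# Hand-made sample of /_api/cluster/endpoints containing the stale ''
# entry reported above (response shape assumed); on a real cluster:
#   ENDPOINTS=$(curl -s "$COORDINATOR/_api/cluster/endpoints")
ENDPOINTS='{"endpoints":[{"endpoint":"tcp://10.16.1.5:8529"},{"endpoint":""}]}'

# Keep only entries with a non-empty endpoint.
LIVE=$(echo "$ENDPOINTS" | python3 -c '
import json, sys
eps = json.load(sys.stdin)["endpoints"]
print("\n".join(e["endpoint"] for e in eps if e.get("endpoint")))
')
echo "live coordinators: $LIVE"
```

This only hides the symptom on the client side; the stale entry in the agency itself is the bug under investigation here.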

tmd313 commented 6 years ago

Is there still no way to remove a defunct/destroyed agent node? When I try to replace it at the same IP, the other agency/agent nodes 'see' its config and tell me I can't have a peer in the same directory. It would appear that it is being 'remembered' by the agency:

2017/10/17 19:23:31 Cannot start because of HTTP error from master: code=400, message=Cannot use same directory as peer.: bad request

Am I without recourse when it comes to replacing a failed agent? If so, I believe I could, in most cases, restore the node via its original identity along with whatever files it had at the time, and could otherwise address the missing cases.

ewoutp commented 6 years ago

@tmd313 Removing an agent is still not possible. The good news is that we're very close to merging this feature (into arangod) and it will be in there soon.

ewoutp commented 6 years ago

@dashdeep13 First of all sorry that this is taking so long.

I've managed to reproduce the failure to remove a dead dbserver or coordinator from the cluster (both using curl against the APIs and using the GUI). We're investigating deeper in arangod to find out what is going on.

Hope to have a more useful answer for you soon.

phouverneyuff commented 6 years ago

When I remove one server, I'm getting the following message:

... :/_admin/cluster/removeServer -d '"SERVER_ID"'
{"error":true,"code":400,"errorNum":400,"errorMessage":"unhandled role undefined"}

dashdeep13 commented 6 years ago

@phouverneyuff This is an arangodb-level issue: https://github.com/arangodb/arangodb/blob/3.3/js/actions/api-cluster.js#L77. We are not getting a match on the Role (the value returned back is role, lowercase), which is why we are not able to get the actual value of the role that is present.

alejom99 commented 6 years ago

I'm having the same problem, any updates on this? thanks

sebastienpattyn93 commented 6 years ago

Any updates on this issue @ewoutp ? I'm having the problem that I can't expand my cluster, because a node that was removed before was using the same internal IP as the node I'm adding right now. It gives me the following error as well:

Cannot start because of HTTP error from master: code=400, message=Cannot use same directory as peer.: bad request

ewoutp commented 6 years ago

@sebastienpattyn93 for now you have 2 options:

1) Use a different IP address for the starter (the machine the starter is running on)
2) Use a different data directory

We're looking into improving the starter API to deal with these cases better.
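
A sketch of option 2, restarting the starter with a fresh data directory so it no longer collides with the removed peer's state. The flag names here are assumptions (newer starters use --starter.data-dir and --starter.join; older releases used --dataDir and --join, so check arangodb --help for your build), and the join address is hypothetical; the command is only printed, not executed:

```shell
# Sketch: start the starter against a fresh, empty data directory.
# Flag names and join address are assumptions; verify with: arangodb --help
NEW_DIR="./arangodb-data-new"
mkdir -p "$NEW_DIR"

CMD="arangodb --starter.data-dir=${NEW_DIR} --starter.join=10.16.1.5"
echo "$CMD"
# Uncomment to actually start the starter:
# $CMD
```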

sebastienpattyn93 commented 6 years ago

@ewoutp Using a different data directory fixed it for the moment, although the ArangoDB starter is no longer started on the same port (8533 instead of 8528). Thank you for the workaround.