Seagate / cortx-hare

CORTX Hare configures Motr object store, starts/stops Motr services, and notifies Motr of service and device faults.
https://github.com/Seagate/cortx
Apache License 2.0

WIP: Problem: node replacement procedure is not clear #1120

Closed typeundefined closed 4 years ago

typeundefined commented 4 years ago

Solution: document the current understanding and synchronize with the stakeholders.

ypise commented 4 years ago

I discussed this with @mandar.sawant; overall, these steps would be necessary for node replacement:

  1. Separate ees-ha and hare
  2. Generate single node CDF and bootstrap
  3. Configure components
  4. hctl shutdown
  5. HA recovery

ypise commented 4 years ago

That would be the idea.
So ideally we would want the files to be stored under /opt/seagate/cortx.

I discussed this with @mandar.sawant yesterday, and we concurred that a section in setup.yaml like the one below might help components publish their critical config files:

critical_configs:
  files:
    - file-1
    - file-2

This setup.yaml section would be read for each component by HA/Provisioner to learn which files are to be captured/restored.
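
As a sketch of how HA/Provisioner might consume such a section (assuming a `yq`-style YAML reader is available; the component layout and backup path below are hypothetical, not defined in this issue):

```bash
#!/bin/sh
# Sketch: collect each component's critical configs into a per-node backup dir.
# Paths are illustrative only; the setup.yaml locations are not decided here.
BACKUP_DIR=/opt/seagate/cortx/backup/$(uname -n)
mkdir -p "$BACKUP_DIR"

for setup in /opt/seagate/cortx/*/conf/setup.yaml; do
    yq e '.critical_configs.files[]' "$setup" | while read -r f; do
        # Preserve the original path layout under the backup directory.
        mkdir -p "$BACKUP_DIR$(dirname "$f")"
        cp -a "$f" "$BACKUP_DIR$f"
    done
done
```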

ypise commented 4 years ago

This snapshot capture has to be periodic or driven by certain events. We need to capture the config of the failed node in order to restore it on the replacement node. Thus, each component's config has to be dumped to the drive and synced, either by HA or by the Provisioner.
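
For the periodic option, a plain cron entry driving rsync would be the simplest form; a sketch, with the source directories and backup destination assumed rather than decided here:

```bash
# /etc/cron.d/cortx-config-backup (hypothetical): every 15 minutes, copy the
# config directories that need capturing to a per-node backup directory.
*/15 * * * * root rsync -a /var/lib/hare /etc/csm /var/backup/$(hostname -s)/
```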

azheregelya commented 4 years ago

"no failback can be possible"

You mean unstandby?

azheregelya commented 4 years ago

Is some /var subfolder an option here?
In any case, we will have a node that orchestrates the replacement process.
The simplest way is to store the archives on that node; they can then be copied over SSH to the new (fresh) node.

azheregelya commented 4 years ago

The Provisioner is invoked on the chosen master node for deployment; there we have cluster.sls and storage_enclosure.sls filled in.
We need to ensure that the Provisioner can be run on any other node in case the master node itself is being replaced.
I would not expect trouble here, since the Salt pillars are supposed to be initialized already, but it makes sense to clarify.

mssawant commented 4 years ago

assigned to @mandar.sawant

mssawant commented 4 years ago

Epic for node replacement task.

mssawant commented 4 years ago

At our level (Provisioner and other components) we don't care; the decision will be relayed to the Provisioner (which will act as the orchestrator of the node replacement) by CSM.

@ajay.srivastava, @pranay.kumar, do we have a task covering the interface through which this will be done? If not, we need to create one.

mssawant commented 4 years ago

Okay, thanks for the input, @ivan.poddubnyy. I see some integration work with CSM here. I think we are good to define tasks for now.

3bf195dd-69f5-4d3e-a81b-c1b591320bb2 commented 4 years ago

- Node failure is reported; identify whether the node has to be replaced. How? Who will do it? RAS?

At our level (Provisioner and other components) we don't care; the decision will be relayed to the Provisioner (which will act as the orchestrator of the node replacement) by CSM.

The rest of the process looks OK to me.

mssawant commented 4 years ago

Agreed. So, the following points:

If we cannot use controller storage for /var/lib, then:

Backup node

 - Use rsync: schedule a cron job that backs up `/var/lib` and any other files if required.
 - The cron job has to be created as part of provisioning on each node.

Remove node

 - Node failure is reported; identify whether the node has to be replaced. How? Who will do it? RAS?
 - Node is identified for replacement.
 - `pcs cluster node remove <node-name>` (maybe through CSM or directly on the cluster node using pcs commands)
      - The cluster fails over (if it has not already) and runs on the other node.

Restore node

 - The Provisioner provisions a new node with the same IP address and hostname as the earlier one.
 - Once the node is replaced, the Provisioner runs a restore script to restore from the backup location.
   - e.g. `node-restore.sh <backup-location>` (restores the local node)
 - `pcs cluster start <node-name>`

Please add anything if missing. (A rough command-level sketch of these steps follows below.)

cc @konstantin.nekrasov, @andriy.tkachuk, @vvv, @ivan.poddubnyy,
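
A rough command-level sketch of the remove/restore steps above; the node name, backup location, and `node-restore.sh` helper are placeholders from this discussion, not existing tools:

```bash
# On a surviving node: drop the failed node from the Pacemaker cluster.
pcs cluster node remove srvnode-2        # node name is only an example

# After the Provisioner has re-deployed the node with the same hostname and IP,
# restore the archived configs on it (hypothetical helper and backup location).
ssh srvnode-2 node-restore.sh /var/backup/srvnode-2

# Re-add the replaced node and start cluster services on it.
pcs cluster node add srvnode-2
pcs cluster start srvnode-2
```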

3bf195dd-69f5-4d3e-a81b-c1b591320bb2 commented 4 years ago

I suggest taking backups regardless of whether they end up being used for recovery or not.

mssawant commented 4 years ago

I tried node removal and replacement with Pacemaker; it did not touch /var/lib, so if /var/lib is restored as it was earlier, it should work.

On node removal, /var/lib/pacemaker/cib is cleared but is restored from the peer node when the Pacemaker cluster starts on srvnode-1. We can still use `pcs config backup <file-name>`; this creates a tarball mainly containing cib.xml and corosync.conf.
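
For reference, the backup/restore pair would look roughly like this (the file name is arbitrary):

```bash
# On a healthy node: dump the Pacemaker/Corosync configuration to a tarball.
pcs config backup pcmk-config            # produces pcmk-config.tar.bz2

# On the replaced node (or cluster-wide): restore from that tarball.
pcs config restore pcmk-config.tar.bz2   # --local restores only on this node
```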

3bf195dd-69f5-4d3e-a81b-c1b591320bb2 commented 4 years ago

@ivan.poddubnyy, controller failure means losing local access to the devices from that node, which also means losing IO functionality. But we handle controller failure in EES by failing over, thanks to multipath. Anyway, if both controllers fail, the cluster is effectively unusable. As for performance, we need to check whether it is really affected. Maybe let's not boot the OS itself from the enclosure, but let's at least evaluate the /var/lib possibility.

I'm OK with evaluating it. The problem is this: if we lose access to the enclosure (from the server), we may cause the server to stall. /var/lib is a critical piece of the OS; losing access to it may (and will) cause issues. It needs to be evaluated.

@ivan.poddubnyy, will /opt/seagate/eos-prvsnr/pillar/components/cluster.sls be restored on node replacement in case the Salt master node itself fails? Also, I think the new node will be provisioned first and then added to the cluster, right? Which provisioning steps will be executed in this case?

cluster.sls will be restored. We'll perform limited provisioning by installing all required RPMs, copying as much configuration from the other nodes as possible, and then completing the recovery. The reason it sounds so high-level is that we expect the components to deal with their own recovery (or at least tell us what should be done in order to recover a component). From the HA perspective, I expect the new node to join the existing cluster instead of performing a full HA install/config (with a full reinstall we would definitely end up with split brain).

mssawant commented 4 years ago

@vvv, this would be useful if we cannot use controller storage for /var/lib and need to archive. Along with the Consul configuration, we need to archive at least LDAP, Elasticsearch, and Hare (build ees ha args files, cluster.yaml, confd.xc, consul-env). I was thinking about Pacemaker too, but it can be handled as described in https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_managing_nodes_in_a_corosync_based_cluster.html (I am trying this).

Backing up pacemaker configuration

@ivan.poddubnyy, will /opt/seagate/eos-prvsnr/pillar/components/cluster.sls be restored on node replacement in case the Salt master node itself fails? Also, I think the new node will be provisioned first and then added to the cluster, right? Which provisioning steps will be executed in this case?

mssawant commented 4 years ago

@vvv, please check row 30 in the Hare planning document.

vvv commented 4 years ago

I think it would be better to export the Consul configuration, i.e. take snapshots periodically and store them on some shared storage.

Consul provides a command for this; see the Datacenter Backups guide:

Consul provides the snapshot command which can be run using the CLI or the API. The snapshot command saves a point-in-time snapshot of the state of the Consul servers, which includes, but is not limited to: [...]

With Consul Enterprise, the snapshot agent command runs periodically and writes to local or remote storage (such as Amazon S3).
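
With OSS Consul (no Enterprise snapshot agent), the same effect can be scripted around the basic commands; the backup path below is an assumption:

```bash
# Save a point-in-time snapshot of the Consul servers' state.
SNAP=/var/backup/consul-$(date +%Y%m%d%H%M).snap
consul snapshot save "$SNAP"

# Verify the snapshot file.
consul snapshot inspect "$SNAP"

# Restore it into the cluster, e.g. after both nodes have been replaced.
consul snapshot restore "$SNAP"
```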

cc @mandar.sawant, @ivan.poddubnyy, @andriy.tkachuk

vvv commented 4 years ago

Yes, it's an M5 task.

I have no idea what “M5” is or what tasks it is made of. The comment would be more useful and make sense to a wider audience if “M5” were hyperlinked to the relevant document.

We have a wiki page that describes milestones up to M4. @mandar.sawant Is there anything similar for M5?

mssawant commented 4 years ago

@vvv, yes, it's an M5 task.

vvv commented 4 years ago

@mandar.sawant This node replacement discussion is EES specific, right?

vvv commented 4 years ago

Can you take this topic to the next ARC meeting? [...]

@mandar.sawant I'm not an owl, you know.

Anyway, let me read this discussion...

mssawant commented 4 years ago

@ivan.poddubnyy, controller failure means losing local access to the devices from that node, which also means losing IO functionality. But we handle controller failure in EES by failing over, thanks to multipath. Anyway, if both controllers fail, the cluster is effectively unusable. As for performance, we need to check whether it is really affected. Maybe let's not boot the OS itself from the enclosure, but let's at least evaluate the /var/lib possibility.

@vvv, can you take this topic to the next ARC meeting? If it is not agreed upon, then we would have to go with the rsync option.

3bf195dd-69f5-4d3e-a81b-c1b591320bb2 commented 4 years ago

When I proposed booting the OS from the enclosure, it was strongly opposed by all teams. The main reason was that a controller failure would render the server unusable; the second reason was the performance implications.

Moving /var/lib/ is technically possible, but it needs to be validated by the other teams. The concerns will be the same: a controller failure rendering the system unusable, and performance.

mssawant commented 4 years ago

@ivan.poddubnyy, do you think moving all these configurations and databases (Consul, LDAP, etc.) to some shared storage is better than keeping them on local nodes?

Having everything on controller storage would be ideal; that was my earlier question. If not the OS, then moving at least /var/lib from local node storage to controller storage (like /var/mero) would eliminate the need for archival altogether. @ivan.poddubnyy, how complex or easy is the task of moving /var/lib or the OS to controller storage?

Moving /var/lib

Or, booting from a separate boot partition on the controller storage; this will probably need more work.

andriytk commented 4 years ago

Sorry, I don't understand why we need rsync. In that case we'd better keep the whole OS on the controller storage - no need to archive or restore anything then.

mssawant commented 4 years ago

Actually, not from the mkfs point of view, but some components, like Consul and LDAP, may require periodic archival. The point is that this should not affect regular operations like IO, similar to the side effect that core dumping was causing (though that was a much heavier operation). But let's consider /var/mero for now; we can look into an alternate shared-storage option in parallel.

Apart from the location of the archive, I think we can create a Pacemaker resource (implemented via a custom resource agent) that runs rsync on the folders and files that need to be backed up to the shared storage. The same resource can be used to restore the files and folders once a node is replaced.

Similarly, in an environment without Pacemaker, a service registered with Consul can do this job instead of a resource agent.
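
A very rough sketch of what such a custom OCF resource agent might look like; every path, name, and the source list below is made up for illustration (a real agent would take these as OCF parameters):

```bash
#!/bin/sh
# Hypothetical OCF resource agent that rsyncs selected config directories
# to shared storage on start and on every monitor interval.

SRC_DIRS="/var/lib/hare /etc/csm"                 # illustrative sources
DEST="/var/backup/$(uname -n)"                    # illustrative destination
STATE_FILE="/run/config-backup.active"

backup() {
    mkdir -p "$DEST" || return 1
    for d in $SRC_DIRS; do
        rsync -a "$d" "$DEST/" || return 1
    done
}

case "$1" in
    start)
        backup && touch "$STATE_FILE" && exit 0   # OCF_SUCCESS
        exit 1                                    # OCF_ERR_GENERIC
        ;;
    stop)
        rm -f "$STATE_FILE"
        exit 0
        ;;
    monitor)
        # Refresh the backup while the resource is "running".
        if [ -f "$STATE_FILE" ]; then
            backup && exit 0
            exit 1
        fi
        exit 7                                    # OCF_NOT_RUNNING
        ;;
    meta-data)
        cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="config-backup" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Rsync component configs to shared storage.</longdesc>
  <shortdesc lang="en">config backup</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="120s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="120s" interval="300s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
        exit 0
        ;;
    *)
        exit 3                                    # OCF_ERR_UNIMPLEMENTED
        ;;
esac
```

The agent would be dropped under /usr/lib/ocf/resource.d/<provider>/ and created with something like `pcs resource create config-backup ocf:<provider>:config-backup op monitor interval=300s`; the non-Pacemaker variant could run the same backup logic from a service registered with Consul.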

What do you think @andriy.tkachuk, @ivan.poddubnyy?

andriytk commented 4 years ago

mkfs of /var/mero is not a recovery - it is a disaster...

It can be considered as a "recovery" only in some test environments, not in production.

3bf195dd-69f5-4d3e-a81b-c1b591320bb2 commented 4 years ago

While I'd agree that using /var/mero is a straightforward choice, I suggest that we estimate the impact on the size of the partition (it might be small, but we need to note it) and what happens if Mero crashes: the current recovery process calls for a complete reformat (mkfs) of /var/mero, so any other data stored there would be lost as well.

andriytk commented 4 years ago

but it should be possible to access the storage from both nodes, similar to /var/mero.

@mandar.sawant, each node has its own /var/mero volume, and each volume can be accessed from only one node at a time. The synchronisation is done implicitly by Pacemaker via the hax-c1/2 resource.

Even if it is shared storage, we can still create separate backup directories for each node.

But you cannot mount the same volume from both nodes at the same time. You would have to mount it on one node after the other, or have a separate volume for each node. In any case, you need some synchronisation mechanism.

That's why it's just easier to use the existing /var/mero volumes - it's already implemented.

a9bdc23b-3e00-4df4-965d-d59ba52c85ab commented 4 years ago

@mandar.sawant, replication will restore data only if one of the nodes is still active and working; if both nodes go down, we need to take a backup of the LDAP data and recover from it. This will mostly apply to other components like RabbitMQ/Elasticsearch as well. @ajay.srivastava

mssawant commented 4 years ago

@andriy.tkachuk, we can keep it separate for each node, but it should be possible to access the storage from both nodes, similar to /var/mero.

OR

Just save the configurations and databases from each node in a separate archive and store it on the shared storage. Even if it is shared storage, we can still create separate backup directories for each node.

andriytk commented 4 years ago

@mandar.sawant, how will you synchronise the use of this shared storage from both nodes?

mssawant commented 4 years ago

@andriy.tkachuk, I think we should provision a separate, dedicated shared storage (instead of /var/mero) to store the archive, as some of the databases or configurations (LDAP, Consul) may keep changing and need to be archived periodically. The archival should not affect the IO path, which it might if /var/mero were also used to back up configurations and databases. Why not move all such databases and configurations themselves to shared storage instead of storing them locally in /var/lib?

andriytk commented 4 years ago

@mandar.sawant, that's the purpose of the archival: to avoid that "regeneration" (which might not be a trivial task when we also need to preserve the user's data).

As for where to save the configs archive: one advantage of saving each node's configs on its own /var/mero is that, in case both nodes fail, we could still replace them both without losing anything.

mssawant commented 4 years ago

@ajay.srivastava, sounds good. So from the CSM side it seems we don't have any local configuration that needs to be archived as such.

@yashodhan.pise, @ivan.poddubnyy, the Provisioner configuration, mainly the updated cluster.sls, is present only on srvnode-1 and is not replicated to srvnode-2, AFAIK. Can we restore cluster.sls during node replacement in case srvnode-1 fails?

Also, regarding /var/lib: I think most of the configuration under /var/lib/hare can be regenerated during node provisioning (cc @andriy.tkachuk), although some work is needed to regenerate it without destroying the old state, especially the old /var/mero.

I was wondering: what if both nodes fail (not necessarily at the same time, but before the failed node is replaced) and need to be replaced? In that case we cannot restore from the peer node either. We would need to replace the nodes and restore the configurations without destroying /var/mero. I think it would be better to export the Consul configuration, i.e. take snapshots periodically and store them on some shared storage.

@basavaraj.kirunge, @ajay.srivastava, what about the LDAP database? It is stored locally and replicated, right? What if we need to replace both nodes?

@ivan.poddubnyy, do you think moving all these configurations and databases (Consul, LDAP, etc.) to some shared storage is better than keeping them on local nodes? In any case, even if we only archive the configurations, we would need to save the archive on some shared storage.

Assumption: as part of a node failure, its local storage is also not fit to be reused in the new node.

5ad893b4-d816-4ebd-8516-103007a0a118 commented 4 years ago

@mandar.sawant, CSM stores most of its configuration in Consul, but there are some configurations which are required even when Consul is not reachable. There are two types of configurations:

  1. Stored in Consul - no issue.
  2. Stored in a file, e.g. the support bundle config or the Consul VIP itself.

For the second case, the configs are stored in csm.conf on the local filesystem. Since the csm user has passwordless SSH configured for both nodes, the configs are replicated to the other node by the code whenever a config is added/updated. The plan is to use the same mechanism to retrieve the configs from csm.conf.
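
As a sketch, the replication described here amounts to something like the following after every config update (the csm.conf path and peer hostname are assumptions):

```bash
# Push the updated local csm.conf to the peer node over the csm user's
# passwordless SSH; hostname and path are illustrative only.
PEER=srvnode-2
rsync -a -e "ssh -o BatchMode=yes" /etc/csm/csm.conf "csm@${PEER}:/etc/csm/csm.conf"
```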

mssawant commented 4 years ago

@ajay.srivastava, could you please elaborate on how CSM will retrieve its latest configuration from the other node, and how and where it stores that configuration now?

We mainly need to understand whether there is any configuration that is stored only locally and not replicated to the other node. Such a configuration needs to be archived or moved to some shared or replicated storage; otherwise, we would need to archive it periodically (if it changes) so that it can be restored on node replacement. We cannot wait to archive it until the node fails, simply because that may no longer be possible.

Is this a one-time archiving after installation & configuration, or is it done periodically?

If the configuration is modified periodically, then it needs to be archived periodically; otherwise once is enough. So it depends on the nature of the configuration.

Where is the archive stored?

Rather than archiving, I would prefer to move such a configuration to shared or replicated storage. But even the archive itself needs to be on shared or replicated storage.

How is the trigger provided to a component for archiving and retrieving its config?

If the configuration does not change at runtime, then it can be archived as a final step of component provisioning; otherwise it should be archived on every modification, which can be done automatically via a Consul watcher. It is then restored as required, in this case during node provisioning before replacement.

But again, if this configuration is already on shared or replicated storage, then we need not archive it. So please provide such details. Thank you.
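
For the archive-on-modification case, a Consul watch can invoke an archiving handler whenever a key prefix changes; a sketch with a hypothetical prefix and handler script:

```bash
# Run the handler whenever anything under the given KV prefix changes.
# The handler receives the changed keys as JSON on stdin.
consul watch -type=keyprefix -prefix=components/config/ /usr/local/bin/archive-configs.sh
```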

5ad893b4-d816-4ebd-8516-103007a0a118 commented 4 years ago

For CSM, the current plan is to retrieve the latest configuration from the other node. I have a few queries before we can decide to use archiving:

  1. Is this a one-time archiving after installation & configuration, or is it done periodically?
  2. Where is the archive stored?
  3. How is the trigger provided to a component for archiving and retrieving its config?

d700d035-69cb-4591-95a2-fddbb58bf1f9 commented 4 years ago

@ajay.srivastava, please check the query from Mandar.

mssawant commented 4 years ago

@basavaraj.kirunge, great, that means we don't need to archive any configuration for these components. @pranay.kumar, can you please provide some information regarding CSM? Do we need to archive any specific configuration data for CSM in case of node replacement, or will everything be taken care of during provisioning?

a9bdc23b-3e00-4df4-965d-d59ba52c85ab commented 4 years ago

@mandar.sawant, S3 requires OpenLDAP, RabbitMQ and Elasticsearch configuration and replication to be done, which should be taken care of during the provisioning step. So no additional step is required; we only need to validate the above with a sanity check after the node is replaced.

mssawant commented 4 years ago

@vvv, yeah right.

@pranay.kumar, @basavaraj.kirunge, do we need to archive any of the LDAP, Elasticsearch, CSM, RabbitMQ, statsd, UDS, or S3 configuration for node replacement? I am sure Consul is replicated; I believe LDAP, Elasticsearch and RabbitMQ are replicated too. Please confirm.

Regarding credentials and other configuration: these should be part of component provisioning, so as the new node is provisioned, such configuration should be set up automatically; please confirm. Please add additional contacts if required.

vvv commented 4 years ago

Checking with other component leads.

@mandar.sawant You can CC their GitLab user names in a comment to this MR. This will keep the discussion in context (and send notifications to their e-mails).

And if you cannot find some user name, let me know and I'll add them. It takes less than a minute to add a colleague to GitLab/GitHub. :wink:

mssawant commented 4 years ago

We need a list of components that actually require archiving. Checking with other component leads. This would help us understand the size of the configuration archive and whether Consul is a feasible choice for storing it. It would also help us write the helper scripts accordingly.

typeundefined commented 4 years ago

cc @yashodhan.pise, @ivan.poddubnyy

vvv commented 4 years ago

@konstantin.nekrasov Makes sense. Thanks for explaining.

typeundefined commented 4 years ago

@vvv et al: please note that I am not even going to merge this document into the RFC.

The purpose of this MR is:

  1. Quickly create a diagram to show the overall process and desired changes
    • Since the diagram is in PlantUML, it can be easily read and modified by anyone
  2. Highlight some unclear things and state the questions
  3. Identify the expectations and understand who does what.

An MR is convenient simply because these ideas and questions can be discussed via MR discussions. At the very end we can come up with a different document, which will be uploaded to the proper place. And I'm not really sure that RFC is a suitable format for it.

vvv commented 4 years ago

TWIMC, to view the rendered diagram, go to ‘Changes’ and click ‘View file @ ...’. :eyes:

vvv commented 4 years ago

This is noise, I don't think we should add it.