cvmfs-contrib / cvmfs-hastratum1

Scripts for managing a Highly Available CVMFS Stratum1 pair of machines

Update wiki for pacemaker instead of heartbeat #2

Open DrDaveD opened 3 years ago

DrDaveD commented 3 years ago

heartbeat is no longer supported in el7. Update the wiki page here to use pacemaker instead of heartbeat.

ffurano commented 2 years ago

Hi, I am setting all this up under CS8, and I wrote a new plugin based on the "Dummy" plugin in order to invoke the script on failover. If you like, I can share it.

Fabrizio

DrDaveD commented 2 years ago

I am interested in hearing some details at least, yes.

ffurano commented 2 years ago

Well, besides learning to use pcs, the difficulty was invoking the script on failover. Pacemaker does not give a simple hook to do that. The only clean way I found was to write a small new plugin that does it and logs something meaningful.

You can see that in my pcs config. Of course, aside from this there are a few rules that force the three resources to run together and in a certain order.

# pcs status
Cluster name: cvmfs-ha
Cluster Summary:
  * Stack: corosync
  * Current DC: cvmfsdata20-4a9ba4e16a.cern.ch (version 2.1.0-8.el8-7c3f660707) - partition with quorum
  * Last updated: Mon Nov  1 20:39:35 2021
  * Last change:  Fri Oct 29 11:53:32 2021 by root via cibadmin on cvmfsdata20-4a9ba4e16a.cern.ch
  * 1 node configured
  * 3 resource instances configured

Node List:
  * Online: [ cvmfsdata20-4a9ba4e16a.cern.ch ]

Full List of Resources:
  * ClusterIP   (ocf::heartbeat:IPaddr2):    Started cvmfsdata20-4a9ba4e16a.cern.ch
  * ClusterIPv6 (ocf::heartbeat:IPaddr2):    Started cvmfsdata20-4a9ba4e16a.cern.ch
  * cvmfsfailover   (ocf::heartbeat:CVMFSFailover):  Started cvmfsdata20-4a9ba4e16a.cern.ch

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

There's only one machine here, as I was having hiccups with the other ticket I had opened. Tomorrow I'll install the second machine.

Fabrizio
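
For readers unfamiliar with the approach described above, here is a minimal sketch of what a "Dummy"-style OCF resource agent that runs a script on start might look like. It is not the plugin from this thread, which has not been shared. The parameter names `script` and `state`, their default paths, and the log tag are hypothetical; only the OCF exit codes and the `OCF_RESKEY_*` parameter convention are standard.

```shell
#!/bin/sh
# Sketch of a Dummy-style OCF resource agent that runs a script on start.
# ASSUMPTIONS (not from the thread): the parameter names "script" and
# "state", their default paths, and the log tag are all hypothetical.

# Standard OCF exit codes used below.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Pacemaker passes resource parameters as OCF_RESKEY_<name> variables.
: "${OCF_RESKEY_script:=/usr/local/sbin/failover-hook}"   # hypothetical path
: "${OCF_RESKEY_state:=/var/run/cvmfsfailover.state}"     # hypothetical path

log() {
    # Fall back to stderr if logger is unavailable or fails.
    logger -t CVMFSFailover "$*" 2>/dev/null || echo "CVMFSFailover: $*" >&2
}

agent_monitor() {
    # As in the Dummy agent, "running" means the state file exists.
    if [ -f "$OCF_RESKEY_state" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

agent_start() {
    # Start must be idempotent: do nothing if already started.
    if agent_monitor; then
        return $OCF_SUCCESS
    fi
    log "starting on $(uname -n), running $OCF_RESKEY_script"
    if ! "$OCF_RESKEY_script"; then
        log "failover script failed"
        return $OCF_ERR_GENERIC
    fi
    touch "$OCF_RESKEY_state" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

agent_stop() {
    # Stop must succeed even if the resource is not running.
    rm -f "$OCF_RESKEY_state"
    log "stopped on $(uname -n)"
    return $OCF_SUCCESS
}

agent_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="CVMFSFailover" version="0.1">
<version>1.0</version>
<longdesc lang="en">Runs a script when the resource starts on a node.</longdesc>
<shortdesc lang="en">Failover hook (sketch)</shortdesc>
<parameters>
<parameter name="script"><longdesc lang="en">Script to run on start</longdesc>
<shortdesc lang="en">Hook script</shortdesc><content type="string"/></parameter>
<parameter name="state"><longdesc lang="en">State file path</longdesc>
<shortdesc lang="en">State file</shortdesc><content type="string"/></parameter>
</parameters>
<actions>
<action name="start" timeout="60s"/>
<action name="stop" timeout="20s"/>
<action name="monitor" timeout="20s" interval="30s"/>
<action name="meta-data" timeout="5s"/>
</actions>
</resource-agent>
EOF
}

main() {
    case "$1" in
        start)     agent_start ;;
        stop)      agent_stop ;;
        monitor)   agent_monitor ;;
        meta-data) agent_meta_data ;;
        *)         return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# Dispatch only when called with an action, as Pacemaker does.
if [ "$#" -gt 0 ]; then
    main "$@"
    exit $?
fi
```

Because the hook runs only on the transition from "not running" to "running", it fires when the resource first starts on a node, including after a failover, and not on every monitor call. Keeping the state file under a directory that is cleared at boot means a rebooted node does not report the resource as still running.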

DrDaveD commented 2 years ago

Once you have it fully working please share all the details. I would then try to use it for my own el7 development cluster and update the wiki accordingly. At RAL they used my yum repo of el6 heartbeat packages on el7 since that was the simplest way to upgrade from el6 and it worked. The other two sites using cvmfs-hastratum1, FNAL and IHEP, are doing only manual failover between the two servers. Do you want to use it at CERN?

At UNL we're not using this package at all. Instead we do a DNS switch between the two stratum 1 servers, which the cvmfs client handles by noticing the IP address change and reloading catalogs. A DNS switch should be especially easy to do at CERN, I think. The switch is currently done manually at UNL but I would like to automate it. One way to do it would be via pacemaker. I am using pacemaker with an automatic DNS switch on the wlcg-squid-monitor.cern.ch and frontier.cern.ch web services.

The main advantage of a DNS switch over the cvmfs-hastratum1 package is that there are fewer delays for updates. The cvmfs-hastratum1 package does not make an update available to clients until a snapshot has successfully replicated from the primary to the backup server (unless it is running in standalone mode, where a failure is allowed). The main disadvantage of a DNS switch is that both stratum 1s replicate from the stratum 0 independently. When I first developed the package it was a requirement not to replicate to 2 stratum 1s, but I don't think people really care much about that today.

ffurano commented 2 years ago

The test service is almost working now, and I'll continue with the setup of the big machines we need. Once it's close to prod (which means that we'll have to finish ironing out the rough edges and validate it) I'll be happy to share the details.

For the record, this setup uses pacemaker to bounce an 'ha' IP address between two machines that act as backends for the CVMFS service at CERN. This is done by attaching the IP address to an additional network interface that the two machines have been given.
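
As a rough illustration of such a setup, the sketch below shows how three resources with the names from the `pcs status` output above could be created and tied together with colocation and ordering constraints. It is not the actual CERN configuration: the IP addresses (taken from documentation ranges), prefix lengths, interface name `eth1`, and the specific ordering are all assumptions, and any parameters of the custom `CVMFSFailover` agent are omitted because they have not been shared. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: prints the pcs commands instead of executing them.
# ASSUMPTIONS: the addresses, prefix lengths, NIC name, and the chosen
# ordering are placeholders, not values from the thread.
DRY_RUN="${DRY_RUN:-1}"
HA_NIC="${HA_NIC:-eth1}"            # hypothetical extra interface
HA_IP4="${HA_IP4:-192.0.2.10}"      # documentation-range IPv4 address
HA_IP6="${HA_IP6:-2001:db8::10}"    # documentation-range IPv6 address

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_ha_ip() {
    # Floating addresses attached to the additional interface.
    run pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
        ip="$HA_IP4" cidr_netmask=24 nic="$HA_NIC" op monitor interval=30s
    run pcs resource create ClusterIPv6 ocf:heartbeat:IPaddr2 \
        ip="$HA_IP6" cidr_netmask=64 nic="$HA_NIC" op monitor interval=30s
    # Custom agent; its parameters (if any) are not known and are left out.
    run pcs resource create cvmfsfailover ocf:heartbeat:CVMFSFailover

    # Keep all three resources on the same node.
    run pcs constraint colocation add ClusterIPv6 with ClusterIP INFINITY
    run pcs constraint colocation add cvmfsfailover with ClusterIP INFINITY

    # Bring up the addresses first, then the failover hook (assumed order).
    run pcs constraint order ClusterIP then ClusterIPv6
    run pcs constraint order ClusterIPv6 then cvmfsfailover
}

configure_ha_ip
```

Putting the three resources into a single pcs resource group would be another way to get both colocation and ordering; explicit constraints are used here only because the thread mentions "rules".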