ClusterLabs / anvil

The Anvil! Intelligent Availability™ Platform, mark 3

Fence levels need to be checked/reconfigured independent of stonith configs #522

Closed by digimer 5 months ago

digimer commented 10 months ago

Currently, fence levels are only updated if a fence device changes. A case has been seen where the fence levels were missing and not repaired because the actual stonith device configs were fine.

digimer commented 10 months ago

Look in Cluster.pm->check_stonith_config(), around:

    # Setup fence levels.
    foreach my $node_name (sort {$a cmp $b} keys %{$fence_order})
    {
        $anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => $debug, list => { "something_changed->{$node_name}" => $something_changed->{$node_name} }});
        if ($something_changed->{$node_name})

This is line 1654 as of when this bug was filed.
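One way to decouple the fence level check from `$something_changed` is to compare the levels the cluster actually reports against the wanted order on every pass. The following is only a minimal sketch of that idea, not the code in Cluster.pm. The helper names `parse_fence_levels` and `levels_need_repair` are hypothetical, the shape of `$fence_order` (node name => ordered list of device strings) is an assumption made for illustration, and the parser only handles the "Fencing Levels:" text layout quoted in this issue.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the "Fencing Levels:" section of the pcs output
# (layout as quoted in this issue) into { node => { level => "dev1,dev2" } }.
sub parse_fence_levels
{
    my ($text) = @_;
    my $levels     = {};
    my $in_section = 0;
    my $target     = "";
    foreach my $line (split /\n/, $text)
    {
        if ($line =~ /^Fencing Levels:/)
        {
            $in_section = 1;
            next;
        }
        next if not $in_section;
        if ($line =~ /^\s+Target:\s+(\S+)/)
        {
            $target            = $1;
            $levels->{$target} = {};
        }
        elsif (($target ne "") && ($line =~ /^\s+Level\s+(\d+)\s+-\s+(\S+)/))
        {
            $levels->{$target}{$1} = $2;
        }
        elsif ($line =~ /^\S/)
        {
            # Any other unindented heading ends the section.
            last;
        }
    }
    return $levels;
}

# Hypothetical helper: returns 1 if the actual levels for one node differ from
# the wanted ones (missing node, missing level, extra level, or different
# devices), and 0 otherwise. Device lists are compared as plain strings.
sub levels_need_repair
{
    my ($wanted, $actual) = @_;
    $actual //= {};
    return 1 if scalar(@{$wanted}) != scalar(keys %{$actual});
    foreach my $i (0 .. $#{$wanted})
    {
        my $level = $i + 1;
        return 1 if not defined $actual->{$level};
        return 1 if $actual->{$level} ne $wanted->[$i];
    }
    return 0;
}

# Example data; the structure of $fence_order here is assumed, not Anvil's.
my $fence_order = {
    'an-a01n01' => ['ipmilan_node1', 'delay_node1'],
    'an-a01n02' => ['ipmilan_node2', 'delay_node2'],
};
my $pcs_output = <<'EOF';
Fencing Levels:
  Target: an-a01n01
    Level 1 - ipmilan_node1

Resources Defaults:
EOF

my $actual = parse_fence_levels($pcs_output);
foreach my $node_name (sort {$a cmp $b} keys %{$fence_order})
{
    my $repair = levels_need_repair($fence_order->{$node_name}, $actual->{$node_name});
    print $node_name.": ".($repair ? "fence levels need repair" : "fence levels OK")."\n";
}
```

With a check like this, the levels for a node would be rebuilt when either the stonith devices changed or the reported levels do not match, so missing levels get repaired even when the device configs themselves are fine.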

fabbione commented 5 months ago

We need this code fixed right now. It's causing havoc in CI when multiple fence devices are defined.

http://anvil-ci-repo.ci.alteeve.com/testing-logs/dafaq.tar.gz

ipmi, apc, virt and delay are configured in the template but:

Fencing Levels:
  Target: an-a01n01
    Level 1 - ipmilan_node1
    Level 2 - apc_snmp_node1_an-pdu01,apc_snmp_node1_an-pdu02

Resources Defaults:

Fence levels are incomplete for node1 and completely missing for node2.

fabbione commented 5 months ago

Configuring only IPMI in CI appears to do the trick:

Fencing Levels:
  Target: an-a01n01
    Level 1 - ipmilan_node1
    Level 2 - delay_node1
  Target: an-a01n02
    Level 1 - ipmilan_node2
    Level 2 - delay_node2

fabbione commented 5 months ago

I ran a few manual tests with only apc added to the template (dropping gravitar/fence_virt). Once, the strikers failed to join the db at an early stage; a couple of other times, they configured the nodes correctly.

There is clearly something deep going on inside the fence config code that must be addressed ASAP.

This is leaving aside that simengine apc is severely broken for other reasons.

fabbione commented 5 months ago

Changing the order in the config template does help to get to the right point. For example:

Fencing Levels:
  Target: an-a01n01
    Level 1 - ipmilan_node1
    Level 2 - apc_snmp_node1_an-pdu01,apc_snmp_node1_an-pdu02
    Level 3 - virt_node1_gravitar
    Level 4 - delay_node1
  Target: an-a01n02
    Level 1 - ipmilan_node2
    Level 2 - apc_snmp_node2_an-pdu01,apc_snmp_node2_an-pdu02
    Level 3 - virt_node2_gravitar
    Level 4 - delay_node2

digimer commented 5 months ago

Thanks for the update. I will work on this Monday/tomorrow.