ClusterLabs / anvil

The Anvil! Intelligent Availability™ Platform, mark 3

ocf:alteeve:server constraints are changed #391

Open fabbione opened 1 year ago

fabbione commented 1 year ago

Not sure about the priority yet, but this could explain the failed deployment.

Observing pacemaker under load while deploying 5 servers:

Creation of the server:

Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++ /cib/configuration/resources:  <primitive class="ocf" id="an-test-deploy1" provider="alteeve" type="server"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                  <instance_attributes id="an-test-deploy1-instance_attributes">
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <nvpair id="an-test-deploy1-instance_attributes-name" name="name" value="an-test-deploy1"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                  <meta_attributes id="an-test-deploy1-meta_attributes">
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <nvpair id="an-test-deploy1-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <nvpair id="an-test-deploy1-meta_attributes-target-role" name="target-role" value="started"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <op id="an-test-deploy1-migrate_from-interval-0s" interval="0s" name="migrate_from" on-fail="block" timeout="600"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <op id="an-test-deploy1-migrate_to-interval-0s" interval="0s" name="migrate_to" on-fail="block" timeout="600"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <op id="an-test-deploy1-monitor-interval-60" interval="60" name="monitor"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <op id="an-test-deploy1-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <op id="an-test-deploy1-start-interval-0s" interval="0s" name="start" on-fail="block" timeout="60"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <op id="an-test-deploy1-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="300"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++ /cib/configuration/constraints:  <rsc_location id="location-an-test-deploy1-an-a01n01-200" node="an-a01n01" rsc="an-test-deploy1" score="200"/>
Jul 27 07:08:00 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++ /cib/configuration/constraints:  <rsc_location id="location-an-test-deploy1-an-a01n02-100" node="an-a01n02" rsc="an-test-deploy1" score="100"/>

A later change to the constraints, which could cause the server to be migrated, stopped, or started during deployment:

Jul 27 07:12:14 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++ /cib/configuration/constraints:  <rsc_location id="location-an-test-deploy1" rsc="an-test-deploy1"/>
Jul 27 07:12:14 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                    <rule id="location-an-test-deploy1-rule" score="-INFINITY">
Jul 27 07:12:14 an-a01n01.ci.alteeve.com pacemaker-based     [3955] (log_info)  info: ++                                      <expression attribute="drbd-fenced_an-test-deploy1" id="location-an-test-deploy1-rule-expr" operation="eq" value="1"/>

What is changing the location constraint, and why?
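
To help answer that question, here is a minimal sketch of a filter that pulls every `rsc_location` addition for one server out of the pacemaker log, so the timestamps can be lined up against whatever was running at the time. The function name `constraint_additions` and the regular expressions are illustrative assumptions that match only the log format quoted above; they are not part of the Anvil! code.

```python
import re

# Matches CIB additions ("++") logged by pacemaker-based that open a new
# <rsc_location> element under /cib/configuration/constraints.
# Assumption: the syslog-style line layout shown in this issue.
CONSTRAINT_ADD = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+pacemaker-based\b.*?"
    r"\+\+\s+/cib/configuration/constraints:\s+(?P<xml><rsc_location\b.*)$"
)
ID_ATTR = re.compile(r'\bid="([^"]+)"')
RSC_ATTR = re.compile(r'\brsc="([^"]+)"')


def constraint_additions(lines, resource):
    """Return (timestamp, constraint id) for each rsc_location added for resource."""
    found = []
    for line in lines:
        match = CONSTRAINT_ADD.match(line)
        if not match:
            continue
        xml = match.group("xml")
        rsc = RSC_ATTR.search(xml)
        ident = ID_ATTR.search(xml)
        if rsc and ident and rsc.group(1) == resource:
            found.append((match.group("stamp"), ident.group(1)))
    return found
```

Feeding it the log lines above for `an-test-deploy1` would list the two score-based constraints at 07:08:00 and the `location-an-test-deploy1` rule constraint at 07:12:14, which makes the four-minute gap between the two writes easy to spot.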

digimer commented 1 year ago

Referencing our conversation: try updating the monitor timeout to 60 seconds. Please do this in the "tests-for-fabio" branch.

digimer commented 1 year ago

Add the logging attributes as well.

digimer commented 1 year ago

Make sure that the DRBD fence rules are included in the atomic rules used during provisioning.
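
As a rough illustration, and assuming "atomic" here means that all constraints for a new server reach the CIB in a single update, the sketch below builds the two score-based location constraints and the DRBD fence rule as one XML fragment. The element names, IDs, and scores are copied from the log lines earlier in this issue. The function name `provision_constraints` is hypothetical, and how the fragment is pushed to the cluster is deliberately left out.

```python
import xml.etree.ElementTree as ET


def provision_constraints(server, preferred, fallback):
    """Build all location constraints for a new server as one <constraints> fragment."""
    constraints = ET.Element("constraints")

    # Node preferences, with the scores seen in the log (200 and 100).
    for node, score in ((preferred, 200), (fallback, 100)):
        ET.SubElement(constraints, "rsc_location", {
            "id": f"location-{server}-{node}-{score}",
            "node": node, "rsc": server, "score": str(score),
        })

    # DRBD fence rule: keep the server off any node whose
    # drbd-fenced_<server> node attribute equals 1.
    fence = ET.SubElement(constraints, "rsc_location",
                          {"id": f"location-{server}", "rsc": server})
    rule = ET.SubElement(fence, "rule",
                         {"id": f"location-{server}-rule", "score": "-INFINITY"})
    ET.SubElement(rule, "expression", {
        "attribute": f"drbd-fenced_{server}",
        "id": f"location-{server}-rule-expr",
        "operation": "eq", "value": "1",
    })
    return constraints
```

Calling `ET.tostring(provision_constraints("an-test-deploy1", "an-a01n01", "an-a01n02"), encoding="unicode")` yields a fragment with all three constraints together, so the scheduler would never see the server with only part of its rules in place.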

fabbione commented 1 year ago

Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++ /cib/configuration/resources:  <primitive class="ocf" id="an-test-deploy2" provider="alteeve" type="server"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                  <instance_attributes id="an-test-deploy2-instance_attributes">
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <nvpair id="an-test-deploy2-instance_attributes-log_level" name="log_level" value="2"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <nvpair id="an-test-deploy2-instance_attributes-log_secure" name="log_secure" value="1"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <nvpair id="an-test-deploy2-instance_attributes-name" name="name" value="an-test-deploy2"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                  <meta_attributes id="an-test-deploy2-meta_attributes">
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <nvpair id="an-test-deploy2-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <nvpair id="an-test-deploy2-meta_attributes-target-role" name="target-role" value="stopped"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <op id="an-test-deploy2-migrate_from-interval-0s" interval="0s" name="migrate_from" on-fail="block" timeout="600"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <op id="an-test-deploy2-migrate_to-interval-0s" interval="0s" name="migrate_to" on-fail="block" timeout="600"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <op id="an-test-deploy2-monitor-interval-60" interval="60" name="monitor"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <op id="an-test-deploy2-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <op id="an-test-deploy2-start-interval-0s" interval="0s" name="start" on-fail="block" timeout="60"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++                                    <op id="an-test-deploy2-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="300"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++ /cib/configuration/constraints:  <rsc_location id="location-an-test-deploy2-an-a01n01-200" node="an-a01n01" rsc="an-test-deploy2" score="200"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++ /cib/configuration/constraints:  <rsc_location id="location-an-test-deploy2-an-a01n02-100" node="an-a01n02" rsc="an-test-deploy2" score="100"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-controld  [3972] (abort_transition_graph)    notice: Transition 27 aborted by primitive.an-test-deploy2 'create': Configuration change | cib=0.58.0 source=te_update_diff_v2:464 path=/cib/configuration/resources complete=false
Jul 30 06:30:46 an-a01n01.ci.alteeve.com pacemaker-based     [3967] (log_info)  info: ++ /cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2']:  <nvpair id="nodes-2-drbd-fenced_an-test-deploy2" name="drbd-fenced_an-test-deploy2" value="0"/>
Jul 30 06:30:46 an-a01n01.ci.alteeve.com pacemaker-controld  [3972] (abort_transition_graph)    info: Transition 27 aborted by nodes-2-drbd-fenced_an-test-deploy2 doing create drbd-fenced_an-test-deploy2=0: Configuration change | cib=0.61.0 source=te_update_diff_v2:464 path=/cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2'] complete=false
Jul 30 06:30:48 an-a01n01.ci.alteeve.com pacemaker-schedulerd[3971] (pcmk__primitive_assign)    info: Resource an-test-deploy2 cannot run anywhere
Jul 30 06:30:48 an-a01n01.ci.alteeve.com pacemaker-schedulerd[3971] (rsc_action_default)        info: Leave   an-test-deploy2   (Stopped)

Here is another example of a server being mismanaged during install.

digimer commented 1 year ago

This should be resolved by serializing the pcs calls.
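
One possible shape for that serialization, shown only as a sketch: wrap every call in an exclusive file lock so that concurrent provisioning jobs queue up instead of interleaving their CIB changes. The helper name `run_serialized` and the lock path are assumptions for illustration, not the approach taken in the actual fix. Note that `flock` only serializes callers on the same host.

```python
import fcntl
import subprocess


def run_serialized(command, lock_path="/tmp/pcs-calls.lock"):
    """Run command while holding an exclusive lock, so concurrent callers queue up.

    lock_path is a hypothetical location; any file writable by all callers works.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return subprocess.run(command, capture_output=True,
                                  text=True, check=False)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A caller would pass the full argument list, for example `run_serialized(["pcs", "constraint", "location", ...])`, and inspect `returncode` and `stderr` on the returned object before moving to the next step.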

digimer commented 1 year ago

Should be fixed in PR #408.

fabbione commented 1 year ago

  * an-test-deploy1     (ocf::alteeve:server):   Started an-a01n01
  * an-test-deploy2     (ocf::alteeve:server):   Started an-a01n02
  * an-test-deploy3     (ocf::alteeve:server):   Stopped (disabled)
  * an-test-deploy4     (ocf::alteeve:server):   Started an-a01n01
  * an-test-deploy5     (ocf::alteeve:server):   Started an-a01n01