Open fabbione opened 1 year ago
Referencing our conversation: try updating the monitor timeout to 60 seconds. Please do this in the "tests-for-fabio" branch.
Add logging attributes as well
Make sure that the DRBD fence rules are included in the atomic rules used during provisioning
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ /cib/configuration/resources: <primitive class="ocf" id="an-test-deploy2" provider="alteeve" type="server"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <instance_attributes id="an-test-deploy2-instance_attributes">
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <nvpair id="an-test-deploy2-instance_attributes-log_level" name="log_level" value="2"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <nvpair id="an-test-deploy2-instance_attributes-log_secure" name="log_secure" value="1"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <nvpair id="an-test-deploy2-instance_attributes-name" name="name" value="an-test-deploy2"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <meta_attributes id="an-test-deploy2-meta_attributes">
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <nvpair id="an-test-deploy2-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <nvpair id="an-test-deploy2-meta_attributes-target-role" name="target-role" value="stopped"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <op id="an-test-deploy2-migrate_from-interval-0s" interval="0s" name="migrate_from" on-fail="block" timeout="600"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <op id="an-test-deploy2-migrate_to-interval-0s" interval="0s" name="migrate_to" on-fail="block" timeout="600"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <op id="an-test-deploy2-monitor-interval-60" interval="60" name="monitor"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <op id="an-test-deploy2-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <op id="an-test-deploy2-start-interval-0s" interval="0s" name="start" on-fail="block" timeout="60"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ <op id="an-test-deploy2-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="300"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ /cib/configuration/constraints: <rsc_location id="location-an-test-deploy2-an-a01n01-200" node="an-a01n01" rsc="an-test-deploy2" score="200"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ /cib/configuration/constraints: <rsc_location id="location-an-test-deploy2-an-a01n02-100" node="an-a01n02" rsc="an-test-deploy2" score="100"/>
Jul 30 06:28:55 an-a01n01.ci.alteeve.com pacemaker-controld [3972] (abort_transition_graph) notice: Transition 27 aborted by primitive.an-test-deploy2 'create': Configuration change | cib=0.58.0 source=te_update_diff_v2:464 path=/cib/configuration/resources complete=false
Jul 30 06:30:46 an-a01n01.ci.alteeve.com pacemaker-based [3967] (log_info) info: ++ /cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2']: <nvpair id="nodes-2-drbd-fenced_an-test-deploy2" name="drbd-fenced_an-test-deploy2" value="0"/>
Jul 30 06:30:46 an-a01n01.ci.alteeve.com pacemaker-controld [3972] (abort_transition_graph) info: Transition 27 aborted by nodes-2-drbd-fenced_an-test-deploy2 doing create drbd-fenced_an-test-deploy2=0: Configuration change | cib=0.61.0 source=te_update_diff_v2:464 path=/cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2'] complete=false
Jul 30 06:30:48 an-a01n01.ci.alteeve.com pacemaker-schedulerd[3971] (pcmk__primitive_assign) info: Resource an-test-deploy2 cannot run anywhere
Jul 30 06:30:48 an-a01n01.ci.alteeve.com pacemaker-schedulerd[3971] (rsc_action_default) info: Leave an-test-deploy2 (Stopped)
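In the logged definition above, the `monitor` op is the only operation without a `timeout` attribute. A minimal Python sketch for checking this is below. The `<op>` lines are copied from the log. The `<operations>` wrapper is added only so the fragments parse as one document, and the function name `ops_without_timeout` is hypothetical.

```python
import xml.etree.ElementTree as ET

# <op> elements copied from the pacemaker-based log lines above.
# The <operations> wrapper is an assumption added so the snippet parses.
OPS_XML = """
<operations>
  <op id="an-test-deploy2-migrate_from-interval-0s" interval="0s" name="migrate_from" on-fail="block" timeout="600"/>
  <op id="an-test-deploy2-migrate_to-interval-0s" interval="0s" name="migrate_to" on-fail="block" timeout="600"/>
  <op id="an-test-deploy2-monitor-interval-60" interval="60" name="monitor"/>
  <op id="an-test-deploy2-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
  <op id="an-test-deploy2-start-interval-0s" interval="0s" name="start" on-fail="block" timeout="60"/>
  <op id="an-test-deploy2-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="300"/>
</operations>
"""


def ops_without_timeout(xml_text):
    """Return the names of all <op> elements that have no timeout attribute."""
    root = ET.fromstring(xml_text)
    return [op.get("name") for op in root.iter("op") if op.get("timeout") is None]


print(ops_without_timeout(OPS_XML))  # prints ['monitor']
```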
Here is another example of a node being mismanaged during install.
This should be resolved by serializing pcs calls
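One possible way to serialize the calls, as a rough sketch only: take an exclusive file lock before each invocation, so concurrent callers on the same host run one at a time. This is not necessarily how the fix is implemented. The lock path and the helper names (`exclusive_lock`, `run_serialized`) are hypothetical, and a local lock like this does not serialize calls issued from different nodes.

```python
import fcntl
import subprocess
from contextlib import contextmanager

# Hypothetical lock file; any path writable by all callers would do.
LOCK_PATH = "/tmp/pcs-serialize.lock"


@contextmanager
def exclusive_lock(path=LOCK_PATH):
    """Block until an exclusive advisory lock on `path` is held."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_serialized(argv, lock_path=LOCK_PATH):
    """Run a command while holding the lock, so callers never overlap."""
    with exclusive_lock(lock_path):
        return subprocess.run(argv, check=True, capture_output=True, text=True)


# Example (assumes pcs is installed on the host):
# result = run_serialized(["pcs", "status"])
# print(result.stdout)
```

Each caller blocks inside `exclusive_lock` until the previous command has finished, so a second CIB change cannot start while an earlier one is still running on that host.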
Should be fixed in PR #408
* an-test-deploy1 (ocf::alteeve:server): Started an-a01n01
* an-test-deploy2 (ocf::alteeve:server): Started an-a01n02
* an-test-deploy3 (ocf::alteeve:server): Stopped (disabled)
* an-test-deploy4 (ocf::alteeve:server): Started an-a01n01
* an-test-deploy5 (ocf::alteeve:server): Started an-a01n01
Not sure about the priority yet, but this could explain the failed deployment.
Observing Pacemaker under load with 5 server deployments:
Creation of the server:
Change of the constraints that could cause the node to be migrated/stopped/started during deployment:
What is changing the location, and why?