ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

ERROR: LXC container name not set! #1857

Open · iglov opened this issue 1 year ago

iglov commented 1 year ago

OS: Debian 11 (and Debian 10)
Kernel: 5.10.0-15-amd64
Env: resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2

Just trying to add a new resource:

lxc-start -n front-2.fr
pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr

After ~5 minutes I wanted to remove it with pcs resource remove front-2.fr --force, but I got an error and the cluster started to migrate:

Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!

As far as I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error is raised when the agent can't get the OCF_RESKEY_container variable. This bug only appears on clusters that have been running without a reboot for a long time. For example, after fencing I can add/remove LXC resources and everything works fine for a while.
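
For context, the guard in LXC_validate() that produces this message looks roughly like the following (a paraphrased sketch, not a verbatim copy of heartbeat/lxc; ocf_exit_reason and OCF_ERR_CONFIGURED come from the ocf-shellfuncs library the agent sources):

# inside LXC_validate(): bail out if the container parameter was not passed in
if [ -z "$OCF_RESKEY_container" ]; then
    ocf_exit_reason "LXC container name not set!"
    exit $OCF_ERR_CONFIGURED
fi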

The question is: why? And how to debug it?

oalbrigt commented 1 year ago

This might be due to the probe-action.

You can try changing https://github.com/ClusterLabs/resource-agents/blob/fe1a2f88ac32dfaba86baf995094e2b4fa0d8def/heartbeat/lxc.in#L343 to ocf_is_probe || LXC_validate.
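
In other words, the suggestion is to guard the standalone validate call so that probe operations skip it entirely (a sketch, assuming the linked line is the bare LXC_validate call before the action dispatch):

# before
LXC_validate
# after: ocf_is_probe succeeds during probe operations, so validation is skipped for them
ocf_is_probe || LXC_validate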

oalbrigt commented 1 year ago

Seems like the agent already takes care of probe-actions, so I'll have to investigate further what might cause it.

iglov commented 1 year ago

Hey @oalbrigt, thanks for the reply!

to ocf_is_probe || LXC_validate.

Yep, of course I can try, but what's the point if, as we can see, the OCF_RESKEY_container variable doesn't exist or the agent simply doesn't know anything about it? So even if I try it, it won't stop the container here, for the same reason: https://github.com/ClusterLabs/resource-agents/blob/fe1a2f88ac32dfaba86baf995094e2b4fa0d8def/heartbeat/lxc.in#L184

oalbrigt commented 1 year ago

@kgaillot Do you know what might cause OCF_RESKEY variables not being set when doing pcs resource remove --force?

kgaillot commented 1 year ago

@kgaillot Do you know what might cause OCF_RESKEY variables not being set when doing pcs resource remove --force?

No, that's odd. Was the command tried without --force first? It shouldn't normally be necessary, so if it was, that might point to an issue.

iglov commented 1 year ago

Hey @kgaillot, thanks for the reply! Nope, without --force the result is the same.

kgaillot commented 1 year ago

@iglov @oalbrigt , can one of you try dumping the environment to a file from within the stop command? Are no OCF variables set, or is it just that one missing?

iglov commented 1 year ago

Well, I can try, if you tell me how to do that and if I find a cluster in the same state.

kgaillot commented 1 year ago

Something like env > /run/lxc.env in the agent's stop action

iglov commented 1 year ago

Oh, you mean I should place env > /run/lxc.env somewhere in /usr/lib/ocf/resource.d/heartbeat/lxc, inside LXC_stop() { ... }? But that won't work because: 1. It dies before LXC_stop(), in LXC_validate(); 2. After fencing the node reboots and /run gets unmounted. So I think it would be better to put env > /root/lxc.env in LXC_validate(), as sketched below. If that's correct, I'll try it when I find a cluster with this bug.
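
Something like this, for example (a minimal debug sketch; /root/lxc.env is just an arbitrary location that survives a reboot, unlike /run):

LXC_validate() {
    # temporary debugging: capture the exact environment the agent was invoked with
    env > /root/lxc.env
    # ... rest of the original LXC_validate() unchanged ...
}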

kgaillot commented 1 year ago

That sounds right

iglov commented 7 months ago

Hey guys! I got it. I tried to stop the container nsa-1.ny with pcs resource remove nsa-1.ny --force and got some debug output:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=5d3831d43d924a08a3dad6f49613e661
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
PCMK_quorum_type=corosync
SHLVL=1
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:36160
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

And this is how it should look:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=b062591edd5142bd952b5ecc4f86b493
OCF_RESKEY_CRM_meta_interval=30000
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
OCF_RESKEY_config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config
PCMK_quorum_type=corosync
OCF_RESKEY_CRM_meta_name=monitor
SHLVL=1
OCF_RESKEY_container=nsa-1.ny
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:44603
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

As you can see, some variables, like OCF_RESKEY_container and OCF_RESKEY_config, are missing.
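
(A quick way to see exactly which variables differ, assuming the two dumps above were saved to the hypothetical files /root/lxc.env.bad and /root/lxc.env.good; the process substitution needs bash:)

diff <(sort /root/lxc.env.bad) <(sort /root/lxc.env.good)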

Any ideas? ^_^

oalbrigt commented 7 months ago

That's strange. Did you create it without specifying container=<container name> and using -f to force it? What does your pcs resource config output say?

iglov commented 7 months ago

Yes, it's very, VERY strange. I create resources with pcs resource create test ocf:heartbeat:lxc container=test config=/mnt/cluster_volumes/lxc1/test/config (you can see it at the top of this issue), BUT it does not matter, because as I said earlier:

This bug only appears on clusters that have been running without a reboot for a long time. For example, after fencing I can add/remove LXC resources and everything works fine for a while.

As you can see, almost a year passed before the bug appeared. That means I can create the resource with ANY method and it WILL work correctly until... something goes wrong. The pcs resource config output looks fine:

  Resource: nsa-1.ny (class=ocf provider=heartbeat type=lxc)
   Attributes: config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config container=nsa-1.ny
   Operations: monitor interval=30s timeout=20s (nsa-1.ny-monitor-interval-30s)
               start interval=0s timeout=60s (nsa-1.ny-start-interval-0s)
               stop interval=0s timeout=60s (nsa-1.ny-stop-interval-0s)

So-o-o-o, I have no idea how to debug it further :(

oalbrigt commented 7 months ago

Can you add the output of rpm -qa | grep pacemaker, so I can have our Pacemaker devs see if this is a known issue?

iglov commented 7 months ago

Yep, sure, but I have it on Debian:

# dpkg -l | grep pacemaker
ii  pacemaker                            2.0.1-5                      amd64        cluster resource manager
ii  pacemaker-cli-utils                  2.0.1-5                      amd64        cluster resource manager command line utilities
ii  pacemaker-common                     2.0.1-5                      all          cluster resource manager common files
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents

# dpkg -l | grep corosync
ii  corosync                             3.0.1-2+deb10u1              amd64        cluster engine daemon and utilities
ii  corosync-qdevice                     3.0.0-4+deb10u1              amd64        cluster engine quorum device daemon
ii  libcorosync-common4:amd64            3.0.1-2+deb10u1              amd64        cluster engine common library

# dpkg -l | grep resource-agents
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents
ii  resource-agents                      1:4.7.0-1~bpo10+1            amd64        Cluster Resource Agents

# dpkg -l | grep lxc
ii  liblxc1                              1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools (library)
ii  lxc                                  1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools
ii  lxc-templates                        3.0.4-0+deb10u1              amd64        Linux Containers userspace tools (templates)
ii  lxcfs                                3.0.3-2                      amd64        FUSE based filesystem for LXC

kgaillot commented 7 months ago

@iglov That is extremely odd. If you still have the logs from when that occurred, can you open a bug at bugs.clusterlabs.org and attach the output of crm_report -S --from="YYYY-M-D H:M:S" --to="YYYY-M-D H:M:S" from each node, covering the half hour or so around when the failed stop happened?

iglov commented 7 months ago

I would like to, but I can't, because there is a lot of business-sensitive information in there: hostnames, general logs, process lists, even DRBD passwords :(

kgaillot commented 7 months ago

I would like to, but I can't, because there is a lot of business-sensitive information in there: hostnames, general logs, process lists, even DRBD passwords :(

It would be helpful to at least get the scheduler input that led to the problem. At the time the problem occurred, one of the nodes was the designated controller (DC). It will have a log message like "Calculated transition ... saving inputs in ...". The last message before the problem occurred is the interesting one, and the file name is the input. You can uncompress it and edit out any sensitive information, then email it to kgaillot@redhat.com.

Alternatively you can investigate the file yourself. I'd start with checking the resource configuration and make sure the resource parameters are set correctly there. If they're not, someone or something likely modified the configuration. If they are, the next thing I'd try is crm_simulate -Sx $FILENAME -G graph.xml. The command output should show a stop scheduled on the old node and a start scheduled on the new node (if not, you probably have the wrong input). The graph.xml file should have <rsc_op> entries for the stop and start with all the parameters that will be passed to the agent.
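
For example, something along these lines on the DC node (a sketch; the log path and the pe-input number are illustrative and assume a default Pacemaker 2.x layout):

# find which pe-input file the DC saved for the transition in question
grep 'saving inputs' /var/log/pacemaker/pacemaker.log
# replay that transition and save the resulting graph
crm_simulate -Sx /var/lib/pacemaker/pengine/pe-input-250.bz2 -G graph.xml
# inspect the stop action and the attributes that would be passed to the agent
grep -A 3 'nsa-1.ny_stop_0' graph.xml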

iglov commented 7 months ago

Hey @kgaillot! Thanks for the explanations and your time! Well, I have something like this there:

# synapses 0-5 are about stonith (omitted here)

<synapse id="6">
  <action_set>
    <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs/>
</synapse>
<synapse id="7">
  <action_set>
    <rsc_op id="33" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="8">
  <action_set>
    <rsc_op id="31" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs3.ny.local.priv" on_node_uuid="1">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs3.ny.local.priv" CRM_meta_on_node_uuid="1" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="9">
  <action_set>
    <crm_event id="26" operation="clear_failcount" operation_key="nsa-1.ny_clear_failcount_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_op_no_wait="true" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </crm_event>
  </action_set>
  <inputs/>
</synapse>

Looks good, doesn't it? I don't see anything wrong here. But if you still want, I can try to send you these pe-input files.

kgaillot commented 7 months ago

No, something's wrong. The resource parameters should be listed in <attributes> after the meta-attributes (like config="/mnt/cluster_volumes/lxc2/nsa-1.ny/config" container="nsa-1.ny"). Check the corresponding pe-input to see if those are properly listed under the relevant <primitive>.
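
For example, something like this (illustrative; adjust the path and input number to match the transition you identified):

# the resource parameters should show up as nvpairs under the primitive's <instance_attributes>
bzcat /var/lib/pacemaker/pengine/pe-input-250.bz2 | grep -A 6 'primitive id="nsa-1.ny"'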

iglov commented 7 months ago

Yep, sorry, you're right, my bad. I tried to find the resource nsa-1.ny in pe-input-250 (the last one before things went wrong), and that primitive isn't there at all. But it is in pe-input-249. Poof, it just disappeared...