ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

iSCSITarget always fails to start (LIO-T mode) #1026

Open johnkeates opened 7 years ago

johnkeates commented 7 years ago

The iSCSITarget RA never succeeds in automatically starting a target. It seems to first create the target and then tries to create it again (which obviously fails), and then exit with error 1. Checking targetcli shows the target, so it's not exiting gracefully either.

Manually starting it twice with pcs resource debug-start does make it work, but then it still fails in a different way: it never adds the target portal so no connections from initiators can be made.

The target primitive is about as simple as it gets:

primitive iscsi0-target iSCSITarget \
        params implementation=lio-t iqn="iqn.2017-08.access.net:prod-1-ha"  \
        op monitor interval=30s \
        meta target-role=Started

corosync: 2.4.2
pacemaker: 1.1.16
targetcli-fb: 2.1.43
OS: Debian 9.1

The RA scripts were one revision behind the ones in this repo; the only difference was targetcli lock file sharing between LUN and Target setup. I replaced the ones I had with the ones from the repo, but that didn't change anything (and I didn't really expect it to).

johnkeates commented 7 years ago

Interestingly, a comment at line 335 of the iSCSITarget RA states:

                # lio distinguishes between targets and target portal
                # groups (TPGs). We will always create one TPG, with the
                # number 1. In lio, creating a network portal
                # automatically creates the corresponding target if it
                # doesn't already exist.

So this was already known...

johnkeates commented 7 years ago

Digging around some more: at line 334 there is a loop whose if/else block creates the target itself, rather than a portal, whenever a portal equals the default portal:

                for portal in ${OCF_RESKEY_portals}; do
                        if [ $portal != ${OCF_RESKEY_portals_default} ] ; then
                                IFS=':' read -a sep_portal <<< "$portal"
                                ocf_run targetcli /iscsi/${OCF_RESKEY_iqn}/tpg1/portals create "${sep_porta$
                        else
                                ocf_run targetcli /iscsi create ${OCF_RESKEY_iqn} || exit $OCF_ERR_GENERIC
                        fi
                done

Nothing later checks for this before the function that is supposed to create the target runs, and that causes the issue.

So specifying portals="0.0.0.0:3260" or no portals at all will cause iSCSITarget to fail, since LIO-T will have created a target automatically before the RA reaches the point where it wants to create the target itself.

Manually enumerating all the portals for a target resolves this, but isn't really what you want.
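One way to avoid the double creation would be to check for the target before creating it. The following is only a rough sketch, not the agent's actual code: the function names and the CONFIGFS_ISCSI variable are made up for illustration, and it assumes LIO exposes each iSCSI target as a directory under /sys/kernel/config/target/iscsi.

```shell
#!/bin/sh
# Sketch only: CONFIGFS_ISCSI and both function names are illustrative
# and are not part of the iSCSITarget agent.
CONFIGFS_ISCSI="${CONFIGFS_ISCSI:-/sys/kernel/config/target/iscsi}"

# Succeeds if a target directory for this IQN already exists in configfs.
lio_target_exists() {
    [ -d "${CONFIGFS_ISCSI}/$1" ]
}

# Runs the given create command only when the target is missing,
# so calling it twice does not fail the second time.
# Usage: create_target_once IQN COMMAND [ARGS...]
create_target_once() {
    iqn="$1"
    shift
    if lio_target_exists "$iqn"; then
        return 0
    fi
    "$@"
}
```

In the agent this might be called as create_target_once "$OCF_RESKEY_iqn" ocf_run targetcli /iscsi create "$OCF_RESKEY_iqn", so that a target created earlier in the same start action is simply accepted.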

dmuhamedagic commented 7 years ago

On Fri, Aug 25, 2017 at 07:45:22AM -0700, John Keates wrote:

> The iSCSITarget RA never succeeds in automatically starting a target. It seems to first create the target and then tries to create it again (which obviously fails),

The start action must be idempotent, so this is already a problem. The relevant error is this:

ERROR: This Target already exists in configFS

> and then exit with error 1. Checking targetcli shows the target, so it's not exiting gracefully either.
>
> Manually starting it twice with pcs resource debug-start does make it work, but then it still fails in a different way: it never adds the target portal so no connections from initiators can be made.

Did you try to post to the pacemaker ML? You may get more audience there about RA behaviour.

Did you open a bug with Debian?

colttt commented 6 years ago

Any news, or any idea how to fix this?

johnkeates commented 6 years ago

Not from me, sorry. We moved our setup away from HA Clustered to HA load-balanced with plenty of spare capacity to have stuff fail without impact.

bvdheuvel commented 5 years ago

Hit this bug too on CentOS 7.6.1810. The patch fixes the startup, but I also needed to change the resource with pcs resource update portals=:::3260 (which fixes this because it differs from the default 0.0.0.0:3260).
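To illustrate why a non-default portals value changes the behaviour, here is a small sketch that mimics the comparison in the loop quoted earlier in the thread. It is not the agent's code: DEFAULT_PORTAL and portal_branch are illustrative names, the default value is the one mentioned in this thread, and 192.0.2.10 is just an example address.

```shell
#!/bin/sh
# Sketch only: DEFAULT_PORTAL and portal_branch are illustrative names.
# The default value is the one quoted in this thread.
DEFAULT_PORTAL="0.0.0.0:3260"

# Prints which branch of the quoted if/else a portal value would take.
portal_branch() {
    if [ "$1" != "$DEFAULT_PORTAL" ]; then
        echo "portal: tpg1/portals create"
    else
        echo "default: /iscsi create"
    fi
}

# Compare the default value, the workaround value and an explicit address.
for p in "0.0.0.0:3260" ":::3260" "192.0.2.10:3260"; do
    printf '%s -> %s\n' "$p" "$(portal_branch "$p")"
done
```

Only the value equal to the default takes the branch that runs targetcli /iscsi create; any other value, including :::3260, takes the portal-creation branch instead.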

oalbrigt commented 5 years ago

The patch in PR 1239 above should fix this issue for you.

bvdheuvel commented 5 years ago

How can I see which version CentOS is using? The agent is in package: resource-agents-4.1.1-12.el7_6.8.x86_64

oalbrigt commented 5 years ago

It isn't available in a release yet, but you should be able to patch it manually, or report to CentOS that the patch should be applied to their current version.

bvdheuvel commented 5 years ago

Thanks, I have patched it manually for now, but I want to make sure that when I upgrade the package I get a version that includes the fix. Is the automatic portal generation also fixed?