ceph / ceph-iscsi

Python3 - KeyError: 'pool' when 'refresh' is invoked #99

Open Vascko opened 5 years ago

Vascko commented 5 years ago

Hey Guys,

Creating a disk in gwcli succeeds with "ok" but then nosedives with a KeyError: 'pool'.

/disks> create rbd/esxi 100G
user provided pool/image format request
CMD: /disks/ create pool=rbd image=esxi size=100G count=1
pool 'rbd' is ok to use
Creating/mapping disk rbd/esxi
Issuing disk create request
- LUN(s) ready on all gateways
ok
Updating UI for the new disk(s)
Traceback (most recent call last):
  File "/usr/local/bin/gwcli", line 4, in <module>
    __import__('pkg_resources').run_script('ceph-iscsi==3.0', 'gwcli')
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/EGG-INFO/scripts/gwcli", line 194, in <module>
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/EGG-INFO/scripts/gwcli", line 125, in main
  File "/usr/lib/python3.7/site-packages/configshell_fb/shell.py", line 905, in run_interactive
    self._cli_loop()
  File "/usr/lib/python3.7/site-packages/configshell_fb/shell.py", line 734, in _cli_loop
    self.run_cmdline(cmdline)
  File "/usr/lib/python3.7/site-packages/configshell_fb/shell.py", line 848, in run_cmdline
    self._execute_command(path, command, pparams, kparams)
  File "/usr/lib/python3.7/site-packages/configshell_fb/shell.py", line 823, in _execute_command
    result = target.execute_command(command, pparams, kparams)
  File "/usr/lib/python3.7/site-packages/configshell_fb/node.py", line 1406, in execute_command
    return method(*pparams, **kparams)
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 261, in ui_command_create
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 355, in create_disk
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 601, in __init__
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 605, in refresh
KeyError: 'pool'

The odd thing is that from that point on invoking gwcli immediately crashes with the same error.

[root@ceph-igw01 ~]# gwcli -d
Adding ceph cluster 'ceph' to the UI
Fetching ceph osd information
Querying ceph for state information
Refreshing disk information from the config object
- Scanning will use 8 scan threads
- rbd image scan complete: 0s
Traceback (most recent call last):
  File "/usr/local/bin/gwcli", line 4, in <module>
    __import__('pkg_resources').run_script('ceph-iscsi==3.0', 'gwcli')
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/EGG-INFO/scripts/gwcli", line 194, in <module>
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/EGG-INFO/scripts/gwcli", line 105, in main
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/gateway.py", line 65, in refresh
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 139, in refresh
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 601, in __init__
  File "/usr/local/lib/python3.7/site-packages/ceph_iscsi-3.0-py3.7.egg/gwcli/storage.py", line 605, in refresh
KeyError: 'pool'

It's only fixable by completely destroying the pool and starting over.

Another snag I hit with python3 is the systemd files:

python3 setup.py install

installs the binaries under /usr/local/bin, but the provided systemd units invoke binaries in /usr/bin. It's an easy manual fix; just wanted to let you know.
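
A minimal manual fix along those lines might be a symlink, assuming the unit file points at /usr/bin/rbd-target-api (repeat for whichever other ceph-iscsi binaries the units reference):

ln -s /usr/local/bin/rbd-target-api /usr/bin/rbd-target-api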

gvikram18 commented 5 years ago

Hi,

I am facing the same issue. I have downloaded ceph-iscsi-3.0 and tcmu-runner-1.4.0 from shaman

[Screenshot from 2019-07-24 17-59-37]

[Screenshot from 2019-07-24 18-02-58]

rbd-target-api gives the following status message:

[Screenshot from 2019-07-24 18-19-28]

dillaman commented 5 years ago

Can you provide a (sanitized) copy of your "gateway.conf" (rados -p rbd get gateway.conf -)? The "pool" attribute has been a part of the disk structure for a very long time.
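
Sanitizing is easier with the object pretty-printed; since gateway.conf is stored as JSON, something like this should work:

rados -p rbd get gateway.conf - | python3 -m json.tool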

gvikram18 commented 5 years ago

[Screenshot from 2019-07-25 10-34-14]

mikechristie commented 5 years ago

@gvikram18

This is a new install, right? You didn't start from an old 2.x ceph-iscsi-config or GitHub commit, did you?

If this is a new install, I think the bug is that the initial creation failed but did not fully clean itself up. The second creation reported success but did not fully set it up.

What version is your rtslib? And is it a distro rpm, or did you install the upstream one from GitHub?

Do you have targetcli installed, and if so, is it a distro or an upstream one?

Could you start from a clean slate? Do the following (a combined sketch follows the list):

  1. Make a /etc/target and /etc/target/pr dir if you do not have them.

    It looks like there is a bug in some rtslib versions where, if targetcli has not created the /etc/target (or /var/target) dir, then when we try to create a device we will get a failure. This is due to some rtslib code checking for that dir, and the pr dir in there or in configfs.

  2. Delete the bad gateway.conf:

rados -p rbd rm gateway.conf

  3. Restart the gws. Either reboot the node or stop and start the rbd-target-api service.
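
Put together, the clean-slate steps look roughly like this on each gateway node (assuming systemd manages the service):

mkdir -p /etc/target/pr
rados -p rbd rm gateway.conf
systemctl restart rbd-target-api
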
mikechristie commented 5 years ago

The above comment is not correct. It looks like we fixed all the partial setup errors by 3.0.

Starting from a clean slate as described above, can you provide the /var/log/rbd-target-api/rbd-target-api.log for when you try to create the disk? I cannot replicate the issue here.

wwdillingham commented 4 years ago

I just encountered the above issue. I am running:

[root@cephigw002-v06c ~]# rpm -qa | grep -i -e rtslib -e iscsi -e tcmu
python-rtslib-2.1.fb68-1.noarch
libiscsi-1.9.0-7.el7.x86_64
tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
ceph-iscsi-3.0.1-1.el7.noarch
libtcmu-1.4.0-106.gd17d24e.el7.x86_64

python 2.7.5

The issue occurred when attempting to add a disk directly through the rbd-target-api (using curl), where the error received was:

disk create/update failed on vm1cephigw002. Unhandled exception: 'backstore_object_name'

When using 'gwcli -d' I got the same "KeyError: 'pool'" as in the original post above.

I fixed it by pulling down the configuration object and searching for the conf for the disk name I had attempted to create. I noticed that this disk differed from the other, working disks in that it only had the "created" key and lacked the others ("pool", "allocating_host", etc.). Upon removing the JSON section for this disk, re-uploading via rados put, and finally restarting rbd-target-api on all GWs, things were back to normal.
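
In concrete terms, that recovery is roughly the following sketch (paths are examples; back up the object before editing, and restart on every gateway):

rados -p rbd get gateway.conf /tmp/gateway.conf
cp /tmp/gateway.conf /tmp/gateway.conf.bak
vi /tmp/gateway.conf    # delete the JSON stanza for the half-created disk
rados -p rbd put gateway.conf /tmp/gateway.conf
systemctl restart rbd-target-api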

Thankfully this was on our DEV cluster, so I am not sure whether this would have been disruptive to client IO in a production cluster, but I am wondering if this is fixed in a later release? Thanks.

mikechristie commented 4 years ago

@wwdillingham

Sorry for the late reply. I have been on PTO. It is not fixed yet. I am not able to replicate the problem and was waiting on logs in my last comment.

Can you:

  1. Give me the curl command you used? Maybe we are parsing a specific string wrong, so if possible could you give me the exact values you used?

  2. Does it happen every time you run the command?

  3. Did gwcli disk creation work?

  4. Could you give me the /var/log/rbd-target-api/rbd-target-api.log for when this happens?

mikechristie commented 4 years ago

  4. Could you give me the /var/log/rbd-target-api/rbd-target-api.log for when this happens?

Oh yeah, since this was days ago now, the log info might be in the /var/log/rbd-target-api dir in one of the gzipped up files.
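
The rotated logs can be searched in place with zgrep; the pattern here is only an example:

zgrep -i 'disk create' /var/log/rbd-target-api/*.gz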

wwdillingham commented 4 years ago

@mikechristie

1) I spoke incorrectly: the initial API call was made by an external client using a .NET framework, via the exposed rbd-target-api. Also, the error msg I initially gave you from the API ("disk create/update failed on vm1cephigw002. Unhandled exception: 'backstore_object_name'") was in fact from subsequent failures (including via curl, all made while the config object was in the broken state), not from the first attempt. However, I can report that the request was made with a

PUT /disk/rbd/plesk_test0 body: "mode=create&size=256m&pool=rbd&create_image=true"
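
(As a curl call, that request would look roughly like the following; the gateway host, port 5000, http scheme, and admin:admin credentials are assumptions based on a default iscsi-gateway.cfg.)

curl --user admin:admin -X PUT \
    -d "mode=create&size=256m&pool=rbd&create_image=true" \
    http://vm1cephigw002:5000/disk/rbd/plesk_test0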

2) It does not happen every time we run the command. The same method that initially failed subsequently worked after restoring the config object and removing the rbd via the rbd command.

3) I did not attempt a gwcli disk creation because I was unable to "enter" gwcli; it would error out with "KeyError: 'pool'".

4) I can get you all the logs that I have, but would prefer to send them off GitHub; how can I best get them to you? I can also provide the contents of the config object in its errored state.

Further, I can say that I was able to quickly bounce the rbd-target-api service on our IGW01, but not on our IGW02 (which is the node listed in the rbd-target-api error above).

mikechristie commented 4 years ago

Ok, I see one way to hit it now, but am not sure if everyone is hitting the same thing.

@wwdillingham on the iscsi target systems:

  1. Do you have targetcli installed on all of them? Is one of the systems missing it?

  2. Do all the systems have a /etc/target or /var/target?

It seems some versions of rtslib require one of those dirs. If you install targetcli then they will get made. If you do not have the dirs, then we can end up partially creating the disk. We will then hit other bugs because it only got partially created and not fully set up, and we did not fully clean it up when it failed.

mikechristie commented 4 years ago

So, no matter what, we need to fix the error handler so it fully cleans up partially created disks; then if we hit any failure we do not end up in this state.

We also need to fix rtslib/targetcli so rtslib creates the dirs it needs. As a temp hack we can just install targetcli and/or have ceph-iscsi make the dirs.
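
As a sketch of that temp hack (the package name assumes a yum-based distro, and the exact dirs rtslib wants may vary by version):

yum install -y targetcli
# or, where no targetcli package is available:
mkdir -p /etc/target/pr /var/target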

wwdillingham commented 4 years ago

@mikechristie no package matching "targetcli" on either of the IGW nodes. Also neither of those directories exist.

wwdillingham commented 4 years ago

So is the targetcli package needed ONLY for the purposes of creating those dirs, /etc/target and /var/target? I think I can count myself lucky I haven't encountered more problems.

mikechristie commented 4 years ago

Yes, it's only needed because that package makes the dirs that rtslib uses. I'm not sure why rtslib has some dir names hardcoded but then relies on other apps to make them.

mikechristie commented 4 years ago

Just one clarification: the targetcli rpm, and/or the target-restore rpm if your distro has it, makes the dirs.

If you are installing from the upstream repo tarball releases or from the GH repo source code, then you have to manually make the dirs as a temp workaround.

wwdillingham commented 4 years ago

@mikechristie thanks for your help on this one. I have always been pulling my RPMs from shaman, which is perhaps why I overlooked targetcli. I was able to snatch targetcli 2.1.fb49-1.el7 from the base CentOS repos. This created /etc/target but not /var/target.

I think the feature of cleaning up partially created disks, or otherwise validating the config object as correct before committing, would be great. Thanks again for the help.