Open Vascko opened 5 years ago
Hi,
I am facing the same issue. I have downloaded ceph-iscsi-3.0 and tcmu-runner-1.4.0 from shaman.
rbd-target-api gives the following status message:
Can you provide a (sanitized) copy of your "gateway.conf" (rados -p rbd get gateway.conf -)? The "pool" attribute has been a part of the disk structure for a very long time.
@gvikram18
This is a new install, right? You didn't start from an old 2.x ceph-iscsi-config or GitHub commit, did you?
If this is a new install, I think the bug is that the initial creation failed but did not fully clean itself up. The second creation reported success but did not fully set it up.
What version is your rtslib? And is it a distro rpm, or did you install the upstream one from GitHub?
Do you have targetcli installed, and if so, is it a distro or upstream one?
Could you start from a clean slate? Do the following:
Make a /etc/target and /etc/target/pr dir if you do not have it.
It looks like there is a bug in some rtslib versions where, if targetcli has not created the /etc/target dir (or /var/target or it is not specified by or dir), then when we try to create a device we will get a failure. This is due to some rtslib code checking for that dir, and the pr dir in it, or in configfs.
Start from a clean slate. Delete the bad gateway.conf
rados -p rbd rm gateway.conf
The above comment is not correct. It looks like we fixed all the partial setup errors by 3.0.
Starting from a clean slate as described above, can you provide the /var/log/rbd-target-api/rbd-target-api.log from when you try to create the disk? I cannot replicate the issue here.
I just encountered the above issue: Running:
[root@cephigw002-v06c ~]# rpm -qa | grep -i -e rtslib -e iscsi -e tcmu
python-rtslib-2.1.fb68-1.noarch
libiscsi-1.9.0-7.el7.x86_64
tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
ceph-iscsi-3.0.1-1.el7.noarch
libtcmu-1.4.0-106.gd17d24e.el7.x86_64
python 2.7.5
The issue occurred when attempting to add a disk directly through the rbd-target-api (using curl), where the error received was:
disk create/update failed on vm1cephigw002. Unhandled exception: 'backstore_object_name'
When using 'gwcli -d' I got the same "KeyError: 'pool'" as in the original post above.
I fixed it by pulling down the configuration object and searching it for the entry of the disk I had attempted to create. I noticed that this disk differed from the other, working, disks in that it only had the "created" key and lacked the others ("pool", "allocating_host", etc.). After removing the JSON section for this disk, re-uploading the object via rados put, and finally restarting rbd-target-api on all GWs, things were back to normal.
Thankfully this was on our DEV cluster, so I am not sure whether this would have been disruptive to client IO in a production cluster, but I am wondering if this is fixed in a later release? Thanks.
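The manual repair described above (find the disk entry that only has a "created" key and drop it) can be sketched in Python. This is a minimal sketch, not ceph-iscsi code: it assumes the config object is JSON with a top-level "disks" mapping, the required-key set contains only the two keys named in this comment, and the function names and sample data are hypothetical.

```python
import json

# Keys named in this thread as missing from the broken entry. The real disk
# structure has more fields; this set is an assumption for illustration.
REQUIRED_DISK_KEYS = {"pool", "allocating_host"}


def find_incomplete_disks(config, required=REQUIRED_DISK_KEYS):
    """Return the sorted names of disk entries that lack any required key."""
    disks = config.get("disks", {})  # assumes a top-level "disks" mapping
    return sorted(name for name, entry in disks.items()
                  if not required.issubset(entry))


def prune_incomplete_disks(config, required=REQUIRED_DISK_KEYS):
    """Return a copy of config without the incomplete disk entries."""
    bad = set(find_incomplete_disks(config, required))
    cleaned = dict(config)
    cleaned["disks"] = {name: entry
                        for name, entry in config.get("disks", {}).items()
                        if name not in bad}
    return cleaned


if __name__ == "__main__":
    # Hypothetical sample shaped like the broken state described above.
    sample = json.loads("""
    {"disks": {
        "rbd/good0": {"created": "t0", "pool": "rbd", "allocating_host": "igw1"},
        "rbd/plesk_test0": {"created": "t1"}
    }}
    """)
    print(find_incomplete_disks(sample))
    print(json.dumps(prune_incomplete_disks(sample), indent=2))
```

The input would be the object dumped with rados -p rbd get gateway.conf -, and the pruned result would go back via rados put, followed by a restart of rbd-target-api on all gateways as described above. Keeping a copy of the original object before re-uploading seems prudent.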
@wwdillingham
Sorry for the late reply. I have been on PTO. It is not fixed yet. I am not able to replicate the problem and was waiting on logs in my last comment.
Can you:
Give me the curl command you used? Maybe we are parsing a specific string wrong, so if possible could you give me the exact values you used?
Does it happen every time you run the command?
Did gwcli disk creation work?
Could you give me the /var/log/rbd-target-api/rbd-target-api.log for when this happens?
Oh yeah, since this was days ago now, the log info might be in the /var/log/rbd-target-api dir in one of the gzipped up files.
@mikechristie
1) I spoke incorrectly: the initial API call was made by an external client using a .NET framework against the exposed rbd-target-api. Also, the error message I initially gave you from the API, "disk create/update failed on vm1cephigw002. Unhandled exception: 'backstore_object_name'", was in fact from subsequent failures (including via curl, all made while the config object was in a broken state), not from the first attempt. However, I can report that the request was made with:
PUT /disk/rbd/plesk_test0 body: "mode=create&size=256m&pool=rbd&create_image=true"
2) It does not happen every time we run the command. The same method that initially failed subsequently worked after restoring the config object and removing the rbd via the rbd command.
3) I did not attempt a gwcli disk creation because I was unable to "enter" gwcli; it would error out with: "KeyError: Pool"
4) I can get you all the logs that I have but would prefer to send them off GitHub. How can I best get them to you? I can also provide the contents of the config object in its errored state.
Further, I can say that I was able to quickly bounce the rbd-target-api service on our IGW01, but not on our IGW02 (which is the node listed in the rbd-target-api error above).
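For anyone trying to reproduce this, the request from item 1 can be rebuilt with the Python standard library. This is only a sketch: the base URL is a placeholder (host, port, and any path prefix are not given in this thread), authentication is omitted, and build_disk_create_request is a hypothetical helper. It builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_disk_create_request(base_url, pool, image, size):
    """Build (but do not send) the PUT request reported in this thread."""
    # Field order matches the body quoted in item 1.
    body = urlencode({
        "mode": "create",
        "size": size,
        "pool": pool,
        "create_image": "true",
    })
    url = "{}/disk/{}/{}".format(base_url.rstrip("/"), pool, image)
    return Request(url, data=body.encode("ascii"), method="PUT")


if __name__ == "__main__":
    # base_url is a placeholder, not a value taken from the thread.
    req = build_disk_create_request("http://igw.example.invalid", "rbd",
                                    "plesk_test0", "256m")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
```

Sending it would additionally need the gateway's real address and credentials, for example via urllib.request.urlopen with an Authorization header.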
Ok, I see one way to hit it now, but am not sure if everyone is hitting the same thing.
@wwdillingham on the iscsi target systems:
Do you have targetcli installed on all of them? Is one of the systems missing it?
Do all the systems have a /etc/target or /var/target?
It seems some versions of rtslib require one of those dirs. If you install targetcli then they will get made. If you do not have the dirs, then we can end up partially creating the disk. We will then hit other bugs because it only got partially created and not fully set up, and we did not fully clean it up when it failed.
So no matter what, we need to fix the error handler so that it fully cleans up partially created disks. That way, if we hit any failure, we do not end up in this state.
We also need to fix rtslib/targetcli so rtslib creates the dirs it needs. As a temp hack we can just install targetcli and/or have ceph-iscsi make the dirs.
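The kind of error handling described above (undo whatever was already set up when a later step fails) can be illustrated with a generic rollback pattern. This is a hypothetical sketch, not the ceph-iscsi error handler: run_with_rollback, DiskSetupError, and the demo steps are invented for illustration.

```python
class DiskSetupError(Exception):
    """Raised when a setup step fails and earlier steps were rolled back."""


def run_with_rollback(steps):
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse."""
    undo_stack = []
    try:
        for do, undo in steps:
            do()
            undo_stack.append(undo)
    except Exception as exc:
        for undo in reversed(undo_stack):
            try:
                undo()
            except Exception:
                pass  # best effort: keep unwinding the remaining steps
        raise DiskSetupError("setup failed, rolled back") from exc


if __name__ == "__main__":
    state = {}  # stands in for the shared config object

    def add_entry():
        state["disk"] = {"created": "now"}

    def remove_entry():
        state.pop("disk", None)

    def create_backstore():
        raise OSError("missing /etc/target")  # simulated failure

    try:
        run_with_rollback([(add_entry, remove_entry),
                           (create_backstore, lambda: None)])
    except DiskSetupError as err:
        print(err, state)
```

The point of the pattern is that a half-written entry (such as one holding only "created") never survives a failed creation, so later commands do not trip over it.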
@mikechristie no package matching "targetcli" on either of the IGW nodes. Also neither of those directories exist.
So is the targetcli package needed ONLY for the purpose of creating those dirs (/etc/target and /var/target)? I think I can count myself lucky that I haven't encountered more problems.
Yes, it's only needed because that package makes the dirs that rtslib uses. I'm not sure why rtslib has some dir names hardcoded but then relies on other apps to make them.
Just one clarification: the targetcli rpm and/or, if your distro has it, the target-restore rpm makes the dirs.
If you are installing from the upstream repo tarball releases or from the GH repo source code, then you have to manually make the dirs as a temp workaround.
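As a rough illustration of that temp workaround, the dirs mentioned earlier in this thread (/etc/target and /etc/target/pr) could be created like this. It is a minimal sketch: ensure_target_dirs is a hypothetical helper, and it uses default permissions because the thread does not say which mode rtslib expects.

```python
import os


def ensure_target_dirs(base="/etc/target"):
    """Create base and base/pr if they are missing; return both paths."""
    pr_dir = os.path.join(base, "pr")
    # makedirs creates the parent (base) as well; exist_ok makes reruns harmless.
    os.makedirs(pr_dir, exist_ok=True)
    return [base, pr_dir]
```

Calling ensure_target_dirs() with the default path needs root. It is safe to run on every gateway, since directories that already exist are left untouched.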
@mikechristie thanks for your help on this one. I have always been pulling my RPMs from shaman, perhaps why I overlooked targetcli. I was able to snatch targetcli 2.1.fb49-1.el7 from base centos repos. This created /etc/target but not /var/target.
I think the feature of cleaning up partially created disks or otherwise validating the config object as correct before committing would be great. Thanks again for the help.
Hey Guys,
Creating a disk in gwcli succeeds with ok but then nosedives with a 'KeyError: pool'. The odd thing is that from that point on invoking gwcli immediately crashes with the same error. It's only fixable by completely destroying the pool and starting over.
Another snag I hit with python3 is the systemd files:
installs the binaries under /usr/local/bin but the provided systemd units invoke binaries in /usr/bin. It's an easy manual fix; I just wanted to let you know.