ceph / ceph-iscsi-cli

NOTICE: moved to https://github.com/ceph/ceph-iscsi
GNU General Public License v3.0
25 stars 25 forks source link

Kernel hang with RHEL/Centos 7.5 when ABORTs are sent. #102

Open Jazz9 opened 6 years ago

Jazz9 commented 6 years ago

Hello,

Centos 7.5 VMware 6.0 U3 Ceph 12.2.5 with all SSD bluestore pool. ( only 20 OSD in SSD pool ) Active/standby iscsi

I have started using ceph-iscsi-cli in a small - non important production environment and last night both iscsi nodes crashed. A reboot fixed them ( they wouldn't reboot without a hard reset. The OS seemed fine but a hung task is my guess )

I am not 100% sure where to look for the logs that relate to the issue. I will put more monitoring on the system.

May 25 15:54:51 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 45 May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.794 856 [ERROR] tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0) May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0) May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45 May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 45 May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.798 856 [INFO] tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2. May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2. May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:29 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw

May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task fn-radosclient:1555 blocked for more than 120 seconds. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: fn-radosclient D ffff9df8b2218fd0 0 1555 1 0x00000080 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace: May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? __send_signal+0x18e/0x490 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_cause_connection_reinstatement+0x9e/0x100 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_session+0x109/0x180 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_release_sessions_for_tpg+0x123/0x1e0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_tpg_disable_portal_group+0xcf/0x1e0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] lio_target_tpg_enable_store+0x6e/0xf0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] configfs_write_file+0x107/0x140 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] vfs_write+0xc0/0x1f0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] SyS_write+0x7f/0xf0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] system_call_fastpath+0x1c/0x21 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task iscsi_ttx:1528 blocked for more than 120 seconds. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: iscsi_ttx D ffff9df7b194dee0 0 1528 2 0x00000084 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace: May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? list_del+0xd/0x30 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] transport_generic_free_cmd+0xa2/0x150 [target_core_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_cmd+0x82/0x140 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_close_connection+0x4c6/0x8a0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_take_action_for_connection_exit+0x8b/0x120 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsi_target_tx_thread+0x1f2/0x240 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_atomic_t+0x30/0x30 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? iscsit_thread_get_cpumask+0xf0/0xf0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] kthread+0xd1/0xe0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ret_from_fork_nospec_begin+0x21/0x21 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40

I think these may be unrelated but interesting.

May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.539 852 [WARN] tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.555 852 [WARN] tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.563 852 [WARN] tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.

May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:06:22.619 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:16:22.546 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:30:57.074 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:36:22.638 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:46:22.545 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 18:00:57.089 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported

Jazz9 commented 6 years ago

A little bit more info... Running - tcmu-runner-1.3.0-89.g7d0023e.el7.x86_64

When the lockup occurs you can't run systemctl restart tcmu-runner as it never returns. I believe that is what is preventing reboot also.

Jazz9 commented 6 years ago

Also noted that I am running the Centos 7.5 standard install of targetcli. Its version is slightly lower then the requirements on the ceph master docs. Do you think this is the issue and do you know where a pre compiled newer version may be found?

targetcli.noarch 2.1.fb46-4.el7_5

mikechristie commented 6 years ago

You are using the RHEL/Centos 7.5 kernel right?

It looks like you are hitting a bug in that kernel. It is missing this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/target?id=527268df31e57cf2b6d417198717c6d6afdb1e3e

We are working on getting a fix released. It is merged in our internal repos and being QA'd right now. It should be released in about a month.

I will update this issue with the updated kernel version.

Jazz9 commented 6 years ago

Ok, that sounds good. If you want me to test it - I can. It seems that both iscsi nodes crash every few days.

Jazz9 commented 6 years ago

and yes - Centos 7.5 with latest updates.

Jazz9 commented 6 years ago

Did this get merged?

mikechristie commented 6 years ago

It is in kernel-3.10.0-862.5 which should be released soon.

I do not know how long it takes to go from a RHEL z stream to Centos kernel though.

Jazz9 commented 6 years ago

Seems the latest Centos patches helped. No more crashes.

VRAseb commented 6 years ago

Hello,

We have the same error with a 4.18.9-1 kernel with a single gateway (without multipathing) and ceph 13.2.0 . Has this error been identified? Is there a patch?

Sep 25 07:19:00 src-ceph-gwiscsi1 tcmu-runner: 2018-09-25 07:19:00.359 1204 [DEBUG] alua_implicit_transition:568 rbd/rbd.VMW_CEPH_DS08: lock state 1 Sep 25 07:19:09 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 71303168 Sep 25 07:20:01 src-ceph-gwiscsi1 systemd: Started Session 190 of user root. Sep 25 07:20:01 src-ceph-gwiscsi1 systemd: Starting Session 190 of user root. Sep 25 07:20:27 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:20:27 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:20:43 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:20:43 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:20:58 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:20:58 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:14 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:21:14 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:29 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:21:29 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:44 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:21:44 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: INFO: task kworker/u256:2:95199 blocked for more than 120 seconds. Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: kworker/u256:2 D 0 95199 2 0x80000080 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Workqueue: tmr-user target_tmr_work [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Call Trace: Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule+0x2ab/0x880 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? kvasprintf+0x7f/0xd0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x161/0x300 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? __next_timer_interrupt+0xe0/0xe0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: wait_for_completion_timeout+0x130/0x1a0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: transport_wait_for_tasks+0xdd/0x140 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: transport_wait_for_tasks+0x53/0x90 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: core_tmr_abort_task+0xfb/0x180 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: target_tmr_work+0x128/0x140 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: process_one_work+0x179/0x390 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: worker_thread+0x4f/0x3e0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? max_active_store+0x80/0x80 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40 Sep 25 07:22:00 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:00 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:22:15 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:15 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:22:30 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:30 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:22:46 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:46 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:01 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:01 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:16 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:16 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:32 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:32 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:47 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:47 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:53 src-ceph-gwiscsi1 kernel: INFO: task kworker/0:0:111368 blocked for more than 120 seconds.

mikechristie commented 6 years ago

@VRAseb

You are hitting a similar issue in that a command is taking a long time which leads to the initiator timing it out and sending an abort.

You not hitting the same issue that was being discussed in this GH issue though. The other user was hitting a issue that only occurred in the RHEL kernel because it was missing a patch. Because of this even when the command finally unjammed it self and completed the kernel hit a bug and would never unjam.

In your case from the logs above it looks like the command just never completed. You would want to look in the ceph logs to check if something happened to the cluster or if tcmu-runner crashed so it could not complete the command.

VRAseb commented 6 years ago

Hello, the incident we encounter blocks the service rbd-target-gw. We can not restart the rbd-target-gw service. We can not restart OS, the kernel is blocked. We have no alerts in CEPH journals (HEALTH_OK)

We tried with the tcmu-runner 1.4.0. We have the same incident.

Can you help us?

mikechristie commented 6 years ago

The rbd-target-gw service and kernel are blocked on the stuck command.

Is tcmu-runner running (just do a 'ps -u root | grep runner')?

VRAseb commented 6 years ago

Hello, tcmu-runner does not run after crash.

mikechristie commented 6 years ago

tcmu-runner does not run after crash.

Do you mean when you see the kernel message about the hung_task_timeout_sec, tcmu-runner is not running still? Has it crashed? If so, could you get me the core dump if there is one or get the /var/log/messages around the time tcmu-runner crashes so we can see if there is a stack dump from that? Could you also get me the /var/log/tcmu-runner.log for around the same time?

VRAseb commented 6 years ago

Hello,

messages : Oct 4 05:45:01 src-ceph-gwiscsi1 systemd: Started Session 283 of user root. Oct 4 05:45:01 src-ceph-gwiscsi1 systemd: Starting Session 283 of user root. Oct 4 05:49:25 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 97406060 Oct 4 05:49:35 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 97406062 Oct 4 05:50:01 src-ceph-gwiscsi1 systemd: Started Session 284 of user root. Oct 4 05:50:01 src-ceph-gwiscsi1 systemd: Starting Session 284 of user root. Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: INFO: task iscsi_trx:2513 blocked for more than 120 seconds. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_trx D 0 2513 2 0x80000080 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Call Trace: Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x2ab/0x880 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kvasprintf+0x7f/0xd0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x161/0x300 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? __next_timer_interrupt+0xe0/0xe0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: wait_for_completion_timeout+0x130/0x1a0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: transport_wait_for_tasks+0xdd/0x140 [target_core_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: target_wait_free_cmd+0x56/0x90 [target_core_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: transport_generic_free_cmd+0x103/0x140 [target_core_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_free_cmd+0x43/0x90 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_close_connection+0x586/0x8f0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_rx_thread+0xe2/0xf0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? iscsi_target_tx_thread+0x1f0/0x1f0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: INFO: task kworker/0:0:164935 blocked for more than 120 seconds. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kworker/0:0 D 0 164935 2 0x80000080 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Call Trace: Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x1dc/0x300 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: wait_for_completion+0x121/0x180 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_stop_session+0x1b1/0x1c0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_check_for_session_reinstatement+0x1ed/0x280 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_do_login+0x31d/0x5f0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_do_login_rx+0x26f/0x340 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? iscsi_target_check_login_request+0x1a0/0x1a0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: process_one_work+0x179/0x390 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: worker_thread+0x4f/0x3e0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? max_active_store+0x80/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.

The tcmu log stops at the time of the crash.

mikechristie commented 6 years ago

I think we are talking about different things. The log snippet you posted above is not a crash. It is a warning indicating that a command is stuck.

I am asking if when you see this is tcmu-runner running? If you run

systemctl status tcmu-runner

does it show it is running?

If it is then the command is could be stuck in the ceph cluster somewhere. You then should check the /var/log/ceph logs on the OSDs to see if there are any errors in there.

Also what tools have you used to set this up? Did you use the ceph-iscsi tools or did you run targetlci manually?

mikechristie commented 6 years ago

Oh yeah, if you are setting it up yourself with targetcli set the osd_op_timeout to something like 15 seconds. This will force most commands to fail if they have not completed within that timeout. We can then verify that they are getting sent to the OSD and getting stuck there.

manens commented 6 years ago

Hi @mikechristie, I investigated this case and this is what I can tell :

From what I see, I agree with your first idea that tcmu crashed and never complete the command.

Could you give instruction to get the core dump of tcmu when the problem happen ? What can we provide to help ?