LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

drbd_transport_rdma incompatible with OFED drivers #59

Closed wzrdtales closed 10 months ago

wzrdtales commented 1 year ago

Although https://kb.linbit.com/enabling-rdma-support-in-linux says that DRBD should work with OFED, the DRBD DKMS module is not actually compiled against the correct (OFED) headers.

I used the packages from ppa.launchpad.net/linbit/linbit-drbd9-stack/. Loading the module fails with:

drbd_transport_rdma: disagrees about version of symbol ib_dealloc_pd_user
drbd_transport_rdma: Unknown symbol ib_dealloc_pd_user (err -22)
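
For context, "disagrees about version of symbol" is the kernel's modversions check: the CRC recorded for `ib_dealloc_pd_user` when `drbd_transport_rdma` was built does not match the CRC exported by the `ib_core` that is actually loaded. A minimal sketch of how one might compare the two `Module.symvers` tables is below. The CRC values are made up for illustration, and the helper names `parse_symvers` and `crc_mismatches` are hypothetical, not part of DRBD or OFED.

```python
def parse_symvers(text):
    """Map symbol name -> CRC string from the contents of a Module.symvers file."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # The first two whitespace-separated fields are the CRC and the symbol name.
        if len(fields) >= 2 and fields[0].startswith("0x"):
            table[fields[1]] = fields[0].lower()
    return table


def crc_mismatches(built_against, loaded, symbols):
    """Return {symbol: (crc_built_against, crc_loaded)} for symbols whose CRCs differ."""
    mismatches = {}
    for sym in symbols:
        a, b = built_against.get(sym), loaded.get(sym)
        if a != b:
            mismatches[sym] = (a, b)
    return mismatches


# Illustrative data only: these CRCs are invented, not taken from a real system.
in_tree = parse_symvers(
    "0x11111111\tib_dealloc_pd_user\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\n"
)
ofed = parse_symvers(
    "0x22222222\tib_dealloc_pd_user\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\n"
)
print(crc_mismatches(in_tree, ofed, ["ib_dealloc_pd_user"]))
```

In practice the two inputs would be the `Module.symvers` the DKMS build used and the one shipped with the installed OFED stack. Their locations depend on the installation and are not given in this report.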

Even after building from source and correcting this, the kernel oopses as soon as the connection is attempted:

os-hv-am-3: conn( Unconnected -> Connecting )
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.703350] BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.704772] #PF: supervisor instruction fetch in kernel mode
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.706154] #PF: error_code(0x0010) - not-present page
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.707532] PGD 0 P4D 0 
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.708902] Oops: 0010 [#2] SMP NOPTI
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.710264] CPU: 30 PID: 887 Comm: kworker/30:1 Tainted: G      D    OE     5.4.0-150-generic #167-Ubuntu
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.711671] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.7.3 03/31/2022
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.713101] Workqueue: events dtr_cma_connect_work_fn [drbd_transport_rdma]
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.714509] RIP: 0010:0x0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.715854] Code: Bad RIP value.
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.717178] RSP: 0018:ffffb67bc247bbd8 EFLAGS: 00010246
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.718464] RAX: 0000000000000000 RBX: ffff9b040b142000 RCX: 0000000000008000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.719732] RDX: ffffb67bc247bc08 RSI: ffffb67bc247bc90 RDI: ffff9b04216e0000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.720973] RBP: ffffb67bc247bde8 R08: 0000000000000014 R09: 8080808080808080
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.722181] R10: ffff9b04bcefb980 R11: 0000000000102000 R12: 0000000000028000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.723357] R13: ffff9b042427a200 R14: ffff9b040b142000 R15: ffff9b042427a200
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.724503] FS:  0000000000000000(0000) GS:ffff9b04fdb80000(0000) knlGS:0000000000000000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.725635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.726735] CR2: ffffffffffffffd6 CR3: 000000bc9abf0000 CR4: 0000000000340ee0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.727819] Call Trace:
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.728899]  dtr_cm_alloc_rdma_res+0x87/0x5c0 [drbd_transport_rdma]
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.730003]  ? update_dl_rq_load_avg+0x1d7/0x2c0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.731104]  ? sched_clock_cpu+0x11/0xb0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.732207]  ? dbs_update_util_handler+0x1b/0x80
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.733307]  ? cpufreq_dbs_governor_start+0x180/0x180
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.734400]  ? update_blocked_averages+0x11c/0x590
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.735487]  ? sched_clock+0x9/0x10
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.736565]  ? update_load_avg+0x7c/0x670
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.737639]  ? update_load_avg+0x7c/0x670
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.738715]  ? set_next_entity+0xb5/0x200
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.739775]  dtr_path_prepare+0x111/0x240 [drbd_transport_rdma]
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.740816]  dtr_cma_connect_work_fn+0x93/0x180 [drbd_transport_rdma]
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.741856]  process_one_work+0x1eb/0x3b0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.742864]  worker_thread+0x4d/0x400
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.743861]  kthread+0x104/0x140
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.744845]  ? process_one_work+0x3b0/0x3b0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.745823]  ? kthread_park+0x90/0x90
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.746797]  ret_from_fork+0x35/0x40
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.747763] Modules linked in: drbd_transport_rdma(OE) drbd(OE) lru_cache nf_conntrack_netlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter aufs cuse overlay rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx4_ib(OE) ib_core(OE) nls_iso8859_1 ipmi_ssif amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul mgag200 drm_vram_helper ttm ghash_clmulni_intel drm_kms_helper dell_smbios input_leds joydev i2c_algo_bit aesni_intel fb_sys_fops dcdbas syscopyarea crypto_simd cryptd sysfillrect dell_wmi_descriptor wmi_bmof sysimgblt glue_helper ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid bridge stp llc
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.747802]  bonding sch_fq_codel ramoops knem(OE) reed_solomon drm efi_pstore ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx4_en(OE) hid_generic usbhid hid raid1 mlx5_core(OE) crc32_pclmul mlx4_core(OE) tg3 ahci nvme libahci tls nvme_core mlxfw(OE) mlx_compat(OE) i2c_piix4 wmi
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.761929] CR2: 0000000000000000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.763213] ---[ end trace 7f4eaf44141a7de5 ]---
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.850615] RIP: 0010:0x0
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.851892] Code: Bad RIP value.
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.853137] RSP: 0018:ffffb67bc24d3bd8 EFLAGS: 00010246
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.854381] RAX: 0000000000000000 RBX: ffff9b040b142000 RCX: 0000000000008000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.855632] RDX: ffffb67bc24d3c08 RSI: ffffb67bc24d3c90 RDI: ffff9b04216e0000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.856894] RBP: ffffb67bc24d3de8 R08: 0000000000000014 R09: 8080808080808080
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.858156] R10: ffff9b04bcefb980 R11: 0000000000102000 R12: 0000000000028000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.859410] R13: ffff9b049f822c00 R14: ffff9b040b142000 R15: ffff9b049f822c00
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.860661] FS:  0000000000000000(0000) GS:ffff9b04fdb80000(0000) knlGS:0000000000000000
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.861926] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 12 20:28:11 os-hv-am-4 kernel: [  245.863190] CR2: ffffffffffffffd6 CR3: 000000bc9abf0000 CR4: 0000000000340ee0

So drbd_transport_rdma appears to be incompatible with the OFED drivers.
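
When reading an oops like the one above, the "Modules linked in" line shows which modules are out-of-tree: a module followed by `(OE)` carries the `O` (out-of-tree) and `E` (unsigned) taint flags. Here that covers both the DRBD modules and the OFED RDMA stack (`ib_core`, `rdma_cm`, `mlx5_ib`, and others). The sketch below pulls those names out of such a line. The function name `out_of_tree_modules` is hypothetical, and the sample string is a shortened excerpt of the log above.

```python
import re


def out_of_tree_modules(modules_line):
    """Return module names whose per-module flags include 'O' (out-of-tree)."""
    # Flagged entries look like "name(FLAGS)"; unflagged modules have no parentheses.
    pairs = re.findall(r"(\w+)\(([A-Z+\-]+)\)", modules_line)
    return [name for name, flags in pairs if "O" in flags]


# Shortened excerpt of the "Modules linked in" line from the oops.
sample = (
    "Modules linked in: drbd_transport_rdma(OE) drbd(OE) lru_cache "
    "nf_conntrack_netlink rdma_cm(OE) ib_core(OE) kvm_amd mlx_compat(OE)"
)
print(out_of_tree_modules(sample))
```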

Philipp-Reisner commented 10 months ago

In the meantime, we have made building with OFED easier and fixed several bugs in drbd_transport_rdma. I have people using it with and without OFED, on cards ranging from ConnectX-3 to ConnectX-6.

wzrdtales commented 10 months ago

Great, I will retest soon.

Philipp-Reisner commented 10 months ago

Make sure to test with DRBD 9.2.7.