SoftRoCE / librxe-dev

Development repository for RXE user space code.

SoftRoCE stops responding to ConnectRequest after 1000 successful connections, rdma_rxe kernel error #14

Closed Nik-Sch closed 4 years ago

Nik-Sch commented 6 years ago

Setup: I have a host and a VirtualBox guest, both running SoftRoCE on a current Ubuntu. I am implementing a client/server model in which the client connects to the server, the two exchange basic information such as the file size and a memory region, and the client then transfers a file to the server via RDMA Write. After the transfer has finished, the server disconnects from the client. For clarification I attached a screenshot from Wireshark capturing two such file transfers, where 192.168.99.1 is the "client" and 192.168.99.102 is the "server".

However, when I tried some stress testing, at some point the server stopped responding to ConnectRequests altogether. The stress test looked as follows: a script ran the program described above repeatedly, one run after another, with different files and file sizes. This works well for roughly the first 1000 files. After that, the client still tries to connect to the server (I can see the ConnectRequests in Wireshark), but the server doesn't respond at all and the client eventually gives up (after the configured 16 CM retries). dmesg showed the following kernel error:

[Fr Jun  8 13:52:35 2018] Modules linked in: rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm rdma_cm iw_cm ib_cm ib_uverbs ib_core sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_intel8x0 snd_ac97_codec input_leds ac97_bus aesni_intel snd_pcm joydev snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd vboxvideo ttm drm_kms_helper vboxguest aes_x86_64 mac_hid drm i2c_piix4 fb_sys_fops syscopyarea sysfillrect soundcore sysimgblt serio_raw crypto_simd glue_helper cryptd intel_rapl_perf parport_pc ppdev lp parport autofs4 hid_generic usbhid hid psmouse e1000 ahci libahci video pata_acpi
[Fr Jun  8 13:52:35 2018] CPU: 5 PID: 18644 Comm: server Tainted: G      D         4.13.0-38-generic #43~16.04.1-Ubuntu
[Fr Jun  8 13:52:35 2018] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[Fr Jun  8 13:52:35 2018] task: ffff9fe3495b5d00 task.stack: ffffb448015e4000
[Fr Jun  8 13:52:35 2018] RIP: 0010:_etext+0x10/0x20
[Fr Jun  8 13:52:35 2018] RSP: 0018:ffffb448015e7ba0 EFLAGS: 00010286
[Fr Jun  8 13:52:35 2018] RAX: ff000000ff000000 RBX: ffff9fe352954c40 RCX: 0000000000000000
[Fr Jun  8 13:52:35 2018] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff9fe3529544a0
[Fr Jun  8 13:52:35 2018] RBP: ffffb448015e7bc0 R08: ffff9fe35fcdae20 R09: ffff9fe3537106d8
[Fr Jun  8 13:52:35 2018] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9fe3529544a0
[Fr Jun  8 13:52:35 2018] R13: ffff9fe352954870 R14: ffff9fe3529544a0 R15: ffff9fe3498df120
[Fr Jun  8 13:52:35 2018] FS:  00007f7b63d71740(0000) GS:ffff9fe35fd40000(0000) knlGS:0000000000000000
[Fr Jun  8 13:52:35 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fr Jun  8 13:52:35 2018] CR2: 00000000019c3818 CR3: 0000000212a68004 CR4: 00000000000606e0
[Fr Jun  8 13:52:35 2018] Call Trace:
[Fr Jun  8 13:52:35 2018]  ? rxe_elem_release+0x25/0x60 [rdma_rxe]
[Fr Jun  8 13:52:35 2018]  rxe_requester+0x6ad/0x1220 [rdma_rxe]
[Fr Jun  8 13:52:35 2018]  ? check_preempt_wakeup+0xfb/0x240
[Fr Jun  8 13:52:35 2018]  ? lock_timer_base+0x7d/0xa0
[Fr Jun  8 13:52:35 2018]  __rxe_do_task+0x1a/0x30 [rdma_rxe]
[Fr Jun  8 13:52:35 2018]  rxe_qp_destroy+0x61/0xa0 [rdma_rxe]
[Fr Jun  8 13:52:35 2018]  rxe_destroy_qp+0x22/0x50 [rdma_rxe]
[Fr Jun  8 13:52:35 2018]  ib_destroy_qp+0x128/0x210 [ib_core]
[Fr Jun  8 13:52:35 2018]  uverbs_free_qp+0x37/0xa0 [ib_uverbs]
[Fr Jun  8 13:52:35 2018]  remove_commit_idr_uobject+0x23/0x70 [ib_uverbs]
[Fr Jun  8 13:52:35 2018]  _rdma_remove_commit_uobject+0x2a/0xc0 [ib_uverbs]
[Fr Jun  8 13:52:35 2018]  rdma_remove_commit_uobject+0x34/0x60 [ib_uverbs]
[Fr Jun  8 13:52:35 2018]  ib_uverbs_destroy_qp+0x7b/0xe0 [ib_uverbs]
[Fr Jun  8 13:52:35 2018]  ib_uverbs_write+0x198/0x400 [ib_uverbs]
[Fr Jun  8 13:52:35 2018]  ? common_file_perm+0x54/0x110
[Fr Jun  8 13:52:35 2018]  ? tty_write+0x1d4/0x2f0
[Fr Jun  8 13:52:35 2018]  ? apparmor_file_permission+0x1a/0x20
[Fr Jun  8 13:52:35 2018]  __vfs_write+0x1b/0x40
[Fr Jun  8 13:52:35 2018]  vfs_write+0xb8/0x1b0
[Fr Jun  8 13:52:35 2018]  ? entry_SYSCALL_64_after_hwframe+0xb1/0x139
[Fr Jun  8 13:52:35 2018]  SyS_write+0x55/0xc0
[Fr Jun  8 13:52:35 2018]  ? entry_SYSCALL_64_after_hwframe+0x79/0x139
[Fr Jun  8 13:52:35 2018]  entry_SYSCALL_64_fastpath+0x24/0xab
[Fr Jun  8 13:52:35 2018] RIP: 0033:0x7f7b6352b4bd
[Fr Jun  8 13:52:35 2018] RSP: 002b:00007ffe4ec67190 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[Fr Jun  8 13:52:35 2018] RAX: ffffffffffffffda RBX: 00007f7b63515b20 RCX: 00007f7b6352b4bd
[Fr Jun  8 13:52:35 2018] RDX: 0000000000000018 RSI: 00007ffe4ec671b0 RDI: 0000000000000004
[Fr Jun  8 13:52:35 2018] RBP: 0000000000001051 R08: 65746f6d6572206d R09: 0000000000000001
[Fr Jun  8 13:52:35 2018] R10: 00000000000000c2 R11: 0000000000000293 R12: 00007f7b63515b78
[Fr Jun  8 13:52:35 2018] R13: 00007f7b63515b78 R14: 000000000000270e R15: 00007f7b63516218
[Fr Jun  8 13:52:35 2018] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 e8 07 00 00 00 f3 90 0f ae e8 eb f9 48 89 04 24 <c3> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e8 07 00 00 00 
[Fr Jun  8 13:52:35 2018] RIP: _etext+0x10/0x20 RSP: ffffb448015e7ba0
[Fr Jun  8 13:52:35 2018] ---[ end trace e3c1d563e75cd22a ]---
[Fr Jun  8 13:53:05 2018] rdma_rxe: element already exists!
[Fr Jun  8 13:53:06 2018] rdma_rxe: no qp matches qpn 0x1
[Fr Jun  8 13:53:13 2018] rdma_rxe: no qp matches qpn 0x1
[Fr Jun  8 13:53:19 2018] rdma_rxe: no qp matches qpn 0x1
[Fr Jun  8 13:53:25 2018] rdma_rxe: no qp matches qpn 0x1
[...]
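
For anyone comparing logs, the rdma_rxe messages that follow the trace can be tallied with a short script. This is only a sketch: the function name `count_rxe_messages` is made up for illustration, and it assumes dmesg was run with human-readable timestamps (as in the output above), so each line starts with a bracketed timestamp.

```python
import re
from collections import Counter

# Matches lines such as "[Fr Jun  8 13:53:06 2018] rdma_rxe: no qp matches qpn 0x1".
# Call-trace lines that merely end in "[rdma_rxe]" are deliberately not matched.
RXE_LINE = re.compile(r"^\[[^\]]*\]\s+rdma_rxe:\s+(?P<msg>.*)$")


def count_rxe_messages(lines):
    """Count how often each rdma_rxe driver message appears in dmesg output."""
    counts = Counter()
    for line in lines:
        match = RXE_LINE.match(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts
```

Feeding it the lines of saved dmesg output shows at a glance whether "no qp matches qpn" keeps repeating after the first failure.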

Trying to restart rxe with rxe_cfg stop && rxe_cfg start doesn't work because rxe_cfg stop times out. After rebooting the server, however, everything works as expected again (a clean shutdown also times out, so I have to unplug the power and reboot).
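
To make the reproduction steps concrete, the stress test described above could be sketched roughly like this. It is not the original script: the helper names, the timeout, and the idea that the client takes the file path as its last argument are all assumptions for illustration.

```python
import os
import subprocess
import tempfile


def make_test_file(directory, size):
    """Create a file of `size` random bytes and return its path."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path


def run_stress(client_cmd, sizes, timeout=60, runner=subprocess.run):
    """Run the client once per file size, one run after another.

    Returns the index of the first failed transfer (non-zero exit code
    or timeout), or None if every transfer succeeded.
    """
    with tempfile.TemporaryDirectory() as workdir:
        for i, size in enumerate(sizes):
            path = make_test_file(workdir, size)
            try:
                # Assumption: the client accepts the file path as last argument.
                result = runner(client_cmd + [path], timeout=timeout)
                failed = result.returncode != 0
            except subprocess.TimeoutExpired:
                failed = True
            os.remove(path)
            if failed:
                return i
    return None
```

A call such as `run_stress(["./client", "192.168.99.102"], [4096] * 2000)` (with `./client` standing in for the real client binary) would then report the iteration at which the server stops answering.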

Does anyone have an idea what the problem is or how I could fix it? If needed, I can supply further system information on Monday.

[Screenshot: roce_file_transfer]

cobbwho commented 4 years ago

Hi, I have a problem similar to yours. Did you find a solution?

When I test SoftRoCE with ib_write_bw, the test completes and reports the correct average bandwidth. Afterwards, however, the network card seems to crash: commands such as ifconfig or rxe_cfg no longer work, although ssh is still alive. The reboot command does not work either, but a physical restart does, just as you mentioned.

Nik-Sch commented 4 years ago

No, I was not able to solve it. I was also having some bandwidth problems in the VirtualBox environment, and since the virtualization was only for early testing, I simply moved on to "real" servers, where the problem was gone. In general, my experience was that these kernel modules did not work perfectly in virtualized environments, but I cannot point to any dmesg output or similar to show it. I have also seen a lot of stability improvements in the new Mellanox OFED drivers (4.7), which deprecate rxe_cfg in favor of other provided tools (Mellanox docs).

cobbwho commented 4 years ago

Hi, bro. Thank you so much for your prompt answer. Although I still don't know what causes the crash, I found a way to avoid it. What matters is the state of the machine: in my test setup, once I have run ./rxe_cfg start, any new ssh connection crashes the computer (the reboot command does not work; it has to be restarted manually). As long as no new ssh connection is opened, there are no problems. In addition, after I unload the module with rmmod or run ./rxe_cfg stop, a new ssh connection no longer crashes the computer.

Keepmoving-ZXY commented 3 years ago

I encounter the same error when running as a non-root Linux user.

Keepmoving-ZXY commented 3 years ago

I encounter the same error when running as a non-root Linux user.

No error occurs when I run the code as root.