Mellanox / libvma

Linux user space library for network socket acceleration based on RDMA compatible network adaptors
https://www.mellanox.com/products/software/accelerator-software/vma?mtag=vma

VMA ERROR: vlist[0x7fbc567fcad0]:302:push_back() Buff is already a member in a list! #974

Open syspro4 opened 2 years ago

syspro4 commented 2 years ago

Subject

VMA ERROR: vlist[0x7fbc567fcad0]:302:push_back() Buff is already a member in a list!

Issue type

Configuration:

Actual behavior:

I set up GlusterFS, created a Gluster volume, and mounted that volume on a host using the GlusterFS FUSE protocol. When I ran fio with rw=read, I started seeing the following VMA errors.

fio command: fio --error_dump=1 --direct=1 --verify_dump=1 --ioengine=libaio --size=100G --name=tt --bs=1M --nrfiles=8 --iodepth=8 --directory=/mnt_vol --rw=read --time_based=1 --runtime=120

Note: Same fio command with rw=write worked perfectly fine.

VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.3.1-1 Release built on Oct 9 2021 11:01:45
VMA INFO: Cmd Line: /usr/sbin/glusterfsd -s 192.168.2.244 --volfile-id ns1.192.168.2.244.mnt-vol -p /var/run/gluster/vols/ns1/192.168.2.244-mnt-vol.pid -S /var/run/gluster/834a347a5e8d7a50.socket --brick-name /mnt/vol -l /var/log/glusterfs/bricks/mnt-vol.log --xlator-option *-posix.glusterd-uuid=9e6c297d-dc87-4866-89b6-8ada6d5d35eb --process-name brick --brick-port 49152 --xlator-option ns1-server.listen-
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA ERROR: vlist[0x7fbc567fcad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc56ffdad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc44c44ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc567fcad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc46447ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc46447ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc567fcad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc46447d70]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc46c48ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc46447ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc56ffdd70]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45c46ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45c46ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45c46ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc37ffdad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc56ffdad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45c46ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc56ffdad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc37ffdad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc56ffdad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445ad0]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc56ffdad0]:302:push_back() Buff is already a member in a list!
^C
VMA ERROR: vlist[0x7fbc46447d70]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc46447d70]:302:push_back() Buff is already a member in a list!
VMA ERROR: vlist[0x7fbc45445d70]:302:push_back() Buff is already a member in a list!
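For readers unfamiliar with this kind of message: its wording suggests a guard in an intrusive list, where each node carries its own links and therefore can sit in only one list at a time. The following is a minimal sketch of such a guard, not libvma's actual vlist code; `Node`, `IntrusiveList`, and all member names are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical illustration only; this is NOT libvma's list implementation.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
    bool linked = false;  // true while the node belongs to some list
};

class IntrusiveList {
public:
    IntrusiveList() { head_.prev = head_.next = &head_; }

    // Refuses (and logs) when the node is still linked somewhere,
    // instead of overwriting its links and corrupting both lists.
    bool push_back(Node* n) {
        if (n->linked) {
            std::fprintf(stderr, "node %p is already a member in a list!\n",
                         static_cast<void*>(n));
            return false;
        }
        n->prev = head_.prev;
        n->next = &head_;
        head_.prev->next = n;
        head_.prev = n;
        n->linked = true;
        ++size_;
        return true;
    }

    // Unlinks and returns the first node, or nullptr when the list is empty.
    Node* get_and_pop_front() {
        if (empty()) return nullptr;
        Node* n = head_.next;
        head_.next = n->next;
        n->next->prev = &head_;
        n->prev = n->next = nullptr;
        n->linked = false;
        --size_;
        return n;
    }

    bool empty() const { return size_ == 0; }
    std::size_t size() const { return size_; }

private:
    Node head_;             // sentinel: head_.next is the first element
    std::size_t size_ = 0;
};
```

In this sketch, pushing a node that is still linked into any list is rejected, so a stream of such messages would point at a caller inserting the same object twice without removing it in between. Whether that is what happens inside VMA here is an assumption, not something the log confirms.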

Please help me. Thanks in advance.

syspro4 commented 2 years ago

I managed to get the crash dump while running fio with rw=read.
Am I missing any configuration/parameter setting? Please help.

[root@ core]# gdb glfs_epoll004.11.core.80614
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfsd -s 192.168.2.241 --volfile-id ns1.192.168.2.241.mnt-G'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 epoll_wait_call::get_current_events (this=this@entry=0x7effaed29e00) at iomux/epoll_wait_call.cpp:149
149 iomux/epoll_wait_call.cpp: No such file or directory.
[Current thread is 1 (Thread 0x7effaed2b700 (LWP 80630))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.0.1.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 krb5-libs-1.18.2-8.el8.x86_64 libacl-2.2.53-1.el8.x86_64 libaio-0.3.112-1.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libcom_err-1.45.6-1.el8.x86_64 libgcc-8.4.1-1.0.1.el8.x86_64 libibverbs-32.0-4.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 librdmacm-32.0-4.el8.x86_64 libstdc++-8.4.1-1.0.1.el8.x86_64 libtirpc-1.1.4-4.el8.x86_64 liburing-1.0.7-3.el8.x86_64 libuuid-2.32.1-27.el8.x86_64 libvma-9.3.1-1.el8.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 pcre2-10.32-2.el8.x86_64 sssd-client-2.4.0-9.0.1.el8.x86_64 userspace-rcu-0.11.1-3.fc32.x86_64 zlib-1.2.11-17.el8.x86_64

(gdb) bt
#0 epoll_wait_call::get_current_events (this=this@entry=0x7effaed29e00) at iomux/epoll_wait_call.cpp:149
#1 0x00007f00adc88f1c in epoll_wait_helper (epfd=, events=events@entry=0x7effaed29f94, maxevents=maxevents@entry=1, timeout=timeout@entry=-1, sigmask=__sigmask@entry=0x0) at sock/sock-redirect.cpp:2440
#2 0x00007f00adc88fe8 in epoll_wait (epfd=, events=events@entry=0x7effaed29f94, maxevents=maxevents@entry=1, timeout=__timeout@entry=-1) at sock/sock-redirect.cpp:2461
#3 0x00007f00ad904732 in event_dispatch_epoll_worker (data=0x7effb0006560) at event-epoll.c:741
#4 0x00007f00ac42715a in start_thread () from /lib64/libpthread.so.0
#5 0x00007f00abc70dd3 in clone () from /lib64/libc.so.6
(gdb)

Following is the code snippet for line 149:

76 int epoll_wait_call::get_current_events()
77 {
...
138 /*
139 for checking ring migration we need a socket context.
140 in epoll we separate the rings from the sockets, so only here we access the sockets.
141 therefore, it is most convenient to check it here.
142 we need to move the ring migration to the epfd, going over the registered sockets,
143 when polling the rings was not fruitful.
144 this will be more similar to the behavior of select/poll.
145 see RM task 212058
146 */
147 while (!socket_fd_list.empty()) {
148 socket_fd_api* sockfd = socket_fd_list.get_and_pop_front();
149 sockfd->consider_rings_migration();
150 }
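A SIGSEGV on line 149 is consistent with the popped pointer being null or stale, although the core dump alone does not prove which. As a rough illustration of the drain-loop pattern in the snippet above with a defensive null check added, here is a self-contained sketch. It is not a proposed patch to libvma: `FakeSocket` and `drain_and_consider` are hypothetical names, and `std::deque` stands in for VMA's own list type.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a socket object; not libvma's socket_fd_api.
struct FakeSocket {
    int migrations_considered = 0;
    void consider_rings_migration() { ++migrations_considered; }
};

// Drains the queue like the loop in the snippet, but skips null entries
// instead of dereferencing them. Returns the number of skipped entries.
std::size_t drain_and_consider(std::deque<FakeSocket*>& pending) {
    std::size_t skipped = 0;
    while (!pending.empty()) {
        FakeSocket* sock = pending.front();
        pending.pop_front();
        if (sock == nullptr) {
            ++skipped;  // a real fix would need to find why this happened
            continue;
        }
        sock->consider_rings_migration();
    }
    return skipped;
}
```

A null check like this only hides the symptom. It cannot catch a dangling pointer, and if the list itself was corrupted by a double insertion, the underlying cause would still need to be tracked down.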

Thanks!

syspro4 commented 2 years ago

Can someone please share an update on this issue?

igor-ivanov commented 2 years ago

I would recommend setting VMA_TRACELEVEL=4 and then examining or sharing the debug output. You can also try launching your application with these extra VMA options: VMA_RING_MIGRATION_RATIO_TX=-1 VMA_RING_MIGRATION_RATIO_RX=-1

igor-ivanov commented 2 years ago

@syspro4 do you see the issue with VMA_RING_MIGRATION_RATIO_TX=-1 VMA_RING_MIGRATION_RATIO_RX=-1 ?