analogdevicesinc / libiio

A cross platform library for interfacing with local and remote Linux IIO devices
http://analogdevicesinc.github.io/libiio/
GNU Lesser General Public License v2.1
488 stars 316 forks source link

iio_buffer_destroy() sometimes hangs when items are enqueued #1211

Open catkira opened 1 day ago

catkira commented 1 day ago

This is caused by iiod_responder_wait_done() in iiod-responder.c. The thread is probably not stopped correctly and therefore cannot join.

Testscenario: latest master libiio, network context to sdr.

catkira commented 1 day ago

The problem seems to be that iiod_rw_all() is not returning.

@pcercuei do you have a quick idea how to fix this?

catkira commented 1 day ago

The underlying problem seems to be that recv() is blocking https://github.com/analogdevicesinc/libiio/blob/790a6c9324c775e38ab319534d97f0e3253336ff/network.c#L106C1-L106C50

catkira commented 1 day ago

void do_cancel(struct iiod_client_pdata *io_ctx) in network-unix.c gets called, but it doesnt seem to abort the recv() call ^^

nunojsa commented 1 day ago

void do_cancel(struct iiod_client_pdata *io_ctx) in network-unix.c gets called, but it doesnt seem to abort the recv() call ^^

I guess that only matters for cancellable operations. If that flag is set to true, do_cancel() should do the trick. But if we are not getting into poll(2), then that cancel_fd should not matter.

But in that case MSG_DONTWAIT should be set and recv(2) should not block...

catkira commented 1 day ago

cancellable &= !(flags & MSG_DONTWAIT); evaluates to true in my case

catkira commented 1 day ago

it seems likewait_cancellable(io_ctx, true, timeout_ms); is blocking in my case and never returns

catkira commented 1 day ago

something with the cancel_fd is not working

nunojsa commented 1 day ago

Yeah, it seems there's some subtle bug in that logic... Because do_cancel() should do the trick

catkira commented 1 day ago

The problem is pfd[1].revents is always 0 even if cancel was called.

read...
network_recv...
cancellable = 1
wait_cancellable...
destroying...
iio_buffer_cancel...
task_stop
task stop: 0 0 id: 28940624
do cancel...
do cancel done
iio_buffer_cancel done
free_buffer...
free1
pfd[1].revents = 0
task stop: 0 1 id: 28626704
catkira commented 1 day ago

something is wrong with the cancel_fd, it has wrong value when poll() is called after cancelling the buffer.

catkira commented 1 day ago

the other cancel_fd is from the other buffer that my application has open, so thats not the problem. I have to debug a bit more...

catkira commented 20 hours ago

its a bit difficult to debug, because every iio context and every buffer has its own thread which all use functions from network.c and there is no real documentation about the internals of libiio ^^

catkira commented 19 hours ago

@nunojsa I have some more findings, it goes into a 100% cpu load endless loop here https://github.com/analogdevicesinc/libiio/blob/790a6c9324c775e38ab319534d97f0e3253336ff/iiod-responder.c#L571-L572

catkira commented 19 hours ago

in c++ this would just be a std::vector and so much easier to debug

catkira commented 19 hours ago

@pcercuei it looks like the same reader is added twice to the list which causes an endless loop

[2024-11-06 11:34:23.757] [control_center::iio_buf_reader] [warning] ---     dequeue: timed out    ---
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
thread = 1786758720, wait_cancellable done
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
x2
task stop: 0 1 id: 31090272
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, read done
thread = 1786758720, read...
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
list item 16 = 178759280
thread = 1778366016, discard reader 181190880
[2024-11-06 11:34:23.758] [control_center::iio_buf_reader] [warning] ---     dequeue: timed out    ---
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1778366016, adding reader 181190880
list item 0 = 180965088
list item 1 = 179790880
list item 2 = 180897360
list item 3 = 180966064
list item 4 = 180047712
list item 5 = 180763824
list item 6 = 179200736
list item 7 = 180901488
list item 8 = 181200016
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
list item 9 = 183120608
list item 10 = 178781568
list item 11 = 181129104
task stop: 0 1 id: 31090272
list item 12 = 180895728
list item 13 = 179365760
list item 14 = 181301104
list item 15 = 178759280
list item 16 = 179795776
list item 17 = 179795776
list item 18 = 179795776
list item 19 = 179795776
list item 20 = 179795776
list item 21 = 179795776
list item 22 = 179795776
list item 23 = 179795776
list item 24 = 179795776
list item 25 = 179795776
list item 26 = 179795776
list item 27 = 179795776
list item 28 = 179795776
list item 29 = 179795776
list item 30 = 179795776
list item 31 = 179795776
list item 32 = 179795776
list item 33 = 179795776
list item 34 = 179795776
list item 35 = 179795776
list item 36 = 179795776
list item 37 = 179795776
list item 38 = 179795776
list item 39 = 179795776
list item 40 = 179795776
list item 41 = 179795776
list item 42 = 179795776
list item 43 = 179795776
list item 44 = 179795776
list item 45 = 179795776
list item 46 = 179795776
list item 47 = 179795776
list item 48 = 179795776
list item 49 = 179795776
list item 50 = 179795776
list item 51 = 179795776
list item 52 = 179795776
list item 53 = 179795776
list item 54 = 179795776
list item 55 = 179795776
list item 56 = 179795776
list item 57 = 179795776
list item 58 = 179795776
list item 59 = 179795776
list item 60 = 179795776
list item 61 = 179795776
list item 62 = 179795776
list item 63 = 179795776
list item 64 = 179795776
list item 65 = 179795776
list item 66 = 179795776
list item 67 = 179795776
list item 68 = 179795776
list item 69 = 179795776
list item 70 = 179795776
list item 71 = 179795776
list item 72 = 179795776
list item 73 = 179795776
list item 74 = 179795776
list item 75 = 179795776
list item 76 = 179795776
list item 77 = 179795776
list item 78 = 179795776
list item 79 = 179795776
list item 80 = 179795776
list item 81 = 179795776
list item 82 = 179795776
list item 83 = 179795776
list item 84 = 179795776
list item 85 = 179795776
list item 86 = 179795776
list item 87 = 179795776
list item 88 = 179795776
list item 89 = 179795776
list item 90 = 179795776
list item 91 = 179795776
list item 92 = 179795776
list item 93 = 179795776
list item 94 = 179795776
list item 95 = 179795776
list item 96 = 179795776
list item 97 = 179795776
list item 98 = 179795776
list item 99 = 179795776
list item 100 = 179795776
    /* Add it to the readers list */
    printf("thread = %u, adding reader %d\n", pthread_self(), io);
    if (!priv->readers) {
        priv->readers = io;
    } else {
        unsigned int i = 0;
        for (tmp = priv->readers; tmp->r_next; ) {
            printf("list item %d = %d\n", i++,tmp);
            tmp = tmp->r_next;
            if (i > 100) exit(1);
        }
        tmp->r_next = io;
    }
nunojsa commented 17 hours ago

it looks like the same reader is added twice to the list which causes an endless loop

Interesting findings...

We can obviously add a protection so we error out gracefully (or just ignore the entry) if we add an entry that already exists (I guess that should not be possible) even though this is only a real problem if the repeated entries are sequential so that entry->next = entry (which pretty much means list_empty - but not in this case I guess). But more importantly would be to find out why we get the same entry added twice...

@catkira thanks for debugging this

catkira commented 17 hours ago

yes I tried this work around, but then it still hangs sometimes at other locations (probably because of the same underlying reason)

    /* Add it to the readers list */
    printf("thread = %u, adding reader %d\n", pthread_self(), io);
    bool found = false;
    if (priv->readers) {
        for (tmp = priv->readers; tmp->r_next; ) {
            if (tmp == io)
            {
                printf("reader is already in list!!\n");
                found = true;
            }
            tmp = tmp->r_next;
        }
    }
    if (!found) {
        if (!priv->readers) {
            priv->readers = io;
        } else {
            unsigned int i = 0;
            for (tmp = priv->readers; tmp->r_next; ) {
                printf("list item %d = %d\n", i++,tmp);
                tmp = tmp->r_next;
                if (i > 100) exit(1);
            }
            tmp->r_next = io;
        }
    }
catkira commented 17 hours ago

I still get endless loops at the same location even when the same item is not inserted twice

image

maybe memory corruption??

catkira commented 17 hours ago
    /* Add it to the readers list */
    printf("thread = %u, adding reader %d\n", pthread_self(), io);
    bool found = false;
    if (priv->readers) {
        for (tmp = priv->readers; tmp->r_next; ) {
            if (tmp == io)
            {
                printf("reader is already in list!!\n");
                found = true;
                exit(1);
            }
            if (tmp == tmp->r_next) {
                printf("loop detected!!!");
                exit(1);
            }
            tmp = tmp->r_next;
        }
    }
    if (!found) {
        if (!priv->readers) {
            priv->readers = io;
        } else {
            unsigned int i = 0;
            for (tmp = priv->readers; tmp->r_next; ) {
                printf("list item %d = %d\n", i++,tmp);
                tmp = tmp->r_next;
                if (i > 100) exit(1);
            }
            tmp->r_next = io;
        }
    }
thread = 2539071040, adding reader 182425776
loop detected!!!
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 3178092864, adding reader 182140576
b ts: 0 1
c ts: 0 1
x1
thread = 2597303872, poll cancel_fd 15
thread = 2597303872, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
task stop: 0 1 id: 32557408
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
terminate called without an active exception
Aborted (core dumped)

it must be memory corruption

nunojsa commented 16 hours ago

it must be memory corruption

Auch... Might very well be but let's hop it's not :). I see your loop detected!!! and looking at iiod_io_get_response_async() I wonder how that could be as we do io->r_next = NULL; and then I started to look at __iiod_io_cancel_unlocked(). Maybe it's worth putting some debug in there. It might be the place where the loop is created...

nunojsa commented 15 hours ago

BTW, I'm not sure if the code you pasted is the one you used but if it is, I think we need to check for an existing object before doing io->r_next = NULL;. Because, if the object already exists, that assignment will cut the list at that point and your loop will never find the repeated object because tmp->r_next == NULL. Right?

So __iiod_io_cancel_unlocked() should not be the problem and adding a struct iiod_io object that is already in the list cannot be permitted...

catkira commented 14 hours ago

I think io->r_next is set to NULL, because it will always be inserted at the end of the list.

catkira commented 14 hours ago

When I change the code that adds items to the reader list like this, it does not hang anymore. This printf("io->r_next = %d\n", io->r_next); printf showed, that the next item was not set to 0, but instead to itself. Manually setting it to 0 seems to fix it. But why do I have to set it to 0, it is already done before io->r_next = NULL; ??

    /* Add it to the readers list */
    printf("thread = %u, adding reader %d\n", pthread_self(), io);
    bool found = false;
    if (priv->readers) {
        for (tmp = priv->readers; tmp->r_next; ) {
            if (tmp == io)
            {
                printf("reader is already in list!!\n");
                found = true;
                exit(1);
            }
            if (tmp == tmp->r_next) {
                printf("loop detected!!!\n");
                exit(1);
            }
            tmp = tmp->r_next;
        }
    }
    if (!found) {
        if (!priv->readers) {
            priv->readers = io;
        } else {
            unsigned int i = 0;
            for (tmp = priv->readers; tmp->r_next; ) {
                printf("list item %d = %d\n", i++,tmp);
                tmp = tmp->r_next;
                if (i > 100) exit(1);
            }
            tmp->r_next = io;
            printf("io->r_next = %d\n", io->r_next);
            io->r_next = NULL;
        }
    }
catkira commented 14 hours ago

you can see my change in the PR, the second to last commit contains all the debug printfs.

pcercuei commented 14 hours ago

If you suspect memory corruption, try to compile with GCC's built-in address sanitizer. Add -fsanitize=address to the C flags, and -static-libasan to the linker flags.

If there is any use-after-free, or a buffer overflow, it will detect it.

catkira commented 13 hours ago

I tried it. It freezes without any sanitizer output.

nunojsa commented 13 hours ago

I think io->r_next is set to NULL, because it will always be inserted at the end of the list.

Sure, it marks the end of the list. My point is that if you're adding an object that already exists on the list and do io->r_next = NULL, you're cutting the list at that point right? So your code that tries to detect that the same object is being added won't really work (unless I'm missing something).

This printf("io->r_next = %d\n", io->r_next); printf showed, that the next item was not set to 0, but instead to itself.

And this really seems to suggest we're adding an object that is already on the list

Add -fsanitize=address to the C flags, and -static-libasan to the linker flags.

Likely a nice thing to add to cmake on a debug build...

pcercuei commented 12 hours ago

@pcercuei it looks like the same reader is added twice to the list which causes an endless loop

Yeah that's something that is definitely not supported (nor should it be). The real question then, is why is it added twice.

Note that there is one iiod_responder instance per buffer created, plus a main iiod_responder for things like reading attributes. When you call iio_cancel_buffer(), only the iiod_responder that corresponds to the buffer is canceled. At that point, since the underlying socket is deemed unreliable, the "free block" and "free buffer" commands go through the main iiod_responder instead.

I suspect that you are getting a race in the main "iiod_responder", maybe by calling iio_buffer_cancel() while another thread in your program is busy reading an attribute; or by calling iio_buffer_cancel() in a signal handler (that case is explicitely written as unsupported in the doc). From what I can read in the code, iiod_responder does have some multi-thread protection in place, but it does not protect against two threads issueing two requests on the same iiod_io.

catkira commented 12 hours ago

yes I also think that it is some kind of race condition, because I only encounter this when I run the code on a fast CPU.

catkira commented 11 hours ago

I think io->r_next is set to NULL, because it will always be inserted at the end of the list.

Sure, it marks the end of the list. My point is that if you're adding an object that already exists on the list and do io->r_next = NULL, you're cutting the list at that point right? So your code that tries to detect that the same object is being added won't really work (unless I'm missing something).

This printf("io->r_next = %d\n", io->r_next); printf showed, that the next item was not set to 0, but instead to itself.

And this really seems to suggest we're adding an object that is already on the list

ah yes, now I understand what you mean, had to read it twice and think a bit :)

catkira commented 11 hours ago

I suspect that you are getting a race in the main "iiod_responder", maybe by calling iio_buffer_cancel() while another thread in your program is busy reading an attribute; or by calling iio_buffer_cancel() in a signal handler (that case is explicitely written as unsupported in the doc). From what I can read in the code, iiod_responder does have some multi-thread protection in place, but it does not protect against two threads issueing two requests on the same iiod_io.

I am not calling any iio stuff from signal handlers.