Open catkira opened 1 day ago
The problem seems to be that iiod_rw_all() is not returning.
@pcercuei do you have a quick idea how to fix this?
The underlying problem seems to be that recv() is blocking https://github.com/analogdevicesinc/libiio/blob/790a6c9324c775e38ab319534d97f0e3253336ff/network.c#L106C1-L106C50
void do_cancel(struct iiod_client_pdata *io_ctx)
in network-unix.c gets called, but it doesnt seem to abort the recv() call ^^
void do_cancel(struct iiod_client_pdata *io_ctx) in network-unix.c gets called, but it doesnt seem to abort the recv() call ^^
I guess that only matters for cancellable operations. If that flag is set to true, do_cancel() should do the trick. But if we are not getting into poll(2), then that cancel_fd should not matter.
But in that case MSG_DONTWAIT should be set and recv(2) should not block...
cancellable &= !(flags & MSG_DONTWAIT);
evaluates to true in my case
it seems likewait_cancellable(io_ctx, true, timeout_ms);
is blocking in my case and never returns
something with the cancel_fd is not working
Yeah, it seems there's some subtle bug in that logic... Because do_cancel()
should do the trick
The problem is pfd[1].revents
is always 0 even if cancel was called.
read...
network_recv...
cancellable = 1
wait_cancellable...
destroying...
iio_buffer_cancel...
task_stop
task stop: 0 0 id: 28940624
do cancel...
do cancel done
iio_buffer_cancel done
free_buffer...
free1
pfd[1].revents = 0
task stop: 0 1 id: 28626704
something is wrong with the cancel_fd, it has wrong value when poll() is called after cancelling the buffer.
the other cancel_fd is from the other buffer that my application has open, so thats not the problem. I have to debug a bit more...
its a bit difficult to debug, because every iio context and every buffer has its own thread which all use functions from network.c and there is no real documentation about the internals of libiio ^^
@nunojsa I have some more findings, it goes into a 100% cpu load endless loop here https://github.com/analogdevicesinc/libiio/blob/790a6c9324c775e38ab319534d97f0e3253336ff/iiod-responder.c#L571-L572
in c++ this would just be a std::vector and so much easier to debug
@pcercuei it looks like the same reader is added twice to the list which causes an endless loop
[2024-11-06 11:34:23.757] [control_center::iio_buf_reader] [warning] --- dequeue: timed out ---
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
thread = 1786758720, wait_cancellable done
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
x2
task stop: 0 1 id: 31090272
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, read done
thread = 1786758720, read...
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
list item 16 = 178759280
thread = 1778366016, discard reader 181190880
[2024-11-06 11:34:23.758] [control_center::iio_buf_reader] [warning] --- dequeue: timed out ---
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1778366016, adding reader 181190880
list item 0 = 180965088
list item 1 = 179790880
list item 2 = 180897360
list item 3 = 180966064
list item 4 = 180047712
list item 5 = 180763824
list item 6 = 179200736
list item 7 = 180901488
list item 8 = 181200016
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
list item 9 = 183120608
list item 10 = 178781568
list item 11 = 181129104
task stop: 0 1 id: 31090272
list item 12 = 180895728
list item 13 = 179365760
list item 14 = 181301104
list item 15 = 178759280
list item 16 = 179795776
list item 17 = 179795776
list item 18 = 179795776
list item 19 = 179795776
list item 20 = 179795776
list item 21 = 179795776
list item 22 = 179795776
list item 23 = 179795776
list item 24 = 179795776
list item 25 = 179795776
list item 26 = 179795776
list item 27 = 179795776
list item 28 = 179795776
list item 29 = 179795776
list item 30 = 179795776
list item 31 = 179795776
list item 32 = 179795776
list item 33 = 179795776
list item 34 = 179795776
list item 35 = 179795776
list item 36 = 179795776
list item 37 = 179795776
list item 38 = 179795776
list item 39 = 179795776
list item 40 = 179795776
list item 41 = 179795776
list item 42 = 179795776
list item 43 = 179795776
list item 44 = 179795776
list item 45 = 179795776
list item 46 = 179795776
list item 47 = 179795776
list item 48 = 179795776
list item 49 = 179795776
list item 50 = 179795776
list item 51 = 179795776
list item 52 = 179795776
list item 53 = 179795776
list item 54 = 179795776
list item 55 = 179795776
list item 56 = 179795776
list item 57 = 179795776
list item 58 = 179795776
list item 59 = 179795776
list item 60 = 179795776
list item 61 = 179795776
list item 62 = 179795776
list item 63 = 179795776
list item 64 = 179795776
list item 65 = 179795776
list item 66 = 179795776
list item 67 = 179795776
list item 68 = 179795776
list item 69 = 179795776
list item 70 = 179795776
list item 71 = 179795776
list item 72 = 179795776
list item 73 = 179795776
list item 74 = 179795776
list item 75 = 179795776
list item 76 = 179795776
list item 77 = 179795776
list item 78 = 179795776
list item 79 = 179795776
list item 80 = 179795776
list item 81 = 179795776
list item 82 = 179795776
list item 83 = 179795776
list item 84 = 179795776
list item 85 = 179795776
list item 86 = 179795776
list item 87 = 179795776
list item 88 = 179795776
list item 89 = 179795776
list item 90 = 179795776
list item 91 = 179795776
list item 92 = 179795776
list item 93 = 179795776
list item 94 = 179795776
list item 95 = 179795776
list item 96 = 179795776
list item 97 = 179795776
list item 98 = 179795776
list item 99 = 179795776
list item 100 = 179795776
/* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
}
it looks like the same reader is added twice to the list which causes an endless loop
Interesting findings...
We can obviously add a protection so we error out gracefully (or just ignore the entry) if we add an entry that already exists (I guess that should not be possible) even though this is only a real problem if the repeated entries are sequential so that entry->next = entry (which pretty much means list_empty - but not in this case I guess). But more importantly would be to find out why we get the same entry added twice...
@catkira thanks for debugging this
yes I tried this work around, but then it still hangs sometimes at other locations (probably because of the same underlying reason)
/* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
bool found = false;
if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; ) {
if (tmp == io)
{
printf("reader is already in list!!\n");
found = true;
}
tmp = tmp->r_next;
}
}
if (!found) {
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
}
}
I still get endless loops at the same location even when the same item is not inserted twice
maybe memory corruption??
/* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
bool found = false;
if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; ) {
if (tmp == io)
{
printf("reader is already in list!!\n");
found = true;
exit(1);
}
if (tmp == tmp->r_next) {
printf("loop detected!!!");
exit(1);
}
tmp = tmp->r_next;
}
}
if (!found) {
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
}
}
thread = 2539071040, adding reader 182425776
loop detected!!!
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 3178092864, adding reader 182140576
b ts: 0 1
c ts: 0 1
x1
thread = 2597303872, poll cancel_fd 15
thread = 2597303872, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
task stop: 0 1 id: 32557408
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
terminate called without an active exception
Aborted (core dumped)
it must be memory corruption
it must be memory corruption
Auch... Might very well be but let's hop it's not :). I see your loop detected!!!
and looking at iiod_io_get_response_async()
I wonder how that could be as we do io->r_next = NULL;
and then I started to look at __iiod_io_cancel_unlocked()
. Maybe it's worth putting some debug in there. It might be the place where the loop is created...
BTW, I'm not sure if the code you pasted is the one you used but if it is, I think we need to check for an existing object before doing io->r_next = NULL;
. Because, if the object already exists, that assignment will cut the list at that point and your loop will never find the repeated object because tmp->r_next == NULL
. Right?
So __iiod_io_cancel_unlocked()
should not be the problem and adding a struct iiod_io
object that is already in the list cannot be permitted...
I think io->r_next is set to NULL, because it will always be inserted at the end of the list.
When I change the code that adds items to the reader list like this, it does not hang anymore.
This printf("io->r_next = %d\n", io->r_next);
printf showed, that the next item was not set to 0, but instead to itself. Manually setting it to 0 seems to fix it. But why do I have to set it to 0, it is already done before io->r_next = NULL;
??
/* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
bool found = false;
if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; ) {
if (tmp == io)
{
printf("reader is already in list!!\n");
found = true;
exit(1);
}
if (tmp == tmp->r_next) {
printf("loop detected!!!\n");
exit(1);
}
tmp = tmp->r_next;
}
}
if (!found) {
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
printf("io->r_next = %d\n", io->r_next);
io->r_next = NULL;
}
}
you can see my change in the PR, the second to last commit contains all the debug printfs.
If you suspect memory corruption, try to compile with GCC's built-in address sanitizer.
Add -fsanitize=address
to the C flags, and -static-libasan
to the linker flags.
If there is any use-after-free, or a buffer overflow, it will detect it.
I tried it. It freezes without any sanitizer output.
I think io->r_next is set to NULL, because it will always be inserted at the end of the list.
Sure, it marks the end of the list. My point is that if you're adding an object that already exists on the list and do io->r_next = NULL
, you're cutting the list at that point right? So your code that tries to detect that the same object is being added won't really work (unless I'm missing something).
This printf("io->r_next = %d\n", io->r_next); printf showed, that the next item was not set to 0, but instead to itself.
And this really seems to suggest we're adding an object that is already on the list
Add -fsanitize=address to the C flags, and -static-libasan to the linker flags.
Likely a nice thing to add to cmake on a debug build...
@pcercuei it looks like the same reader is added twice to the list which causes an endless loop
Yeah that's something that is definitely not supported (nor should it be). The real question then, is why is it added twice.
Note that there is one iiod_responder
instance per buffer created, plus a main iiod_responder
for things like reading attributes.
When you call iio_cancel_buffer()
, only the iiod_responder that corresponds to the buffer is canceled. At that point, since the underlying socket is deemed unreliable, the "free block" and "free buffer" commands go through the main iiod_responder
instead.
I suspect that you are getting a race in the main "iiod_responder", maybe by calling iio_buffer_cancel() while another thread in your program is busy reading an attribute; or by calling iio_buffer_cancel() in a signal handler (that case is explicitely written as unsupported in the doc). From what I can read in the code, iiod_responder does have some multi-thread protection in place, but it does not protect against two threads issueing two requests on the same iiod_io.
yes I also think that it is some kind of race condition, because I only encounter this when I run the code on a fast CPU.
I think io->r_next is set to NULL, because it will always be inserted at the end of the list.
Sure, it marks the end of the list. My point is that if you're adding an object that already exists on the list and do
io->r_next = NULL
, you're cutting the list at that point right? So your code that tries to detect that the same object is being added won't really work (unless I'm missing something).This printf("io->r_next = %d\n", io->r_next); printf showed, that the next item was not set to 0, but instead to itself.
And this really seems to suggest we're adding an object that is already on the list
ah yes, now I understand what you mean, had to read it twice and think a bit :)
I suspect that you are getting a race in the main "iiod_responder", maybe by calling iio_buffer_cancel() while another thread in your program is busy reading an attribute; or by calling iio_buffer_cancel() in a signal handler (that case is explicitely written as unsupported in the doc). From what I can read in the code, iiod_responder does have some multi-thread protection in place, but it does not protect against two threads issueing two requests on the same iiod_io.
I am not calling any iio stuff from signal handlers.
This is caused by iiod_responder_wait_done() in iiod-responder.c. The thread is probably not stopped correctly and therefore cannot join.
Testscenario: latest master libiio, network context to sdr.