Open crest42 opened 6 years ago
My check function in sock_udp_recv:
memcpy(data, pkt->data, pkt->size);
res = (int)pkt->size;
if (pkt->size == 48 && memcmp(data, pkt->data, pkt->size) != 0)
{
printf("%p-%p: ",data,(char *)data+pkt->size);
for (int i = 0; i < 48; i++)
{
printf("%02x",((char *)data)[i]);
}
printf("\n");
printf("%p-%p: ",pkt->data,(char *)pkt->data+pkt->size);
for (int i = 0; i < 48; i++)
{
printf("%02x",((char *)pkt->data)[i]);
}
printf("\n");
unsigned char *d = ((unsigned char *)data);
uint32_t tmp;
d = d + (3*sizeof(uint32_t));
memcpy(&tmp, d, sizeof(uint32_t));
uint32_t len = ntohl(tmp);
assert(len < 1000);
}
In the case of an error the memcmp returns != 0 and the printed buffer are different. I try to reproduce and send the corresponding output.
Was able to reproduce this: Checking the headers in grnc_sock.c:
printf("test\n");
if (pkt->size == 48)
{
unsigned char *data = ((unsigned char *)pkt->data);
uint32_t tmp;
memcpy(&tmp,data,sizeof(uint32_t));
uint32_t type = ntohl(tmp);
data = data + sizeof(uint32_t);
memcpy(&tmp, data, sizeof(uint32_t));
uint32_t from = ntohl(tmp);
data = data + sizeof(uint32_t);
memcpy(&tmp, data, sizeof(uint32_t));
uint32_t to = ntohl(tmp);
data = data + sizeof(uint32_t);
memcpy(&tmp, data, sizeof(uint32_t));
uint32_t len = ntohl(tmp);
printf("%d %d %d %d\n",type,from,to,len);
}
Returns:
test
11 65535 1023 32
Which is correct.
Running the memcmp test as described in my last comment returns:
copied 48 bytes
0x565a8bac-0x565a8bdc: 0000000b 1140ffff fffeffff ff800000 000000000000fffffffeffffff8000000000000000000000ffffffdeffffffadffffffbeffffffef0000000000000000ffffffdeffffffadffffffbeffffffefffffffa03e6756
0x565b50c0-0x565b50f0: 0000000b 0000ffff ffffffff ffff0000 03ffffffff00000020ffffffffffffffff00000004000000000000fffffffeffffff8000000000000000000000ffffffdeffffffadffffffbeffffffefffffffa03e6756
As you can clearly see the first 4 Byte 0x0000000b are the expected message type 11. The 4th to the 8th byte are wrong in the return buffer. As you can see in the packet the buffer is 0x0000ffff which is the expected node id 65535. The corresponding 4 byte in the return buffer are 0x1140ffff.
Starting at the 3th byte things do not look correct neither in the gnrc buffer and the return buffer. I will take a closer look and write an update on this
0x565e9bac-0x565e9bdc: 0000000b 1140fe80 00000000 0000c0c6 ff6f0000 00040000 00000000 fe800000 00000000 c0c637ff fe6ef8bc a0ce6656
0x565f60c0-0x565f60f0: 0000000b 00006fff 000077ff 00000020 ff6f0000 00040000 00000000 fe800000 00000000 c0c637ff fe6ef8bc a0ce6656
Now its easy to see that the 2nd to the 4th 4 bytes are somehow screwed. Rest looks fine. It looks like the wrong part is just part of the actuall payload of my protocol:
fe80 00000000 0000c0c6 is the beginning of an link local address is transferred in the payload: fe800000 00000000 c0c63
My best guess is that some kind of context switch happens inside the memcpy or some kind of weird alignment errors. I appreciate any help :)
@crest42 Can you share the application code?
Also, please rule out stack size issues, either with ps() or by increasing your application stack to saomething very large (8k).
Sure but application code is rather huge. One of the next things i will try is to minimize the application code to a few primitives to limit the problem to a specific part of the code.
Application code: https://github.com/crest42/RIOT ./chord_test. You need to clone it with --recurse-submodules. Just make a debug commit and the expected output when the fault happens is:
1542240127: [26623<-32768->53247] [INFO] thread_wait_for_msg: wait for message run
7495
11 26623 32768 32
test
11 26623 32768 32
copied 48 bytes
0x5660dbac-0x5660dbdc: 0000000b 1140fe80 00000000 0000c0c6 ff670000 00040000 00000000 fe800000 00000000 c0c637ff fe6ef8bc a0ce6756
0x5661a0c0-0x5661a0f0: 0000000b 000067ff 00008000 00000020 ff670000 00040000 00000000 fe800000 00000000 c0c637ff fe6ef8bc a0ce6756
sys/net/gnrc/sock/udp/gnrc_sock_udp.c:251 => 0x565c4bdd
*** RIOT kernel panic:
FAILED ASSERTION.
pid | name | state Q | pri | stack ( used) | base addr | current
- | isr_stack | - - | - | 8192 ( -1) | 0x566137c0 | 0x566137c0
1 | idle | pending Q | 15 | 8192 ( 584) | 0x566115e0 | 0x56613444
2 | main | pending Q | 7 | 12288 ( 5580) | 0x5660e5e0 | 0x56611444
3 | ipv6 | bl rx _ | 3 | 8192 ( 1980) | 0x56620420 | 0x56622284
4 | udp | bl rx _ | 4 | 8192 ( 1344) | 0x5661b800 | 0x5661d664
5 | gnrc_netdev_tap | bl rx _ | 2 | 8192 ( 3016) | 0x5661de40 | 0x5661fca4
6 | CHORD msg | running Q | 5 | 16384 ( 5128) | 0x5660a380 | 0x5660e1e4
7 | CHORD Periodic | bl mutex _ | 6 | 16384 ( 4780) | 0x56606380 | 0x5660a1e4
| SUM | | | 86016 (22412)
*** halted.
You can start a new network with "chord new" and join additional node with "chord join". The problem start with about 16-18 nodes.
A short introduction: I try to build a distributed storage using a DHT (Chord) using a mtd device driver in RIOT. The Chord lib is using two threads. One Thread is an simple event loop to manage incomming UDP messages and the other is a periodic loop which implements the Chord state keeping protocol. The fault happens in wait_for_message thread which does the incomming message handling. The other thread mostly crafts UDP messages and wait for an answer. I already refactored the network code a little bit to use htons and friends but that should not make any difference, since i run it on the same machine anyway. (In fact the problem occures with both methods). Also i tried to null the read buffer on every run of wait_for_message in network.c (The buffer which gets passed to the gnrc_udp_recv function).
The stack sizes should be no problem as i already make them rather huge. Here is a example ps output with 18 nodes running. I tried to use static memory in the DHT implementation so it should be constant:
pid | name | state Q | pri | stack ( used) | base addr | current
- | isr_stack | - - | - | 8192 ( -1) | 0x5660c7c0 | 0x5660c7c0
1 | idle | pending Q | 15 | 8192 ( 584) | 0x5660a5e0 | 0x5660c444
2 | main | running Q | 7 | 12288 ( 3180) | 0x566075e0 | 0x5660a444
3 | ipv6 | bl rx _ | 3 | 8192 ( 1900) | 0x56619420 | 0x5661b284
4 | udp | bl rx _ | 4 | 8192 ( 1344) | 0x56614800 | 0x56616664
5 | gnrc_netdev_tap | bl rx _ | 2 | 8192 ( 2812) | 0x56616e40 | 0x56618ca4
6 | CHORD msg | bl mbox _ | 5 | 16384 ( 4944) | 0x56603380 | 0x566071e4
7 | CHORD Periodic | bl mutex _ | 6 | 16384 ( 4940) | 0x565ff380 | 0x566031e4
| SUM | | | 86016 (19704)
Three additional remarks:
Regards, Robin
I just disabled interrupts before and enabled them after the copy. Now the problem does not occur in a 5 minute run with 18 nodes which is way longer than without the "fix". I am not sure if this really has an direct effect since printing just before the copy also make the problem disappear but it could be a hint in the right direction.
I made additional observations while debugging the problem:
In 100% of all failure cases "irq called" was printed (>15 atm). This also happens on non-failure cases but on a much rare occassion.
I will take a further look next week.
Atm i have a similar problem in this function:
int
copy_node(struct node* node, struct node* copy)
{
assert(node);
assert(copy);
memcpy(copy, node, sizeof(struct node));
assert(copy->id == node->id);
if(memcmp(&node->addr,©->addr,sizeof(node->addr)) != 0) {
printf("%p-%p -> %p-%p %d\n",(void*)node,
(void *)((char *)node+sizeof(struct node)),
(void*)copy,
(void *)((char *)copy+sizeof(struct node)),
sizeof(struct node));
assert(false);
}
return 0;
}
struct node
{
nodeid_t id; /*!< Id of the node. The node id is the hashed ipv6 address of
the node modulo the ring size */
uint32_t size;
uint32_t used;
struct in6_addr addr;
struct node_additional* additional; /*!< Pointer to our successor node. */
};
Which sometimes fails with:
> 0x566953e8-0x56695408 -> 0x5669cf7c-0x5669cf9c 32
chord_test/chord/src/chord.c:650 => 0x5665b30d
*** RIOT kernel panic:
FAILED ASSERTION.
pid | name | state Q | pri | stack ( used) | base addr | current
- | isr_stack | - - | - | 8192 ( -1) | 0x5671fe60 | 0x5671fe60
1 | idle | pending Q | 15 | 8192 ( 584) | 0x5671dc80 | 0x5671fae4
2 | main | pending Q | 7 | 524288 ( 4224) | 0x5669dc80 | 0x5671dae4
3 | ipv6 | bl rx _ | 3 | 8192 ( 2316) | 0x5672c540 | 0x5672e3a4
4 | udp | bl rx _ | 4 | 8192 ( 1344) | 0x56727ea0 | 0x56729d04
5 | gnrc_netdev_tap | bl rx _ | 2 | 8192 ( 2380) | 0x5672a4e0 | 0x5672c344
6 | CHORD msg | running Q | 5 | 16384 ( 5100) | 0x56699a20 | 0x5669d884
7 | CHORD Periodic | bl mbox _ | 6 | 16384 ( 3456) | 0x56695a20 | 0x56699884
| SUM | | | 598016 (19404)
*** halted.
native: exiting
So basically the same as in the network stack happens, right after the memcpy a memcmp of the values fails. From the output you can see that neither the addresses provided to memcpy are invalid nor the memory region overlaps. So right now i am not sure anymore if the problem i reported first is even related to gnrc, but i can't see a way how i messed this up in my application code.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.
Description
When using gnrc_sock_udp in my application after a certain amount of hosts in the network (and thus a certain amount of packets/s on a single host), the buffer content returned by sock_udp_recv is filled with garbage. My message format is defined to be 4 bytes + 4 bytes + 4 bytes + 4 bytes (message type, app src id, app dst id, length of the remainder data).
As an example: A message of type 11 always sends a static sized buffer with size 32 and thus the message "header" would look like this: 0x0000000b 0x00000001 0x00000002 0x00000020. (Second and third 4 bytes can differ)
Sometimes if many hosts are joining the network it happens very seldom that a single host instead read something like: 0x0000000b 0xfe80aaf0 0x00000000 0x00afbb20
Two intresting observations here:
The first 4 bytes always are correct. The second 4 bytes always contains fe80something (Maybe i am just paranoid but this looks like the beginning of a link local ipv6 address)
Further debugging:
Steps to reproduce the issue
Reading content from a udp socket and sending a lots (TBD) of messages. Testing the read buffer for correctness. Sadly no better way yet than try and wait.
Expected results
Content of buffer is correct
Actual results
Buffer content is garbage
Versions
Operating System Environment
Installed compiler toolchains
riscv-none-embed-gcc: missing clang: clang version 7.0.0 (tags/RELEASE_700/final)
Installed compiler libs
arm-none-eabi-newlib: "3.0.0" mips-mti-elf-newlib: missing riscv-none-embed-newlib: missing avr-libc: "2.0.0" ("20150208")
Installed development tools