NordicPlayground / nRF51-ble-bcast-mesh


APP_ERROR_CHECK(NRF_ERROR_NO_MEM) in mesh_packet_acquire #79

Closed victorpasse closed 8 years ago

victorpasse commented 8 years ago

Call stack:

    Debug ble app [GDB Hardware Debugging]
    GDB Hardware Debugger (2016-03-23 16:24) (Suspended)
    Thread [1] (Suspended: Signal 'SIGTRAP' received. Description: Trace/breakpoint trap.)
    8 app_error_handler()    main.c:202              0x0001948a
    7 mesh_packet_acquire()  mesh_packet.c:73        0x00023bb2
    6 order_search()         transport_control.c:93  0x00024d36
    5 radio_idle_callback()  transport_control.c:191 0x00024d66
    4 radio_event_handler()  radio_control.c:337     0x0002482e
    3 radio_signal_callback() timeslot_handler.c:307 0x00024b9c
    2 0x00015e52
    1 0x00015e52

What can I do about this? g_packet_refs is filled with 1 in every entry. This happens after several thousand random sends and receives. I would assume that some code path fails to release a packet.

Cache sizes:

    /** @brief Default value for the number of handle cache entries */
    #ifndef RBC_MESH_HANDLE_CACHE_ENTRIES
    #define RBC_MESH_HANDLE_CACHE_ENTRIES           (256)
    #endif

    /** @brief Default value for the number of data cache entries */
    #ifndef RBC_MESH_DATA_CACHE_ENTRIES
    #define RBC_MESH_DATA_CACHE_ENTRIES             (40)
    #endif

    /** @brief Length of app-event FIFO. Must be power of two. */
    #ifndef RBC_MESH_APP_EVENT_QUEUE_LENGTH
    #define RBC_MESH_APP_EVENT_QUEUE_LENGTH         (20)
    #endif

    /** @brief Length of low level radio event FIFO. Must be power of two. */
    #ifndef RBC_MESH_RADIO_QUEUE_LENGTH
    #define RBC_MESH_RADIO_QUEUE_LENGTH             (16)
    #endif

    /** @brief Length of internal async-event FIFO. Must be power of two. */
    #ifndef RBC_MESH_INTERNAL_EVENT_QUEUE_LENGTH
    #define RBC_MESH_INTERNAL_EVENT_QUEUE_LENGTH    (16)
    #endif

trond-snekvik commented 8 years ago

Hi, are you able to check the payloads of the packets, and the head and tail of each of the queues sized by these parameters? Are they all full? Is the data cache full?

victorpasse commented 8 years ago

    "m_task_fifo" {...}
    elem_array 0x20006c0c
    elem_size 0x8
    array_len 0x20
    head 0x29fa
    tail 0x29dd
    memcpy_fptr 0x00000000

    "g_async_evt_fifo_ts" {...}
    elem_array 0x20004840
    elem_size 0x14
    array_len 0x20
    head 0x21a4a8
    tail 0x21a4a7
    memcpy_fptr 0x00000000

    "g_async_evt_fifo" {...}
    elem_array 0x200045c0
    elem_size 0x14
    array_len 0x20
    head 0x224ba7
    tail 0x224b91
    memcpy_fptr 0x00000000

g_packet_refs is filled with 1 in every element, except a few that have 2 references.

    "radio_fifo" {...}
    elem_array 0x20005c30
    elem_size 0xc
    array_len 0x10
    head 0x38161d
    tail 0x38161c
    memcpy_fptr 0x00000000


m_handle_cache is included here: http://pastebin.com/UCpybPUa

    "g_rbc_event_fifo" {...}
    elem_array 0x20005a24
    elem_size 0x18
    array_len 0x10
    head 0xa6645
    tail 0xa663f
    memcpy_fptr 0x00000000

The last entry in g_packet_pool is this; note that payload[3] is not 0xFE:

    g_packet_pool[94] {...}
    header {...}
    type 0x2
    _rfu1 0x0
    addr_type 0x1
    _rfu2 0x0
    length 0x17
    _rfu3 0x1
    addr 0x200059d7
    addr[0] 0x4
    addr[1] 0x81
    addr[2] 0xa
    addr[3] 0x4f
    addr[4] 0x3
    addr[5] 0xdf
    payload 0x200059dd
    payload[0] 0x10
    payload[1] 0x16
    payload[2] 0xe4
    payload[3] 0xfc
    payload[4] 0x47
    payload[5] 0x0
    payload[6] 0xc
    payload[7] 0x27
    payload[8] 0xf6
    payload[9] 0x2f
    payload[10] 0x71
    payload[11] 0xdc
    payload[12] 0x2a
    payload[13] 0x4a
    payload[14] 0x96
    payload[15] 0xf9
    payload[16] 0x61
    payload[17] through payload[30] all 0x0

Is there any more data that you would like to see? There also seems to be data at indices I never use being transmitted on the mesh. I am getting the feeling of a buffer overflow that corrupts data somewhere.

trond-snekvik commented 8 years ago

Hi, sorry about the poor response time, Easter is a public holiday here in Norway.

From my calculations, the number of packets referenced in your memory from queues and buffers should be

cache tasks queue: 29
async evt fifo TS:  1
async evt fifo:    22
radio tasks:        1
app-evt:            6
data-cache:        40
---------------------
=                  99

You may also have a couple of packets floating around in the lower priority contexts APP_LOW (async event) and main, but it shouldn't be more than two or three. How big is your packet pool?

Regardless of the size of the packet pool, that final packet seems fishy, good catch. Is the rest of the packet OK? The _rfu3-field in the header is 1, which seems odd as well.

We've been seeing behavior similar to what you're describing when we subjected the mesh to heavy load, but according to our numbers, we were approaching false-positive territory for packet CRCs, with something like 27 broken packets having correct CRCs out of 32 million. This was running the ping pong example with 30 devices all within radio proximity of each other. Since this wasn't an unlikely number (statistically) for the 24-bit CRCs, we didn't spend a lot of time on it. That might have been a mistake, though.

Have you done any instrumentation of the number of packets you're sending per second that you're willing to share?

victorpasse commented 8 years ago

I am sending about 100 packets per second. After about 10 hours we see this problem, so a few million packets. After some time (about 1 hour) there are also a lot of packets on air with handles that I have never sent and with data lengths that I never use. This implies that the data gets overwritten by a buffer overflow or something similar. What will happen if a corrupt packet with a correct CRC is received? I have an additional 16-bit CRC in all payloads that must check out before I process the packet in the main context.

victorpasse commented 8 years ago

I changed the first "if" check in tc_packet_handler to the following

    mesh_adv_data_t* p_mesh_adv_data = mesh_packet_adv_data_get(p_packet);

    if ((p_packet->header.length > BLE_GAP_ADDR_LEN + BLE_ADV_PACKET_PAYLOAD_MAX_LENGTH) ||
        (p_mesh_adv_data->adv_data_length + MESH_PACKET_ADV_OVERHEAD != p_packet->header.length))

Before this, I got a lot of packets with p_mesh_adv_data->adv_data_length being values like 78 or 56 and other high numbers that should not occur. I don't know how those values get sent, but I should not repeat them.

victorpasse commented 8 years ago

Making the above change has made the mesh more solid; after a couple of hours, the previous symptoms are gone. I will do more tests. For completeness I need to see whether any node transmits these bogus updates, or whether they are the result of data corruption/CRC faults in the RX section of the code.

victorpasse commented 8 years ago

Still no crash after 20 hours of testing.

olsky commented 8 years ago

I have observed the same in setups of 10 to 20 nodes. Now, after your suggested changes, I don't see frequent crashes, but there is still a lot of "chatty traffic", and some corrupted packets.

I should probably add an extra CRC as you did, although it adds overhead.

trond-snekvik commented 8 years ago

Hi there,

I spent yesterday attempting to find what triggers the CRC issue, and came to some conclusions:

An overnight test was run on the modified code, with 24 devices. By the end, each device had received approx. 26 million packets. Out of those, approx. 7.5 million packets had CRC errors.

The devices collectively received 624 million packets, with 180 million correctly detected CRC errors. Only a single false-positive CRC occurred in total. The error happened independently of the NVIC call, and on a packet with reported length = 62 bytes. Such packets should be checked for in addition anyway, as you're doing, @victorpasse.

This seems within the bounds of statistical probability to me, and I think it's safe to say the fix works. HOWEVER: the end event is checked for on the forced run as well (with the same result in both runs), and forcing two runs shouldn't really make a difference. I spent some time walking through the code again, and it turns out the real reason is this:

Changing lines 284 to 286 from

    bool crc_status = NRF_RADIO->CRCSTATUS;
    uint32_t crc = NRF_RADIO->RXCRC;
    bool end_event = NRF_RADIO->EVENTS_END;

to

    bool end_event = NRF_RADIO->EVENTS_END;
    bool crc_status = NRF_RADIO->CRCSTATUS;
    uint32_t crc = NRF_RADIO->RXCRC;

removes all symptoms. Try applying this fix to your code; I found that it removed just about all cases of erroneous packets slipping through. I'll poke around some more to make sure it doesn't cause any new issues.

victorpasse commented 8 years ago

Thanks for your good work! I have implemented your fix and will test it the coming days.

daviddedwin commented 8 years ago

Fixed in 0.8.3