NordicPlayground / nRF51-ble-bcast-mesh

Other
323 stars 121 forks source link

Mesh TX stalling due to data cache issue #156

Open twvd opened 7 years ago

twvd commented 7 years ago

Hello,

While stress testing the mesh I've come across a problem where at some point (20 - 30 seconds) mesh nodes would start dropping out and no longer transmitting any data. My data cache size and handle cache size defines are both 8 at the time.

At the time the mesh is stalled, the call to data_entry_allocate() in handle_storage_info_set() was returning DATA_CACHE_ENTRY_INVALID, causing handle_storage_info_set() to return NRF_ERROR_NO_MEM (line 392).

At the time this problem was occuring, my caches looked like this:

m_handle_cache  <array> 0x200036C8  handle_entry_t[8]   
    [0] <struct>    0x200036C8  handle_entry_t  
        handle  23261   0x200036C8  uint16_t    
        version 0   0x200036CA  uint16_t    
        index_next  1   0x200036CC  uint16_t    
        tx_event    0   0x200036CC  uint16_t    
        index_prev  7   0x200036CE  uint16_t    
        persistent  0   0x200036CE  uint16_t    
        data_entry  8   0x200036D0  uint16_t    
    [1] <struct>    0x200036D2  handle_entry_t  
        handle  23393   0x200036D2  uint16_t    
        version 0   0x200036D4  uint16_t    
        index_next  2   0x200036D6  uint16_t    
        tx_event    0   0x200036D6  uint16_t    
        index_prev  0   0x200036D8  uint16_t    
        persistent  0   0x200036D8  uint16_t    
        data_entry  8   0x200036DA  uint16_t    
    [2] <struct>    0x200036DC  handle_entry_t  
        handle  23921   0x200036DC  uint16_t    
        version 0   0x200036DE  uint16_t    
        index_next  3   0x200036E0  uint16_t    
        tx_event    0   0x200036E0  uint16_t    
        index_prev  1   0x200036E2  uint16_t    
        persistent  0   0x200036E2  uint16_t    
        data_entry  8   0x200036E4  uint16_t    
    [3] <struct>    0x200036E6  handle_entry_t  
        handle  10470   0x200036E6  uint16_t    
        version 0   0x200036E8  uint16_t    
        index_next  4   0x200036EA  uint16_t    
        tx_event    0   0x200036EA  uint16_t    
        index_prev  2   0x200036EC  uint16_t    
        persistent  0   0x200036EC  uint16_t    
        data_entry  8   0x200036EE  uint16_t    
    [4] <struct>    0x200036F0  handle_entry_t  
        handle  3   0x200036F0  uint16_t    
        version 0   0x200036F2  uint16_t    
        index_next  5   0x200036F4  uint16_t    
        tx_event    0   0x200036F4  uint16_t    
        index_prev  3   0x200036F6  uint16_t    
        persistent  0   0x200036F6  uint16_t    
        data_entry  8   0x200036F8  uint16_t    
    [5] <struct>    0x200036FA  handle_entry_t  
        handle  23789   0x200036FA  uint16_t    
        version 0   0x200036FC  uint16_t    
        index_next  8   0x200036FE  uint16_t    
        tx_event    0   0x200036FE  uint16_t    
        index_prev  4   0x20003700  uint16_t    
        persistent  0   0x20003700  uint16_t    
        data_entry  8   0x20003702  uint16_t    
    [6] <struct>    0x20003704  handle_entry_t  
        handle  2   0x20003704  uint16_t    
        version 0   0x20003706  uint16_t    
        index_next  7   0x20003708  uint16_t    
        tx_event    0   0x20003708  uint16_t    
        index_prev  8   0x2000370A  uint16_t    
        persistent  0   0x2000370A  uint16_t    
        data_entry  8   0x2000370C  uint16_t    
    [7] <struct>    0x2000370E  handle_entry_t  
        handle  23830   0x2000370E  uint16_t    
        version 0   0x20003710  uint16_t    
        index_next  0   0x20003712  uint16_t    
        tx_event    0   0x20003712  uint16_t    
        index_prev  6   0x20003714  uint16_t    
        persistent  0   0x20003714  uint16_t    
        data_entry  8   0x20003716  uint16_t    
m_data_cache    <array> 0x200034CC  data_entry_t[8] 
    [0] <struct>    0x200034CC  data_entry_t    
        trickle <struct>    0x200034CC  trickle_t   
        p_packet    0x00000000  0x200034DC  mesh_packet_t * 
    [1] <struct>    0x200034E0  data_entry_t    
        trickle <struct>    0x200034E0  trickle_t   
        p_packet    0x00000000  0x200034F0  mesh_packet_t * 
    [2] <struct>    0x200034F4  data_entry_t    
        trickle <struct>    0x200034F4  trickle_t   
        p_packet    0x00000000  0x20003504  mesh_packet_t * 
    [3] <struct>    0x20003508  data_entry_t    
        trickle <struct>    0x20003508  trickle_t   
        p_packet    0x00000000  0x20003518  mesh_packet_t * 
    [4] <struct>    0x2000351C  data_entry_t    
        trickle <struct>    0x2000351C  trickle_t   
        p_packet    0x00000000  0x2000352C  mesh_packet_t * 
    [5] <struct>    0x20003530  data_entry_t    
        trickle <struct>    0x20003530  trickle_t   
        p_packet    0x00000000  0x20003540  mesh_packet_t * 
    [6] <struct>    0x20003544  data_entry_t    
        trickle <struct>    0x20003544  trickle_t   
        p_packet    0x00000000  0x20003554  mesh_packet_t * 
    [7] <struct>    0x20003558  data_entry_t    
        trickle <struct>    0x20003558  trickle_t   
        p_packet    0x00000000  0x20003568  mesh_packet_t *

As you can see, the data cache is empty and no handle caches are referencing the data cache.

In data_entry_allocate() there is a loop at the top checking for unused data cache entries, but this loop stops after RBC_MESH_DATA_CACHE_ENTRIES because the variable 'allocated' is static and never reset back to 0. So, after the data cache has been filled RBC_MESH_DATA_CACHE_ENTRIES times, it starts looking for unused handle cache entries to free them, but as you can see in my case, the handle cache is unlinked and the data cache is free, so nothing happens and the mesh can no longer send any packets.

My fix for now (which seems to work perfectly) is to change this line in data_entry_allocate from: for (uint32_t i = allocated; i < RBC_MESH_DATA_CACHE_ENTRIES; ++i) to for (uint32_t i = 0; i < RBC_MESH_DATA_CACHE_ENTRIES; ++i) disregarding the 'allocated' variable and making use of the initial loop to check for 'free' data cache entries again. The 'allocated' variable seems to be useless, because the data cache already has a way of being marked as 'in use' or 'free'.

Please share your opinion on this fix, perhaps I'm missing something.

trond-snekvik commented 7 years ago

Hi, sorry for the poor response time.

That's a nice analysis, and a perfectly valid fix to a pretty stupid bug :+1:

I'll push your fix :)