Mesh packet pool overflow due to duplicate packets.

bayou9 commented 7 years ago

Hello, please refer to this issue: (1) https://github.com/NordicSemiconductor/nRF51-ble-bcast-mesh/issues/167

and this issue: (2) https://github.com/NordicSemiconductor/nRF51-ble-bcast-mesh/issues/79

for more information.

I'm having the same issue as (2), and (1) was my previous issue, the mesh_packet pool easily run out of memory, because it works on a "allocate a slot in packet pool first, question the content of the packet never!" basis, I'm currently runing a pool with the size of 43, and I have tried to increased it to 300 but to no avail, other than prolonging the time.

I have tried to purge the duplicate members of the packet pool by implementing a simple search and replace mechanism, please refer to the following code:

bool mesh_packet_acquire(mesh_packet_t** pp_packet) { uint32_t was_masked;

for (uint32_t i = 0; i < RBC_MESH_PACKET_POOL_SIZE; ++i)
{
    _DISABLE_IRQS(was_masked);
    if (g_packet_refs[i] == 0) /* no refs, free to use */
    {                        //This is how things work:loop through all packet refs to see if it's 
        g_packet_refs[i] = 1;//a 0. If it is, then set it to 1, meaning it's claimed.

    *pp_packet = &g_packet_pool[i];//now that above steps had acquired the proper, free index number i, we can allocate a pointer to point at it.
        _ENABLE_IRQS(was_masked);
        return true;
    }

else if(i>=4)
{
    for (uint32_t j = 0; j <=i ; j++)
    {
        if (g_packet_pool[j].payload[4] == g_packet_pool[i].payload[4])
            if (g_packet_pool[j].payload[5] == g_packet_pool[i].payload[5])
                if (g_packet_pool[j].payload[9] == g_packet_pool[i].payload[9])
                {
                    memcpy(g_packet_pool+j, g_packet_pool+i, sizeof(mesh_packet_t));

                    g_packet_refs[j]++;
                    g_packet_refs[i]=1;
                    memset(g_packet_pool+i, 0, sizeof(mesh_packet_t));
                    *pp_packet = &g_packet_pool[i];
                    _ENABLE_IRQS(was_masked);
                            return true;
                }

         }      

    }
    _ENABLE_IRQS(was_masked);
}
APP_ERROR_CHECK(NRF_ERROR_NO_MEM);
return false;

}

But it doesn't work, and here is the call stack window information: [http://imgur.com/a/nzbwf]

Clearly things went wrong when transmitting all instances.

So my question:

what went wrong? 2.how do I prevent the overflowing of packet pool? This is getting way too consuming.

Thanks for anyone's help!

twvd commented 7 years ago

You cannot just purge duplicate packets; the reference counting system takes care of memory management for you. If you start freeing packets on your own there will be problems with other pieces of code that still have pointers to the packet you are freeing and are still assuming they own them.

What is your application code doing, exactly? I have run stress tests with a pool size of 45 and have never had any problems running out of pool space. I think it is more likely there is a packet not being freed somewhere, filling the packet pool.

bayou9 commented 7 years ago

Hellol @twvd , thank you for your reply. My code sends and receives commands, only that it identifies every single node using MAC address.

I believe you are most likely right in assuming I can't just release packet pools on my own. So my questions:

Is doing this necessary at all? By that I mean there must be the code that detects replicate packets/ packets with the same handle and force them to use the same packet? Or not... I mean it could be that even if packets are the same it still allocate a different slot in the packet pool? Where is the section of code weeding out replication located?
If you have to modify the code to force packets with the same handle to use the same slot in the packet pool, where would be the best place to do that? For example, in vh_rxr?
Per question above, what are the things to keep in mind when modifying the code?

If you want I can attach my files but at this point I'm not sure whether that's a good idea, because if the code itself prevents the duplication then clearly my modification of the original code must have caused the problem, if the original code does not have such a function at all then clearly I should look into it before posting my code.

trond-snekvik commented 7 years ago

Hi there,

Like Thomas, we run stress tests for this problem regularly (since it's such a likely bug), and haven't really found any problems in the current version of the mesh, as long as the packet pool is larger than or equal to default in rbc_mesh.h:L102.

The handle storage will keep received packets around as long as they represent the most recent version of their handle. The handle storage will hold a single reference for each of these packets. You can check whether you have any packets that represent the same handle, but different versions. If you do, you may have a memory leak somewhere. Note that there will be unprocessed packets in the queues, that will have the same handle as some of the others.

Have you done any modifications to the mesh code? Does your application ALWAYS free an event-packet after handling it? Is your data and handle caches larger than the set of used handles in your mesh (they don't have to be, but this has been a cause of this error in development before)?

Do not release packets on your own. This will cause some packets to be referenced by the handle storage after they've been freed, and will trigger odd behavior. If you have to do this, you should call data_entry_free() in handle_storage.c:97

NordicPlayground / nRF51-ble-bcast-mesh

Mesh packet pool overflow due to duplicate packets. #169