bakerstu / openmrn

OpenMRN (Open Model Railroad Network)
BSD 2-Clause "Simplified" License

Global identify events will cause packet loss #6

Closed balazsracz closed 9 years ago

balazsracz commented 11 years ago

The current implementation of global identify events is as follows:

• the event packet is picked up by the CANbus RX thread
• it ends up in nmranet_if_rx_data
• which calls nmranet_event_packet_global
• that iterates over all virtual nodes and calls nmranet_event_packet_addressed on each
• each node in turn will traverse its event list
• and generate an identify message for each of them
• which will be put into the TX queue
• the TX queue will not be processed due to the likely packet collisions on the bus
• the TX queue will get full and the next write will block the RX thread
• while there are lots of incoming packets arriving from other nodes on the network
• eventually overflowing the RX queue.

The first of those incoming packets will be processed only once all the event identify packets have been enqueued into the TX queue.

bobjacobsen commented 11 years ago

On May 14, 2013, at 3:01 AM, balazsracz notifications@github.com wrote:

The current implementation of global identify events is as follows:

• the event packet is picked up by the CANbus RX thread
• it ends up in nmranet_if_rx_data
• which calls nmranet_event_packet_global
• that iterates over all virtual nodes and calls nmranet_event_packet_addressed on each
• each node in turn will traverse its event list
• and generate an identify message for each of them
• which will be put into the TX queue
• the TX queue will not be processed due to the likely packet collisions on the bus
• the TX queue will get full and the next write will block the RX thread
• while there are lots of incoming packets arriving from other nodes on the network
• eventually overflowing the RX queue.

The first of those incoming packets will be processed only once all the event identify packets have been enqueued into the TX queue.

There are a couple ways to handle things like this. I don't know details of how the OpenMRN code is organized, so am not sure about which is easiest to use, which meets goals of the library, etc.

One approach is just to have a bit in each Producer (and Consumer; I'll omit mentioning both for the rest of this email) that says "ProducerIdentified has been requested". Then, the node will service the set bits by sending the messages on a bandwidth-available basis. A couple notes:

*) It's OK that these bits can be set while already set. If two IdentifyEvents come in, with the second received before the reply to the first is sent, the bit will be set while already set and only one reply will be sent. That's OK.

*) It's useful, but not vital, for the code that services these replies to be aware of the other things that can be handled in a similar way (ConsumerIdentified, PIP replies, etc.) and service them in priority order.
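The pending-bit approach can be sketched in a few lines. This is a minimal illustration with hypothetical names, not the actual OpenMRN API: one bit per producer, set (possibly redundantly) by an incoming IdentifyEvents, and cleared one at a time as the node finds transmit bandwidth.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the "ProducerIdentified has been requested" bit approach.
// Names here are hypothetical; this is not the actual OpenMRN API.
class ProducerReplyTracker {
 public:
  explicit ProducerReplyTracker(size_t num_producers)
      : pending_(num_producers, false) {}

  // Called when a global IdentifyEvents arrives. Setting a bit that is
  // already set merges the two requests into a single eventual reply.
  void OnIdentifyEvents() {
    for (size_t i = 0; i < pending_.size(); ++i) pending_[i] = true;
  }

  // Called when the TX queue has room. Returns the index of the next
  // producer needing a ProducerIdentified reply, or -1 if none pending.
  int NextPendingReply() {
    for (size_t i = 0; i < pending_.size(); ++i) {
      if (pending_[i]) {
        pending_[i] = false;
        return static_cast<int>(i);
      }
    }
    return -1;
  }

 private:
  std::vector<bool> pending_;
};
```

Note how two back-to-back IdentifyEvents requests collapse into one sweep of replies, which is exactly the "set while already set" behavior described above.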

Another approach is to provide a TX queue that's large enough to maintain the maximum reply chain. This actually is a bounded quantity in OpenLCB, though if there are lots of (virtual) nodes with lots of producers and consumers, it can get large. You can compute the size for a particular configuration based on the protocols supported, number of items, etc and make sure you've got that much buffer space. Some notes:

*) Keeping it bounded requires removing duplicate messages, where allowed. For example, you only need to queue one ProducerIdentified with a specific EventID and state. Alternately, only one from a particular Producer.

*) It is important that the RX/TX priority mechanism work for the queue to remain bounded. In particular, a CAN node must be able to send when it has content to send, without skipping transmit slots while it's e.g. thinking about something. This is what gets replies out of the queue before more requests can come in. Without it, queue loads can grow very quickly.
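The duplicate-removal rule that keeps the queue bounded could look like this sketch (the message struct and class names are made up for illustration): the enqueue path scans for an identical waiting message and drops the new one if found, so at most one ProducerIdentified per (EventID, state) pair is ever queued.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical TX message: an EventID plus the producer's state
// (whether the event is currently active). Not the actual OpenMRN types.
struct TxMessage {
  uint64_t event_id;
  bool valid;
  bool operator==(const TxMessage& o) const {
    return event_id == o.event_id && valid == o.valid;
  }
};

// TX queue that suppresses duplicate ProducerIdentified messages, which
// bounds the queue size for a given set of producers.
class DedupTxQueue {
 public:
  // Returns true if the message was enqueued, false if an identical
  // message was already waiting to be sent.
  bool Enqueue(const TxMessage& msg) {
    for (const TxMessage& m : queue_) {
      if (m == msg) return false;
    }
    queue_.push_back(msg);
    return true;
  }

  size_t size() const { return queue_.size(); }

 private:
  std::deque<TxMessage> queue_;
};
```

The linear scan is fine for small queues; a node with many producers might use a hash set keyed on (EventID, state) instead.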

Bob


balazsracz commented 11 years ago

I'm not sure that having queue length computations would be viable for anything except the simplest nodes. OpenMRN does not seem to me to be designed for the requirements of simple nodes -- it is way too heavy.

In general I am thinking more of an iterator approach to identify-events. This also needs only constant memory. The node would keep something like a pointer to the next producer/consumer in the list for which an Identified message is to be sent out. As room becomes available in the TX queue, the node advances this iterator and generates the messages. If another Identify Events message comes in, it simply resets the iterator to the beginning.
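That iterator idea fits in a handful of lines. A rough sketch, again with hypothetical names: a single index into the producer/consumer list replaces per-message queueing, so a global Identify Events costs O(1) memory no matter how many entries the node has.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the iterator approach to global Identify Events.
// Names are hypothetical; this is not the actual OpenMRN API.
class IdentifyIterator {
 public:
  explicit IdentifyIterator(size_t num_entries)
      : num_entries_(num_entries), next_(num_entries) {}  // starts "done"

  // A new Identify Events request simply restarts the sweep.
  void OnIdentifyEvents() { next_ = 0; }

  // Called whenever the TX queue has room: returns the index of the next
  // producer/consumer whose Identified message should be generated, or
  // -1 when the sweep is complete.
  int Advance() {
    if (next_ >= num_entries_) return -1;
    return static_cast<int>(next_++);
  }

 private:
  size_t num_entries_;
  size_t next_;
};
```

A request arriving mid-sweep resets the iterator, so some entries may be identified twice, but that is harmless and keeps the state to a single index.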

From the implementation perspective, the continuous queue drain is fortunately given by the design of FreeRTOS -- there are either separate threads handling the draining of the queue, or it is drained from an ISR. Even if the priorities are not set right (which would be a bug), when the generating thread blocks there will be timeslices for the draining threads.

So in general a plausible direction would be to start another thread for sending out the Identified messages. That thread would progress only as fast as transmit queue capacity allows. The nice part is that if it is set to a low priority, and another thread needs to send a different message (e.g. an event report), then by design of FreeRTOS the higher priority thread would be woken up. I hope. But at least they get round-robin treatment.
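The drain scheme above can be sketched with plain std::thread and a condition variable rather than FreeRTOS primitives (priorities omitted; this only illustrates the blocking behavior): the generating thread blocks whenever the TX queue is full, and a separate drain thread keeps the queue moving, so the generator can never run ahead of transmit capacity.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// Bounded TX queue: Push() blocks while full, Pop() blocks while empty.
// This models the back-pressure that keeps the Identified-message
// generator in step with available transmit bandwidth.
class BoundedTxQueue {
 public:
  explicit BoundedTxQueue(size_t capacity) : capacity_(capacity) {}

  void Push(int msg) {
    std::unique_lock<std::mutex> l(mu_);
    not_full_.wait(l, [this] { return q_.size() < capacity_; });
    q_.push(msg);
    not_empty_.notify_one();
  }

  int Pop() {
    std::unique_lock<std::mutex> l(mu_);
    not_empty_.wait(l, [this] { return !q_.empty(); });
    int msg = q_.front();
    q_.pop();
    not_full_.notify_one();
    return msg;
  }

 private:
  size_t capacity_;
  std::mutex mu_;
  std::condition_variable not_full_, not_empty_;
  std::queue<int> q_;
};
```

On FreeRTOS the same shape falls out of a fixed-size queue plus task priorities; the crucial property is only that the drain side always makes progress while the generator is blocked.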

The big problem is that threads are expensive on an embedded device due to the stack space needed. I talked about this in a separate issue.

dpharris commented 11 years ago

Hi All --

This points to the need to decide and document the (size/capability) goals of a particular code implementation. I am not sure one implementation can be targeted at all nodes, whether large or small, since some fundamental choices must be made which may make that impossible. OpenMRN (I assume) is directed at larger nodes, which have larger, more capable MPUs and adequate memory space. The original Arduino implementation is targeted at smaller nodes with limited resources.

However, as an aside, the price point of 'larger' MPUs continues to sink into 'smaller' MPU territory. Design decisions meant to keep smaller nodes in the fold may need serious rethinking. But I think it is best to still consider the two cases separately, since I suspect there will always be small nodes, designed for simple tasks, which can use the least expensive MPUs.

David


bakerstu commented 9 years ago

The fundamental architecture has changed. This issue is no longer relevant.