OpenCyphal / libcanard

A compact implementation of the Cyphal/CAN protocol in C for high-integrity real-time embedded systems
http://opencyphal.org
MIT License
339 stars 195 forks source link

Improve RPC Client Response RX pipeline #228

Closed serges147 closed 2 months ago

serges147 commented 2 months ago

Currently I observe RPC client response timeouts b/c of periodic drops of incoming RX frames. Here is PDF diagram build on candump analysis (attached as well). Zubax_LibCyphal_CanTraffic.pdf candump13-2024-09-11_163217.log candump02-2024-09-11_163221.log

I'm not sure whether it is correct fix or not but the following patch makes such timeouts go away:

image
pavel-kirienko commented 2 months ago

The !same_transport condition is redundant because a frame with a matching transfer-ID would be accepted if it was the same iface anyway. This leaves us with two preconditions: whether the transfer-ID matches the expectation and whether the reassembly state machine is not currently mid-transfer.

The removal of the constraint that a new transfer must arrive from the same interface as the previous one (the same can be achieved by removing the correct_iface constraint from rxSessionUpdate) is generally not sound, because it may cause transfer duplication if the non-current interface is lagging behind the current one by exactly 32 transfers, which may occur in certain failure modes.

A possible solution that does not introduce undesirable side effects is to add a new acceptance condition:

    const bool restartable = (same_transport && tid_new) ||      // unchanged
                             (same_transport && tid_timeout) ||  // unchanged
                             (tid_timeout && tid_new) ||         // unchanged
                             (tid_timeout && tid_match && idle);  // new

where:

const bool tid_match = rxs->transfer_id == frame->transfer_id;
const bool idle = 0U == rxs->total_payload_size;

The purpose of tid_match is to eliminate duplicates arriving from redundant interfaces and to make the condition more selective such that it only activates if the transfers arrive in strictly the correct ordering. Without this check, per the diagram, we would accept the transfer with TID=0 twice. The idle condition is needed to prohibit changing the active interface mid-transfer as it may cause both old and new transfers to be lost.

Can you please check this?