RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org

[RFC] netdev: change receive mechanism #11652

Closed jia200x closed 4 years ago

jia200x commented 5 years ago

Description

While working on #11483, I started evaluating some alternative receive procedures that could help to improve the performance of transceivers and maintainability.

Our current receive procedure using the netdev interface is the following:

  1. ISR from transceiver and message to netif thread (ISR offloading)
  2. Call dev->driver->isr(dev) from thread context
  3. dev->event_callback(NETDEV_EVENT_RX_COMPLETE) is called from dev->driver->isr(dev) function
  4. The event callback gets the packet size with dev->driver->recv(dev, NULL, 0)
  5. The upper layer allocates a buffer and calls dev->driver->recv(dev, buffer, pkt_size)
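
For reference, a minimal sketch of what steps 4 and 5 look like from the netif thread's event callback (my sketch, assuming the 4-argument netdev recv signature with the trailing info pointer and GNRC as the upper layer; error handling omitted):

#include "net/gnrc/pktbuf.h"
#include "net/netdev.h"

static void _event_cb(netdev_t *dev, netdev_event_t event)
{
    if (event == NETDEV_EVENT_RX_COMPLETE) {
        /* first device access: query the frame size only */
        int pkt_size = dev->driver->recv(dev, NULL, 0, NULL);
        /* allocate a buffer of exactly that size in the packet buffer */
        gnrc_pktsnip_t *pkt = gnrc_pktbuf_add(NULL, NULL, pkt_size,
                                              GNRC_NETTYPE_UNDEF);
        if (pkt != NULL) {
            /* second device access: fetch (and drop) the frame */
            dev->driver->recv(dev, pkt->data, pkt_size, NULL);
        }
    }
}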

However, this procedure has some disadvantages: the device has to be accessed twice over SPI (once to query the frame size and once to fetch the frame), and the upper layer has to allocate a buffer for every received frame before it knows anything about its content.

Proposed procedure

I changed the receive procedure to indicate the reception of a packet, and modified the _recv function to simply write to a frame buffer (a.k.a. rx_buf, 127 bytes for IEEE802.15.4). I also added a PHY indication function as a callback for the upper layer when a packet is received, as well as a receive function.

The new procedure looks like:

  1. ISR from transceiver and message to netif thread (ISR offloading)
  2. Call dev->driver->isr(dev) from thread context
  3. dev->event_callback(NETDEV_EVENT_RX_COMPLETE) is called from dev->driver->isr(dev) function.
  4. The event callback calls the dev->driver->recv() function to write directly into a fixed-size frame buffer (static or stack-allocated), without the drop and packet-size logic, and saves the packet size in a psdu_len variable.
  5. Call phy_indication(dev, rx_buf, psdu_len) to indicate to the upper layer that a packet was received (see the sketch below).
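
A corresponding sketch of the proposed flow (phy_indication, rx_buf and psdu_len are part of the proposal, not of the current netdev API; the buffer is sized to the IEEE 802.15.4 maximum frame length):

#include "net/ieee802154.h"
#include "net/netdev.h"

/* proposed upper-layer callback */
extern void phy_indication(netdev_t *dev, uint8_t *psdu, size_t psdu_len);

static uint8_t rx_buf[IEEE802154_FRAME_LEN_MAX];   /* 127 B PSDU buffer */

static void _event_cb(netdev_t *dev, netdev_event_t event)
{
    if (event == NETDEV_EVENT_RX_COMPLETE) {
        /* single device access: write the frame into the fixed-size buffer */
        int psdu_len = dev->driver->recv(dev, rx_buf, sizeof(rx_buf), NULL);
        if (psdu_len > 0) {
            /* proposed callback: hand frame and length to the upper layer */
            phy_indication(dev, rx_buf, (size_t)psdu_len);
        }
    }
}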

Results

For both setups (current dev->driver->recv and PHY indication), I'm using an AT86RF233 radio, a logic analyzer and GPIO pins to measure time. All times include the gnrc_pktbuf allocation, which can be reduced for the PHY indication if the MAC layer is processed directly in rx_buf.

The logic for the measurement is:

  1. Set a GPIO pin when RX_DONE is detected within the dev->driver->isr(dev) function
  2. Clear the GPIO pin right after netif->ops->recv() returns (see the sketch below)
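
A sketch of how this instrumentation could look with RIOT's periph/gpio API (the pin choice is arbitrary; the two helpers would be called from the places described above):

#include "periph/gpio.h"

static const gpio_t debug_pin = GPIO_PIN(0, 17);   /* arbitrary pin */

void measure_init(void)
{
    gpio_init(debug_pin, GPIO_OUT);
}

/* call from dev->driver->isr(dev) when RX_DONE is detected */
void measure_start(void)
{
    gpio_set(debug_pin);
}

/* call right after netif->ops->recv() returns */
void measure_stop(void)
{
    gpio_clear(debug_pin);
}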

For the PHY indication setup, I modified the dev->driver->recv function to (sort of) preserve the original behavior:

static int _recv(netdev_t *netdev, void *buf, size_t len, void *info)
{
    (void)netdev; (void)len; (void)info;
    /* the frame was already written to the static rx_buf on RX_COMPLETE */
    if (buf != NULL) {
        memcpy(buf, rx_buf, psdu_len);
    }
    return psdu_len;
}

Here is a graph comparing our current dev->driver->recv and the phy_indication mechanism for packets with different PSDU size:

This is the time overhead (%) of the current dev->driver->recv() procedure:

Note there's data duplication between rx_buf and the packet buffer, but it could be reduced if the MAC layer were processed directly from the frame buffer.

I've seen this pattern at least in OpenThread, OpenWSN and Linux (local buffers on the stack). Contiki also has a similar approach.

Useful links

miri64 commented 5 years ago

For the most part I imagine that this makes the implementation for some devices easier and, as you have shown, the handling of received packets faster. However, I don't like that this requires an extra buffer and thus potentially breaks the copy-once paradigm a network stack might want to ensure. Moreover, for link layers with smaller packet sizes this might be fine, but e.g. all Ethernet devices would require an extra 1500 B of RAM with that. This is IMHO a no-go, especially if it is not necessarily required.

kaspar030 commented 5 years ago

Yeah, allocations are slow. So are context switches. Maybe a network stack should avoid those. ;)

This doesn't need to be changed in netdev. The (admittedly somewhat complicated) recv logic especially allows leaving the buffer handling to the stack. Somewhere up, gnrc should maybe use static input buffers instead of dynamic packet snips. Netdev and the drivers support both dynamic and static packet allocation. I'd rather not change the logic to force static buffers.

jcarrano commented 5 years ago

My impression here is that these problems are caused in great part by not being able to start I/O operations (i.e. SPI transfers) from interrupts.

If that was possible and if a fixed size buffer was already available at reception time, then:

  1. The driver gets a RX interrupt.
  2. A (nonblocking) transfer operation is started from the interrupt.
  3. Upon completion, the upper layer is notified (via a message).

What am I missing?
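
For concreteness, a sketch of that idea. Note that everything here is hypothetical: RIOT's periph/spi API is blocking, so spi_transfer_bytes_async() with a completion callback does not exist and is only assumed for illustration:

#include "net/ieee802154.h"
#include "net/netdev.h"
#include "periph/spi.h"

static uint8_t rx_buf[IEEE802154_FRAME_LEN_MAX];   /* fixed RX buffer */

/* 3. transfer finished: notify the upper layer from the SPI callback */
static void _spi_done(void *arg)
{
    netdev_t *dev = arg;
    dev->event_callback(dev, NETDEV_EVENT_RX_COMPLETE);
}

/* 1. RX interrupt from the transceiver */
static void _radio_irq(void *arg)
{
    netdev_t *dev = arg;
    /* 2. start a non-blocking read of the frame buffer from the ISR
     * (hypothetical asynchronous SPI call) */
    spi_transfer_bytes_async(SPI_DEV(0), SPI_CS_UNDEF, false, NULL, rx_buf,
                             sizeof(rx_buf), _spi_done, dev);
}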

miri64 commented 5 years ago

If that was possible and if a fixed size buffer was already available at reception time, then:

That is the catch you are missing ;-). We can't have a fixed size buffer, guarantee zero copy, and be able to handle multiple packets without massive memory constraints.

kaspar030 commented 5 years ago

My impression here is that these problems are caused in great part by not being able to start I/O operations (i.e. SPI transfers) from interrupts.

Why? A context switch to userspace takes ~10us. That is the amount of time one spi byte takes on the wire at 100k bits/s. There's not much to gain by not switching to userspace before starting the transfer (which can then be synchronous or asynchronous).

Also, in an ideal world, all network devices are connected using something supporting asynchronous IO. Practically, which actually does? The answer is probably some devices on some of the RIOT-supported platforms in some configurations.

The current architecture works with DMA/non-DMA UART/SPI/I2C/register based/posix read/write/whatever, there's no constraint on whether asynchronous IO is possible.

And the used buffer doesn't need to be available at interrupt time.

Upon completion, the upper layer is notified (via a message).

For the record, messages (as in, core/msg) are unsuitable for ISR->thread communication because messages get (silently) dropped when the receiver thread is either not waiting or the message queue is full.

jcarrano commented 5 years ago

I was not concerned with switching times, but with overall complexity. If you look at both procedures described in the issue, you see several inversions of control. This logic is IMO quite hard to follow.

For the record, messages (as in, core/msg) are unsuitable for ISR->thread communication because messages get (silently) dropped when the receiver thread is either not waiting or the message queue is full.

It's not me who wrote that.

miri64 commented 5 years ago

For the record, messages (as in, core/msg) are unsuitable for ISR->thread communication because messages get (silently) dropped when the receiver thread is either not waiting or the message queue is full.

That's what #9326 is for ;-).

kaspar030 commented 5 years ago

I was not concerned with switching times, but with overall complexity. If you look at both procedures described in the issue, you see several inversions of control. This logic is IMO quite hard to follow.

Prove me wrong, but overall complexity will increase substantially as soon as portable asynchronous IO on ISR level is added.

Also, please come up with a logic that is "easier to follow" but a) does not prescribe the method of user space notification and b) does not prescribe the kind of memory allocation.

It's not me who wrote that.

Then I misunderstood your suggestion to use a message for notification.

jia200x commented 5 years ago

Moreover, for link layers with smaller packet sizes this might be fine, but e.g. all Ethernet devices would require an extra 1500 B of RAM with that. This is IMHO a no-go, especially if it is not necessarily required.

Indeed, for Ethernet it is a different story. In fact, most OSes allocate memory for link layers with huge MTUs. This is usually not the case for link layers like IEEE802.15.4 (in fact, some OSes receive the packet into stack variables and then directly call the phy_indication function). See for instance this and this.

Yeah, allocations are slow. ;)

Yes, indeed, but the graphs don't show the effect of allocation because in both setups I'm calling gnrc_pktbuf_add. The only difference is an extra rx_buf I defined in the second case, so there's an extra memcpy (but still faster than calling netdev->driver->recv twice).

I have the impression that if I did all the MAC logic without allocation, the time difference would be even bigger (I might try to find some time to show some results there).

This doesn't need to be changed in netdev. The (admittedly somewhat complicated) recv logic especially allows leaving the buffer handling to the stack. Somewhere up, gnrc should maybe use static input buffers instead of dynamic packet snips.

I agree it's possible to do both with netdev.

(Sorry for the long explanation, but I need to present some context)

Slightly out of the scope of this issue, but while working for some months on the PHY/MAC layer rework, I've noticed that a generic network abstraction with a southbound API to network devices might not be the best idea:

  1. Below the MAC layer, most PHY and network devices (RF, Ethernet) work in completely different ways. The MAC<->PHY interface for IEEE802.15.4 looks totally different from the one for Ethernet or LoRaWAN. The same goes for the PHY<->network device interface. Trying to make a generic interface at these layers doesn't make much sense, considering they are usually well defined and not interchangeable (an Ethernet MAC shouldn't talk to a LoRaWAN PHY in most cases).

A concrete example is the NETOPT_PRELOADING option. It was added to keep a frame in the transceiver's frame buffer before calling tx_exec. This is a native operation of an IEEE802.15.4 transceiver, and an IEEE802.15.4 PHY layer already knows how to "prepare" and "transmit" without needing to add this logic.
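
For illustration, this is roughly how preloading is driven through netdev today (a sketch based on the documented semantics: with NETOPT_PRELOADING enabled, send() only loads the frame, and switching the state to NETOPT_STATE_TX triggers the transmission):

#include "net/netdev.h"
#include "net/netopt.h"

/* dev and iolist are assumed to be set up elsewhere */
static void mac_tx_preloaded(netdev_t *dev, const iolist_t *iolist)
{
    netopt_enable_t enable = NETOPT_ENABLE;
    dev->driver->set(dev, NETOPT_PRELOADING, &enable, sizeof(enable));

    /* "prepare": with preloading enabled, send() only loads the frame buffer */
    dev->driver->send(dev, iolist);

    /* "transmit": switching the state to TX fires the preloaded frame */
    netopt_state_t state = NETOPT_STATE_TX;
    dev->driver->set(dev, NETOPT_STATE, &state, sizeof(state));
}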

Another example: transceivers are not aware of PAN ID, short address, long address, etc. Some devices (at86rf2xx) include helpers to deal with common MAC layer operations (hw address filtering, etc.) and have functions to load the hw address, but the driver is still not aware of the addresses. This means there's a specific MAC<->device interface for setting hw filter addresses that won't be present in other PHY/MAC layers.
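
In netdev terms, that MAC<->device interface is essentially a handful of set() calls (a sketch; NETOPT_NID carries the PAN ID, NETOPT_ADDRESS the short address and NETOPT_ADDRESS_LONG the EUI-64):

#include "net/ieee802154.h"
#include "net/netdev.h"

/* the MAC layer owns the addresses; the driver only programs the
 * transceiver's hardware filter */
static void mac_set_hw_filter(netdev_t *dev, uint16_t panid,
                              const uint8_t *short_addr, const uint8_t *long_addr)
{
    dev->driver->set(dev, NETOPT_NID, &panid, sizeof(panid));
    dev->driver->set(dev, NETOPT_ADDRESS, short_addr, IEEE802154_SHORT_ADDRESS_LEN);
    dev->driver->set(dev, NETOPT_ADDRESS_LONG, long_addr, IEEE802154_LONG_ADDRESS_LEN);
}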

Usually the convergence layer goes on top of the MAC. See how Linux "net_dev" diverges on top of mac80211 and the Ethernet driver (from Linux.com):

This lack of interfaces between MAC, PHY and transceivers is one of the reasons why we have MAC/PHY components scattered everywhere (RF, netdev, gnrc_netif) and why we don't have more complex MAC layers (although we can send IEEE802.15.4 frames, we don't have an IEEE802.15.4 MAC layer with security, commissioning, coordinators, beacons, etc.).

The netdev interface can work fine with a southbound API to MAC layers (or directly on top of devices with a very simple MAC/PHY). I drew this picture some time ago to show how the netdev interface can work on top of a SoftMAC (software MAC) and a FullMAC (MAC implemented in the device, e.g. the CA8210):

  1. There are well-defined and stable northbound and southbound APIs for MAC/PHY. Check for instance https://github.com/RIOT-OS/RIOT/issues/11324. We are reinventing the wheel each time we try to run glue code there.

  2. In practice, the API on top of the network device is not sufficient for properly sending packets, doing CSMA, etc. In order to send data, it's required to be on top of the MAC layer, not on top of the device. Since we use netdev directly on top of the network device, we have to write part of the MAC layer each time we want to send data (see [gnrc_netif](https://github.com/RIOT-OS/RIOT/blob/master/sys/net/gnrc/netif/ieee802154/gnrc_netif_ieee802154.c#L232) and the AT86RF2xx test driver). The semantics of "send", "recv", "get" and "set" are not the best for a transceiver API (but can work on top of a MAC layer).

Maybe it's not the smartest idea to open this discussion here; I can open a separate issue if you want.

Going back to the original discussion and the points mentioned above, it could be possible to have different receive procedures depending on the MAC layer (it makes a lot of sense to dynamically allocate 1500 bytes for Ethernet, but maybe not 127 bytes for IEEE802.15.4).

Opinions?

kaspar030 commented 5 years ago

The only difference is an extra rx_buf I defined in the second case, so there's an extra memcpy (but still faster than calling netdev->driver->recv twice).

If calling the receive function twice is slower, it means that accessing the device twice (via SPI) is the bottleneck here. To mitigate this, the "next packet size" could be cached, or L2 could just assume a maximum packet size (as it does for Ethernet) and then save the first call.
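
A sketch of the caching idea (my assumption of how a driver could implement it; the descriptor and field names are made up): the frame length is read once while handling the RX interrupt and stored in the device descriptor, so the size-only recv() call no longer needs an SPI transfer:

#include "net/netdev.h"

typedef struct {
    netdev_t netdev;
    uint8_t cached_len;          /* PSDU length read once in isr() */
} my_radio_t;                    /* hypothetical device descriptor */

static int _recv(netdev_t *netdev, void *buf, size_t len, void *info)
{
    my_radio_t *dev = (my_radio_t *)netdev;
    (void)len; (void)info;
    if (buf == NULL) {
        /* size query answered from the cache: no SPI access needed */
        return dev->cached_len;
    }
    /* ... read dev->cached_len bytes of the frame over SPI into buf ... */
    return dev->cached_len;
}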

jia200x commented 5 years ago

@kaspar030 yes, caching the next packet size would save some time.

What do you mean by "L2 could just assume a maximum packet size (as it does for Ethernet)"? Having a fixed static buffer there?

kaspar030 commented 5 years ago

What do you mean by "L2 could just assume a maximum packet size (as it does for Ethernet)"? Having a fixed static buffer there?

IIUC you've shown that the actual allocation is not as bad as the extra "recv()" call, as the latter touches the device over SPI. For netdev_tap (sorry, not Ethernet in general), as the size of a packet cannot be figured out in advance, GNRC always allocates a whole link-layer MTU (1500 B), retrieves the packet, then shrinks the allocation to what has actually been used. GNRC could do the same for e.g. 802.15.4: always allocate 127 B, then shrink the allocation in pktbuf after the packet has been read (and thus the size is known).
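
A minimal sketch of that allocate-then-shrink pattern for 802.15.4 (my sketch; it relies on gnrc_pktbuf_realloc_data() to hand back the unused tail):

#include "net/gnrc/pktbuf.h"
#include "net/ieee802154.h"
#include "net/netdev.h"

static gnrc_pktsnip_t *_recv_max_then_shrink(netdev_t *dev)
{
    /* allocate the maximum 802.15.4 frame length (127 B) up front */
    gnrc_pktsnip_t *pkt = gnrc_pktbuf_add(NULL, NULL, IEEE802154_FRAME_LEN_MAX,
                                          GNRC_NETTYPE_UNDEF);
    if (pkt == NULL) {
        return NULL;
    }
    int nread = dev->driver->recv(dev, pkt->data, IEEE802154_FRAME_LEN_MAX, NULL);
    if (nread <= 0) {
        gnrc_pktbuf_release(pkt);
        return NULL;
    }
    /* give the unused tail back to the packet buffer */
    gnrc_pktbuf_realloc_data(pkt, (size_t)nread);
    return pkt;
}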

miri64 commented 5 years ago

This may lead to virtual packet loss because (even if only for a short while) you are allocating more memory than actually needed: if the whole MTU does not fit into the packet buffer (but the actual packet would), we can't do anything but abort the receive even though it might have been possible.

kaspar030 commented 5 years ago

we can't do anything but abort the receive

The "recv()" call determining the actual size could be a fallback option:

gnrc_pktsnip_t *pkt = gnrc_pktbuf_add(NULL, NULL, MTU, GNRC_NETTYPE_UNDEF);
if (!pkt) {
    /* pktbuf can't fit a whole MTU right now: query the actual size first */
    int pktsize = dev->driver->recv(dev, NULL, 0, NULL);
    pkt = gnrc_pktbuf_add(NULL, NULL, pktsize, GNRC_NETTYPE_UNDEF);
}
if (!pkt) {
    /* pktbuf really is full: drop the frame */
}

miri64 commented 5 years ago

If that still exists I would be fine with that :-).

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

fjmolinas commented 4 years ago

@jia200x just to confirm this issue is not relevant anymore?

jia200x commented 4 years ago

@fjmolinas the discussion of the netif rework is going in the direction of having specific HALs for each technology (as described in https://github.com/RIOT-OS/RIOT/issues/12469), so I would say it's fine to close this one.