apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0

NuttX Net & IOB DMA incompatibility #6835

Open PetervdPerk-NXP opened 2 years ago

PetervdPerk-NXP commented 2 years ago

Currently, almost all Ethernet MAC drivers with DMA work roughly in the following manner:

sequenceDiagram
    participant DMA
    participant MAC driver
    MAC driver->>MAC driver: Allocate rxbuffer[N * (MTU + DESC + PAD)]
    MAC driver->>DMA: Setup DMA Engine
    DMA ->>MAC driver: IRQ Event packet received<br> and copied to rxbuffer[]
    MAC driver ->>Workqueue: Schedule workqueue<br> to handle packet
    Workqueue ->>Net: Process packet <br>(TCP Assembly, Checksum etc)<br> send to network stack
    Net ->> IOB: Push packet data to IOB buffer <br>and notify socket listener
    IOB ->> App : App executes read(s, buffer[]) <br>copy data from IOB to local buffer

As a result, the same packet data ends up in three places:

MAC: rxbuffer
IOB: buffer
App: local buffer

Ideally, we want the MAC DMA engine to copy the packet directly to the IOB, giving the following scheme.

sequenceDiagram
    participant DMA
    participant MAC driver
    participant Workqueue
    participant Net
    MAC driver->>DMA: Setup DMA Engine
    DMA ->> IOB : Copy Packet to IOB
    DMA ->>MAC driver: IRQ Event packet received<br> and copied to iob
    MAC driver ->>Workqueue: Schedule workqueue<br> to handle packet
    Workqueue ->>Net: Send to network stack
    Net ->> MAC driver: Since data is in MAC specific DMA format<br>Callback to MAC to parse received data
    alt CRC Done by MAC
        Net->>MAC driver: Request bit to check CRC is correct
    else CRC Done in software
        Net->>Net: Calculate CRC and verify
    end
    Net ->> App:  Notify socket listener
    IOB ->> App : App executes read(s, buffer[]) <br>copy data from IOB to local buffer

The biggest problem, though, is that each MAC has its own representation of the DMA descriptor + buffer.
IMXRT: https://github.com/apache/incubator-nuttx/blob/5d12e350da31324e632ee7da3a3062b05212b74c/arch/arm/src/imxrt/hardware/imxrt_enet.h#L650-L662
STM32H7: https://github.com/apache/incubator-nuttx/blob/5d12e350da31324e632ee7da3a3062b05212b74c/arch/arm/src/stm32h7/hardware/stm32_ethernet.h#L662-L670

PR #6834 is trying to address this; however, this is problematic on systems with mixed interfaces that share the NET stack and IOB. For example:

| Interface | MTU (bytes) |
| --- | --- |
| Ethernet | 1518 |
| WIFI | 576 |
| CAN2.0B | 13 |

When adding the DMA descriptor to the IOB as proposed in #6834, each IOB would, in the case of IMXRT, grow by 29 bytes.

I can think of some solutions but I'm not sure what would be best:

  1. Make the IOB buffer peripheral-specific, so for the example above we would get 3 separate IOB buffer pools.
  2. Create a 2nd, DMA-aware IOB (iob_dma) that carries metadata about the DMA descriptor and the peripheral. The network stack can then invoke callbacks to the peripheral to parse the received metadata (this is also a nice place to check for HW offloading and fall back to the SW method).
  3. Create an IOB buffer pool that supports variable data lengths; a good example could be https://github.com/pavel-kirienko/o1heap
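To make option 2 more concrete, here is a minimal sketch of what a DMA-aware IOB with a peripheral callback might look like. Every name in it (`iob_dma_s`, `iob_dma_ops_s`, `rx_csum`, the `example_*` helpers) is hypothetical and not part of NuttX; the status bit value and the always-true software check are placeholders for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result of asking the MAC about a received checksum. */
enum csum_status_e { CSUM_UNKNOWN = 0, CSUM_OK, CSUM_BAD };

struct iob_dma_s;

/* Hypothetical per-peripheral callbacks the network stack may invoke. */
struct iob_dma_ops_s
{
  enum csum_status_e (*rx_csum)(const struct iob_dma_s *iob);
};

/* Hypothetical DMA-aware IOB: payload plus an opaque MAC descriptor. */
struct iob_dma_s
{
  const struct iob_dma_ops_s *ops; /* NULL if the device has no offload */
  void *desc;                      /* MAC-specific, opaque to the stack */
  uint8_t *data;
  uint16_t len;
};

typedef bool (*sw_csum_fn)(const uint8_t *data, uint16_t len);

/* Ask the MAC first; fall back to software when it cannot tell. */
static bool iob_dma_rx_csum_ok(const struct iob_dma_s *iob,
                               sw_csum_fn sw_verify)
{
  enum csum_status_e st = CSUM_UNKNOWN;

  if (iob->ops != NULL && iob->ops->rx_csum != NULL)
    {
      st = iob->ops->rx_csum(iob);
    }

  if (st == CSUM_UNKNOWN)
    {
      return sw_verify(iob->data, iob->len);
    }

  return st == CSUM_OK;
}

/* Example MAC callback: assumes desc points at a status word in which
 * bit 7 flags a checksum error (placeholder layout).
 */
static enum csum_status_e example_mac_rx_csum(const struct iob_dma_s *iob)
{
  const uint32_t *status = iob->desc;
  return (*status & 0x80u) != 0 ? CSUM_BAD : CSUM_OK;
}

/* Placeholder software check; a real one would compute the checksum. */
static bool example_sw_verify(const uint8_t *data, uint16_t len)
{
  (void)data;
  (void)len;
  return true;
}
```

The point of the indirection is that the stack never interprets `desc` itself; only the driver that filled it in does.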

Another thing I want to address is support for the HW offloading bits. There are Ethernet MACs with IP offloading features, such as checksum calculation done by the MAC, that simply use a bit in the DMA descriptor to indicate whether the CRC is correct, or to indicate that an automatic ARP response has been sent.

xiaoxiang781216 commented 2 years ago

Nice diagram!

> PR #6834 is trying to address this; however, this is problematic on systems with mixed interfaces that share the NET stack and IOB. For example:
>
> | Interface | MTU (bytes) |
> | --- | --- |
> | Ethernet | 1518 |
> | WIFI | 576 |
> | CAN2.0B | 13 |
>
> When adding the DMA descriptor to the IOB as proposed in #6834, each IOB would, in the case of IMXRT, grow by 29 bytes.

Until we can adjust the IOB buffer size, letting all netdevs share the biggest header isn't a major concern. Also, the header's share of the space is small compared to the normal MTU size. Is it worth complicating the design (the benefit is ~5% in most cases)?
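The ratio can be checked with a little arithmetic. The sketch below assumes one IOB sized to the interface MTU and uses the 29-byte IMXRT descriptor figure and the MTUs quoted above; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Descriptor overhead relative to the interface MTU, in tenths of a
 * percent (integer math, rounded down).
 */
static uint32_t desc_overhead_permille(uint32_t desc_bytes, uint32_t mtu)
{
  assert(mtu != 0);
  return (desc_bytes * 1000u) / mtu;
}
```

Under those assumptions, `desc_overhead_permille(29, 1518)` gives 19 (about 1.9%), `desc_overhead_permille(29, 576)` gives 50 (about 5.0%), and `desc_overhead_permille(29, 13)` gives 2230 (about 223%), so the cost is small for Ethernet and WIFI but dominates for CAN2.0B-sized buffers.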

> I can think of some solutions but I'm not sure what would be best:
>
> 1. Make the IOB buffer peripheral-specific, so for the example above we would get 3 separate IOB buffer pools.

Yes, it's one solution, but what about layer 2 forwarding (e.g. RNDIS<->Ethernet/WiFi/Modem), which is a feature we plan to add? To achieve zero copy through the whole path, we would need to reserve the biggest header size and share the same allocator.

> 2. Create a 2nd, DMA-aware IOB (iob_dma) that carries metadata about the DMA descriptor and the peripheral. The network stack can then invoke callbacks to the peripheral to parse the received metadata (this is also a nice place to check for HW offloading and fall back to the SW method).

This special handling can be done before the netdev passes the IOB to the TCP/IP stack in the receive direction. The same can be done in the transmit direction, so why do we need a callback here?

> 3. Create an IOB buffer pool that supports variable data lengths; a good example could be https://github.com/pavel-kirienko/o1heap

It could be a solution, but we need to consider that:

  1. Since it is impossible to forecast the incoming packet size, we have to allocate a full packet on the receiving side.
  2. We need to consider the fragmentation problem and the housekeeping overhead.
  3. How do we accumulate the transmit data to form the full packet without reallocating/copying?

Another solution is to make the IOB buffer smaller than the MTU (e.g. 256 instead of 1518) and use the linked-list DMA in the hardware.
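As a rough illustration of that idea, the sketch below links a list of descriptors over 256-byte buffers so that one frame spans several of them. The descriptor layout and all `example_*` names are hypothetical; real MAC descriptors differ per device, as noted earlier in the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_IOB_BUFSIZE 256u /* small IOB size suggested above */

/* Hypothetical, simplified linked-list DMA descriptor. */
struct example_sg_desc_s
{
  uint8_t *buf;
  uint16_t len;
  struct example_sg_desc_s *next;
};

/* Number of small buffers needed to hold one frame (ceiling division). */
static size_t example_iobs_for_frame(size_t frame_len)
{
  return (frame_len + EXAMPLE_IOB_BUFSIZE - 1) / EXAMPLE_IOB_BUFSIZE;
}

/* Point ndesc descriptors at consecutive slices of pool and chain them.
 * Returns the head of the list, or NULL if ndesc is zero.
 */
static struct example_sg_desc_s *
example_sg_link(struct example_sg_desc_s *descs, uint8_t *pool,
                size_t ndesc)
{
  size_t i;

  if (ndesc == 0)
    {
      return NULL;
    }

  for (i = 0; i < ndesc; i++)
    {
      descs[i].buf  = pool + i * EXAMPLE_IOB_BUFSIZE;
      descs[i].len  = EXAMPLE_IOB_BUFSIZE;
      descs[i].next = (i + 1 < ndesc) ? &descs[i + 1] : NULL;
    }

  return &descs[0];
}
```

With these numbers a 1518-byte frame needs 6 buffers (6 x 256 = 1536), so each full-size Ethernet frame would consume 6 descriptors instead of 1.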

> Another thing I want to address is support for the HW offloading bits. There are Ethernet MACs with IP offloading features, such as checksum calculation done by the MAC, that simply use a bit in the DMA descriptor to indicate whether the CRC is correct, or to indicate that an automatic ARP response has been sent.

HW checksum is already supported; you can simply enable NET_ARCH_CHKSUM: https://github.com/apache/incubator-nuttx/blob/master/net/utils/Kconfig#L6-L26

PetervdPerk-NXP commented 2 years ago

> This special handling can be done before the netdev passes the IOB to the TCP/IP stack in the receive direction. The same can be done in the transmit direction, so why do we need a callback here?

To be able to decode the eth_desc_s, which contains extra information such as packet type, timestamp, CRC, MAC filter, VLAN, and error status. Only the MAC driver knows what these fields mean. https://github.com/apache/incubator-nuttx/blob/5d12e350da31324e632ee7da3a3062b05212b74c/arch/arm/src/s32k3xx/hardware/s32k3xx_emac.h#L3076-L3082

> HW checksum is already supported; you can simply enable NET_ARCH_CHKSUM: https://github.com/apache/incubator-nuttx/blob/master/net/utils/Kconfig#L6-L26

That API passes the data pointer to a hardware CRC accelerator. In the case of the S32K3XX this doesn't work, because the MAC itself has a built-in CRC checker that reports the result through the IPCE bit in eth_desc_s once the packet has been received.

#define EMAC_RDES1_IPCE_MASK     (0x00000080u) /* IP Payload Error bit
                                                * IP payload checksum (that is, the TCP, UDP, or ICMP checksum) 
                                                * calculated by the MAC does not match the corresponding checksum 
                                                * field in the received segment. */
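One way a driver could expose this bit without leaking the descriptor layout is to translate it into device-independent flags before handing the packet on. The sketch below assumes the RDES1 status word is passed in and that checksum offload is enabled in the MAC; `EMAC_RDES1_IPCE_MASK` is the define quoted above, while the `EXAMPLE_RXF_*` flags and the function name are hypothetical.

```c
#include <stdint.h>

#define EMAC_RDES1_IPCE_MASK     (0x00000080u) /* From the header quoted above */

#define EXAMPLE_RXF_CSUM_CHECKED (1u << 0)     /* Hypothetical: MAC verified it */
#define EXAMPLE_RXF_CSUM_BAD     (1u << 1)     /* Hypothetical: and it was wrong */

/* Map the MAC-specific RDES1 status word to generic RX flags. */
static uint8_t example_rx_flags_from_rdes1(uint32_t rdes1)
{
  uint8_t flags = EXAMPLE_RXF_CSUM_CHECKED;

  if ((rdes1 & EMAC_RDES1_IPCE_MASK) != 0)
    {
      flags |= EXAMPLE_RXF_CSUM_BAD;
    }

  return flags;
}
```

A stack that sees `EXAMPLE_RXF_CSUM_CHECKED` could then skip its own software checksum for that packet.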

> It could be a solution, but we need to consider that:
>
> 1. Since it is impossible to forecast the incoming packet size, we have to allocate a full packet on the receiving side.
> 2. We need to consider the fragmentation problem and the housekeeping overhead.
> 3. How do we accumulate the transmit data to form the full packet without reallocating/copying?

  1. The MTU of the specific interface determines the allocation size, i.e. as shown in the table above: Ethernet 1518, WIFI 576, CAN2.0B 13.
  2. Something like O1Heap solves the fragmentation, admittedly at the expense of overhead, but that's the trade-off.
  3. See 1: allocate the MTU of the interface, maybe adding 4 bytes for a pointer to link them.

There really isn't a one-size-fits-all solution, but I want to make sure that we've got something that can easily be adapted to support all MAC controllers with DMA in NuttX.

So it would help if some authors of Ethernet MAC drivers could share their thoughts on whether there's a compatible solution that can be used in their driver.

@davids5 @gregory-nutt @acassis

xiaoxiang781216 commented 2 years ago

> > This special handling can be done before the netdev passes the IOB to the TCP/IP stack in the receive direction. The same can be done in the transmit direction, so why do we need a callback here?
>
> To be able to decode the eth_desc_s, which contains extra information such as packet type, timestamp, CRC, MAC filter, VLAN, and error status. Only the MAC driver knows what these fields mean.
>
> https://github.com/apache/incubator-nuttx/blob/5d12e350da31324e632ee7da3a3062b05212b74c/arch/arm/src/s32k3xx/hardware/s32k3xx_emac.h#L3076-L3082

Yes, I understand that the MAC layer needs an additional descriptor before the IP packet. But that processing can be handled directly in either the IRQ handler or the work callback before the data is passed to the TCP/IP stack. What I can't understand is why we need a callback from the TCP/IP stack to the netdev.

> > HW checksum is already supported; you can simply enable NET_ARCH_CHKSUM: https://github.com/apache/incubator-nuttx/blob/master/net/utils/Kconfig#L6-L26
>
> That API passes the data pointer to a hardware CRC accelerator. In the case of the S32K3XX this doesn't work, because the MAC itself has a built-in CRC checker that reports the result through the IPCE bit in eth_desc_s once the packet has been received.
>
> #define EMAC_RDES1_IPCE_MASK     (0x00000080u) /* IP Payload Error bit
>                                                 * IP payload checksum (that is, the TCP, UDP, or ICMP checksum)
>                                                 * calculated by the MAC does not match the corresponding checksum
>                                                 * field in the received segment. */

So the hardware can verify the checksum on receive but can't generate it on send? In that case, we may need to add a new option to disable the checksum for one direction, like CONFIG_NET_UDP_CHECKSUMS.

> > It could be a solution, but we need to consider that:
> >
> > 1. Since it is impossible to forecast the incoming packet size, we have to allocate a full packet on the receiving side.
> > 2. We need to consider the fragmentation problem and the housekeeping overhead.
> > 3. How do we accumulate the transmit data to form the full packet without reallocating/copying?
>
> 1. The MTU of the specific interface determines the allocation size, i.e. as shown in the table above: Ethernet 1518, WIFI 576, CAN2.0B 13.
> 2. Something like O1Heap solves the fragmentation, admittedly at the expense of overhead, but that's the trade-off.
> 3. See 1: allocate the MTU of the interface, maybe adding 4 bytes for a pointer to link them.
>
> There really isn't a one-size-fits-all solution, but I want to make sure that we've got something that can easily be adapted to support all MAC controllers with DMA in NuttX.

Sure. Another possible solution is to reuse the IOB chain: we can define a small IOB buffer size (the smallest MTU on the device) and link multiple IOBs for the bigger MTUs.

PetervdPerk-NXP commented 2 years ago

> So the hardware can verify the checksum on receive but can't generate it on send? In that case, we may need to add a new option to disable the checksum for one direction, like CONFIG_NET_UDP_CHECKSUMS.

TX checksum is done in the MAC as well. A solution for that is to enable NET_ARCH_CHKSUM for TX only and provide a dummy function, since the MAC itself will fill the dummy bytes with the checksum.

> But that processing can be handled directly in either the IRQ handler or the work callback before the data is passed to the TCP/IP stack. What I can't understand is why we need a callback from the TCP/IP stack to the netdev.

Most of it can be done in the IRQ handler, but the information about whether the TCP/UDP checksum was correct is then lost, hence the need for a callback. I guess we could also just drop the packet when we see the error, but then TCP/UDP wouldn't know that a corrupted packet had been received.

xiaoxiang781216 commented 2 years ago

> Most of it can be done in the IRQ handler, but the information about whether the TCP/UDP checksum was correct is then lost, hence the need for a callback. I guess we could also just drop the packet when we see the error, but then TCP/UDP wouldn't know that a corrupted packet had been received.

This can be handled by incrementing the counter in g_netstats directly in the IRQ/work handler.
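A minimal sketch of that drop-and-count approach follows. It uses a local stand-in structure rather than the real g_netstats, so the structure, its field names, and `example_rx_filter` are all assumptions for illustration; the actual statistics fields should be checked against the NuttX sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one protocol's counters in g_netstats. */
struct example_proto_stats_s
{
  uint32_t recv;   /* Packets passed up to the stack */
  uint32_t drop;   /* Packets dropped in the driver */
  uint32_t chkerr; /* Packets with a checksum error */
};

static struct example_proto_stats_s g_example_tcp_stats;

/* Intended to run in the IRQ/work handler. Returns true if the packet
 * should be passed on to the TCP/IP stack, false if it was dropped.
 */
static bool example_rx_filter(bool hw_csum_error)
{
  if (hw_csum_error)
    {
      g_example_tcp_stats.chkerr++;
      g_example_tcp_stats.drop++;
      return false;
    }

  g_example_tcp_stats.recv++;
  return true;
}
```

This keeps the error visible in the statistics even though the corrupted packet never reaches TCP/UDP.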

davids5 commented 2 years ago

@PetervdPerk-NXP - This is really nice to see. The Diagrams rock!

We should definitely be using scatter-gather DMA. The approach of using many little IOBs will put a high load on the number of statically allocated DMA descriptors (TCDs) needed (32 bytes each, and some non-net devices already need 4-6 per transaction). There may be a size for the built-in IOB data that strikes a good balance.

An alternative could be to have the IOB reference the data and carry a size, a used count, and maybe a next reference, rather than including the data allocation in the struct. These can be allocated at initialization time and remain static at run time. Pools can then be formed and used by the devices according to their MTU requirements. On starvation of a smaller pool, allocation could fail over to a bigger-MTU pool, with the size and used fields managing the return of the buffer to the correct pool.
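The pool idea might look roughly like the sketch below: an IOB that references its data, per-size free lists, allocation that fails over to the next bigger pool, and a free path that uses `size` to find the owning pool. All names are hypothetical, locking is omitted, and the pools are assumed to be sorted by ascending size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IOB that references its data instead of embedding it. */
struct example_iob_s
{
  uint8_t *data;              /* Statically allocated at init time */
  uint16_t size;              /* Capacity; also identifies the pool */
  uint16_t used;              /* Bytes currently in use */
  struct example_iob_s *next; /* Free-list or chain link */
};

struct example_pool_s
{
  uint16_t size;              /* Buffer size served by this pool */
  struct example_iob_s *free; /* Singly linked free list */
};

/* Take a buffer from the smallest pool that fits; fail over to a bigger
 * pool when the preferred one is starved. Returns NULL if none is free.
 */
static struct example_iob_s *
example_iob_alloc(struct example_pool_s *pools, size_t npools,
                  uint16_t need)
{
  size_t i;

  for (i = 0; i < npools; i++)
    {
      if (pools[i].size >= need && pools[i].free != NULL)
        {
          struct example_iob_s *iob = pools[i].free;

          pools[i].free = iob->next;
          iob->next = NULL;
          iob->used = 0;
          return iob;
        }
    }

  return NULL;
}

/* Return a buffer to the pool it came from, matched by its size. */
static void example_iob_free(struct example_pool_s *pools, size_t npools,
                             struct example_iob_s *iob)
{
  size_t i;

  for (i = 0; i < npools; i++)
    {
      if (pools[i].size == iob->size)
        {
          iob->next = pools[i].free;
          pools[i].free = iob;
          return;
        }
    }
}
```

Because `size` travels with the IOB, a CAN-sized request that was served from the Ethernet pool still goes back to the Ethernet pool on free.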