PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License
8.21k stars 13.38k forks source link

[Bug] Recent DMA change ed4814f6239097dde5eecf2b4fbd58661db84dda stm32h7/serial: Do not wait on TXDMA semaphore breaks a lot of things. #22399

Closed davids5 closed 7 months ago

davids5 commented 10 months ago

Describe the bug

With stm32h7/serial: Do not wait on TXDMA semaphore in, the console will not complete transmitting the complete output until a character is input on console.

gps status

sh> gps status
INFO  [gps] Main GPS
INFO  [gps] protocol: UBX
INFO  [gps] status: OK, port: /dev/ttyS0, baudrate: 115200
INFO  [gps] sat info: disabled
INFO  [gps] rate reading:                  817 B/s
INFO  [gps] rate position:                5.00 Hz
INFO  [gps] rate velocity:                5.00 Hz
INFO  [gps] rate publication:             5.00 Hz
INFO  [gps] rate RTCM injection:          0.00 Hz
 sensor_gps
    timestamp: 402227138 (0.141411 seconds ago)

\< hit a key >

    timestamp_sample: 0
    latitude_deg: 27.589204
    longitude_deg: -80.435646
    altitude_msl_m: 21.876000
    altitude_ellipsoid_m: -6.741000
    time_utc_usec: 1700158702599596
    device_id: 11010053 (Type: 0xA8, SERIAL:0 (0x00))
    s_variance_m_s: 0.91500
    c_variance_rad: 0.77875
    eph: 4.81900
    epv: 6.55500
    hdop: 1.67000
    vdop: 2.80000
    noise_per_ms: 98
    jamming_indicator: 47
    vel_m_s: 0.04700
    vel_n_m_s: 0.02800
    vel_e_m_s: -0.03800
    vel_d_m_s: 0.00200
    cog_rad: 1.72410
    timestamp_time_relative: 0
    heading: nan
    heading_offset: 0.00000
    heading_accuracy: 0.00000
    rtcm_injection_rate: 0.00000
    automatic_gain_control: 780
    fix_type: 3
    jamming_state: 0
    spoofing_state: 1
    vel_ned_valid: True
    satellites_used: 8
    selected_rtcm_instance: 0
    rtcm_crc_failed: False
    rtcm_msg_used: 0

To Reproduce

see above

Expected behavior

not extra input needed

Screenshot / Media

No response

Flight Log

NA

Software Version

main with nuttx that has ed4814f6239097dde5eecf2b4fbd58661db84dda

Flight controller

all H7 with TX DMA

Vehicle type

None

How are the different components wired up (including port information)

No response

Additional context

No response

davids5 commented 10 months ago

I have have also seen where gps driver on ports that have TXDMA will not work. (v6xrt)

AlexKlimaj commented 10 months ago

I am seeing the same thing on the ARKV6X.

PetervdPerk-NXP commented 10 months ago

I think the TXDMA semaphore change is wrong and should be reverted the way it was. The write() function by posix definition can block indefinitely. https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

The issue #22302 tried to fix was not a problem in the NuttX kernel. But a problem in the mavlink application itself since it opens the uart in blocking mode. https://github.com/PX4/PX4-Autopilot/blob/62027b09655c63dd71808481c30ed397fc25aa52/src/modules/mavlink/mavlink_main.cpp#L587

My proposed solution is that mavlink open the uart using O_NONBLOCK instead. And should handles the EAGAIN return code.

julianoes commented 10 months ago

Can someone point me to a commit where things are still sane?

Edit: looks like main is ok again?

davids5 commented 10 months ago

no.

These break stuff. Working on a fix,,,

[BACKPORT] s32k1xx:Serial Do not wait on TXDMA semaphore

[BACKPORT] s32k3xx:Serial Do not wait on TXDMA semaphore

[BACKPORT] imxrt:Serial Do not wait on TXDMA semaphore

[BACKPORT] stm32h7/serial: Do not wait on TXDMA semaphore

niklaut commented 10 months ago

But then should the implementation not be:

  1. UART configured as blocking -> nxsem_wait.
  2. UART configured as non-blocking -> nxsem_trywait.
PetervdPerk-NXP commented 10 months ago

up_dma_txavailable should get a return value and the NuttX serial upper driver should be extended as well to feedback EAGAIN to userspace. https://github.com/PX4/NuttX/blob/e7da5ac0e660238cb353948b2a9118579727267a/drivers/serial/serial.c#L293 https://github.com/PX4/NuttX/blob/e7da5ac0e660238cb353948b2a9118579727267a/drivers/serial/serial.c#L494 https://github.com/PX4/NuttX/blob/e7da5ac0e660238cb353948b2a9118579727267a/drivers/serial/serial.c#L1261

Edit: Maybe we don't even need EAGAIN but I'm afraid we lose data if we don't. Need some more investigation.

davids5 commented 10 months ago

I need to review the sequence for the Transmit.

davids5 commented 10 months ago

I have a nuttx branch with

The stm32h7 fix: https://github.com/PX4/NuttX/commit/c696d5e1b0a98f1d976467576f6d93ff8ce61423 The imx: fix: https://github.com/PX4/NuttX/commit/2de00138371eb634de22fa708234006d78b2ea13 https://github.com/PX4/NuttX/commit/52c5524a44a7ff4c33d33865dc312510adea21a5 https://github.com/PX4/NuttX/commit/8cae36b737e0d3156691781fdfcfb17af8c26e41

The console requiring a second key press is fixed. serial_test runs at 3Mbps and with https://github.com/PX4/PX4-GPSDrivers/pull/143 the imxrt can use TX DMA again.

@niklaut can you pull in the stm32h7 and test the original issue with mavlink?

beniaminopozzan commented 10 months ago

As @dakejahl mentioned, this issue is also heavly affecting the uxrce_dds_client. After #22302 the RTT of the xrce-dds communication with a companion computer increased from 1.5ms to 20ms (testing on a Raspberry Pi 4, serial hardware on the raspberry (serial5), CUAV pixhawk 6x, TELEM1 used on the CUAV at 921600b/s). 20ms of RTT makes the time synchronization impossible.

@davids5 , I tested your nuttx branch and it definitively improves but it does not remove entirely the issue as the RTT is now around 6ms.

davids5 commented 10 months ago

As @dakejahl mentioned, this issue is also heavly affecting the uxrce_dds_client. After #22302 the RTT of the xrce-dds communication with a companion computer increased from 1.5ms to 20ms (testing on a Raspberry Pi 4, serial hardware on the raspberry (serial5), CUAV pixhawk 6x, TELEM1 used on the CUAV at 921600b/s). 20ms of RTT makes the time synchronization impossible.

@davids5 , I tested your nuttx branch and it definitively improves but it does not remove entirely the issue as the RTT is now around 6ms.

@beniaminopozzan am I following that if you just revert the H7 commit that created the problem you get some timing that 1.5ms. With that commit it goes to to 20 and then with px4_firmware_nuttx-10.3.0%2B-serial-fixes-testing it is 6ms?

beniaminopozzan commented 10 months ago

As @dakejahl mentioned, this issue is also heavly affecting the uxrce_dds_client. After #22302 the RTT of the xrce-dds communication with a companion computer increased from 1.5ms to 20ms (testing on a Raspberry Pi 4, serial hardware on the raspberry (serial5), CUAV pixhawk 6x, TELEM1 used on the CUAV at 921600b/s). 20ms of RTT makes the time synchronization impossible. @davids5 , I tested your nuttx branch and it definitively improves but it does not remove entirely the issue as the RTT is now around 6ms.

@beniaminopozzan am I following that if you just revert the H7 commit that created the problem you get some timing that 1.5ms. With that commit it goes to to 20 and then with px4_firmware_nuttx-10.3.0%2B-serial-fixes-testing it is 6ms?

@davids5 , that's correct!

davids5 commented 10 months ago

@beniaminopozzan can you describe the setup (wiring and commands) so I can reproduce the issue?

beniaminopozzan commented 10 months ago

@davids5 Of course

Raspberry Pi 4 Setup

PX4 setup

Parameter value
UXRCE_DDS_CFG TELEM 1
MAV_0_CONFIG Disabled
SER_TEL1_BAUD 921600

Hardware connections

The CUAV TELEM1 is directly connected to uart4 of the Raspberry. The Rasberry RX4 is on PIN21 while TX4 is on PIN24.

Testing procedure

  1. Power on the CUAV
  2. Power on the Raspberry and start the microXRCE-DDS agent. On my setup the command is MicroXRCEAgent serial --dev /dev/ttyAMA1 -b 921600
  3. On a new terminal, echo the /fmu/out/timesync_status ROS2 topic and look for the round_trip_time field. It's value is in microseconds.

Altenative setup

Potentially, you can use any usb to serial adapter so that the raspberry is not neeeded. However I always found the quite unreliable.

niklaut commented 9 months ago

@beniaminopozzan You're right, there are differences in timing even on an unconnected TELEM1: https://github.com/PX4/NuttX/pull/285#issuecomment-1825948853.

DronecodeBot commented 9 months ago

This issue has been mentioned on Discussion Forum for PX4, Pixhawk, QGroundControl, MAVSDK, MAVLink. There might be relevant details there:

https://discuss.px4.io/t/gps2-on-cube-orange-mini-mapping-possibly-incorrect/35512/6

beniaminopozzan commented 9 months ago

@davids5 any news on this?

AlexKlimaj commented 9 months ago

Looks like the two commits still haven't fixed this. Run it long enough and TX gets stuck.

AlexKlimaj commented 8 months ago

I thought removing O_NONBLOCK from the dds client driver fixed it but it just looks like the whole driver locks up when the serial port freezes.

With O_NONBLOCK, the driver thinks its sending data but nothing gets received by the agent. RX into the client is still fully working.

Without O_NONBLOCK, the entire driver locks up.

davids5 commented 8 months ago

I thought removing O_NONBLOCK from the dds client driver fixed it but it just looks like the whole driver locks up when the serial port freezes.

With O_NONBLOCK, the driver thinks its sending data but nothing gets received by the agent. RX into the client is still fully working.

Without O_NONBLOCK, the entire driver locks up.

How many threads is it? Maye you can walk me through that you have learned from uxrce_dds_client/uxrce_dds_client.cpp.

davids5 commented 8 months ago

I thought removing O_NONBLOCK from the dds client driver fixed it but it just looks like the whole driver locks up when the serial port freezes. With O_NONBLOCK, the driver thinks its sending data but nothing gets received by the agent. RX into the client is still fully working. Without O_NONBLOCK, the entire driver locks up.

How many threads is it? Maye you can walk me through that you have learned from uxrce_dds_client/uxrce_dds_client.cpp.

@AlexKlimaj can you scope TX, RX and the RTS, CTS lines at the point of failure and post it here?

AlexKlimaj commented 8 months ago

I'm testing using mavlink now instead of DDS. Running into the same disconnection problem. It will run for a while, then lock up.

AlexKlimaj commented 8 months ago

I can scope next week but I just tested current main with telem 2 on mavlink and it barely runs for 10 minutes without disconnecting. Main with the dma commits reverted has been running for 4 hours

davids5 commented 7 months ago

@beniaminopozzan The serial DMA issue was fixed with https://github.com/PX4/PX4-Autopilot/pull/22667. Please test current master. If this is still and issue please reopen this or post a new issues with what you are seeing.