PX4 / PX4-Autopilot

Tailsitter SITL Test Failure #21229

Open junwoo091400 opened 1 year ago

junwoo091400 commented 1 year ago

Describe the bug

[  19.039|mavsdk_tests] [11:17:18|Debug] MAVLink: critical: Preflight Fail: vertical velocity unstable (system_impl.cpp:242)
[  19.039|mavsdk_tests] [11:17:18|Debug] MAVLink: critical: Preflight Fail: Attitude failure (roll) (system_impl.cpp:242)
[  19.039|mavsdk_tests] [11:17:18|Debug] MAVLink: info: Preflight Fail: No manual control input  (system_impl.cpp:242)
[  19.039|mavsdk_tests] Current speed factor: 1.99231 (set: 20)
[  19.652|px4       ] WARN  [health_and_arming_checks] Preflight Fail: vertical velocity unstable
[  19.652|px4       ] WARN  [health_and_arming_checks] Preflight Fail: height estimate error
[  19.652|px4       ] WARN  [health_and_arming_checks] Preflight Fail: Attitude failure (roll)
[  19.652|px4       ] INFO  [health_and_arming_checks] Preflight Fail: No manual control input  
[  19.880|mavsdk_tests] [11:17:19|Debug] MAVLink: critical: Preflight Fail: vertical velocity unstable (system_impl.cpp:242)
[  19.880|mavsdk_tests] [11:17:19|Debug] MAVLink: critical: Preflight Fail: height estimate error (system_impl.cpp:242)
[  19.880|mavsdk_tests] [11:17:19|Debug] MAVLink: critical: Preflight Fail: Attitude failure (roll) (system_impl.cpp:242)
[  19.880|mavsdk_tests] [11:17:19|Debug] MAVLink: info: Preflight Fail: No manual control input  (system_impl.cpp:242)
[  19.880|mavsdk_tests] [11:17:19|Info ] Timeout, connected to vehicle but waiting for test for 20.1 seconds
[  19.880|mavsdk_tests] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[  19.880|mavsdk_tests] mavsdk_tests is a Catch v2.13.8 host application.
[  19.880|mavsdk_tests] Run with -? for options
[  19.880|mavsdk_tests] -------------------------------------------------------------------------------
[  19.880|mavsdk_tests] Fly forward in altitude control
[  19.880|mavsdk_tests] -------------------------------------------------------------------------------
[  19.880|mavsdk_tests] ../../../test/mavsdk_tests/test_multicopter_manual.cpp:50
[  19.880|mavsdk_tests] ...............................................................................
[  19.880|mavsdk_tests] ../../../test/mavsdk_tests/autopilot_tester.cpp:97: FAILED:
[  19.880|mavsdk_tests]   CHECK( poll_condition_with_timeout( [this]() { return _telemetry->health().is_armable; }, std::chrono::seconds(20)) )
[  19.880|mavsdk_tests] with expansion:
[  19.880|mavsdk_tests]   false
[  19.886|mavsdk_tests] [11:17:19|Info ] Waiting to get home position
[  20.009|mavsdk_tests] [11:17:19|Warn ] command temporarily rejected (400). (mavlink_command_sender.cpp:205)
[  20.010|mavsdk_tests] ../../../test/mavsdk_tests/autopilot_tester.cpp:169: FAILED:
[  20.010|mavsdk_tests]   REQUIRE( result == Action::Result::Success )
[  20.010|mavsdk_tests] with expansion:
[  20.010|mavsdk_tests]   Command Denied == Success
[  20.057|mavsdk_tests] ===============================================================================

Example failure run: https://github.com/PX4/PX4-Autopilot/actions/runs/4299642479/jobs/7495070876

The tailsitter SITL test has been failing for about two weeks.

Diagnosis

The failure seems to be caused by preflight check failures (unstable vertical velocity, height estimate error, etc.).

I also noticed this on a real drone running the main branch: the preflight failures occurred easily and often, even with the GPS sensor well positioned outdoors. So I suspect it is a PX4 estimation issue.

Resource

junwoo091400 commented 1 year ago

Note: it is interesting that https://github.com/PX4/PX4-Autopilot/pull/21190 passed the SITL test, and it is indeed somewhat related to the tailsitter. But this is most likely an unrelated coincidence.

As expected, it did fail a SITL Test run afterwards: https://github.com/PX4/PX4-Autopilot/actions/runs/4261478422/jobs/7415849888#step:17:607

junwoo091400 commented 1 year ago

Currently troubleshooting by building and running the MAVSDK tests locally:

PX4_CMAKE_BUILD_TYPE=RelWithDebInfo test/mavsdk_tests/mavsdk_test_runner.py --speed-factor 20 --abort-early --model tailsitter test/mavsdk_tests/configs/sitl.json --verbose

junwoo091400 commented 1 year ago

  1. The failure happens because _telemetry->health().is_armable does not become true within the 20-second timeout. In other words, the vehicle isn't armable because of the preflight failures.
  2. When testing with standard_vtol, the preflight errors don't occur and the MAVSDK test runs successfully. So the bug is specific to the tailsitter.
  3. With QGC connected, I can see that the tailsitter's yaw estimate keeps rotating, which signals that the state estimation is wrong.
  4. Furthermore, the attitude shown by QGC doesn't make much sense (Screenshot from 2023-03-01 18-51-35).

This leads me to conclude that state estimation for tailsitter is broken. @Jaeyoung-Lim @bresch could you check what went wrong on this?
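
The check that times out in item 1 follows a poll-with-timeout pattern. A minimal Python sketch of that pattern is shown below. It is not the C++ helper from autopilot_tester.cpp: it uses wall-clock time, ignores the simulation speed factor, and the polling interval is an assumption.

```python
import time
from typing import Callable


def poll_condition_with_timeout(condition: Callable[[], bool],
                                timeout_s: float,
                                poll_interval_s: float = 0.1) -> bool:
    """Return True as soon as condition() holds, False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        # Sleep briefly so the loop does not spin at full CPU.
        time.sleep(poll_interval_s)
    return False
```

In the failing test the condition is the is_armable health flag, and the function returning False after 20 seconds is what produces the CHECK failure in the log.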

junwoo091400 commented 1 year ago

However, note that when I start the tailsitter sim standalone with make px4_sitl gazebo-classic_tailsitter, the weird attitude estimation doesn't appear.

This means that the MAVSDK test script is somehow messing up the tailsitter simulation.

junwoo091400 commented 1 year ago

I can confirm that when I follow the model spawning workflow defined in https://github.com/PX4/PX4-Autopilot/blob/main/test/mavsdk_tests/process_helper.py, the spawned model shows erratic attitude estimation behavior.

Thus, the key to finding where the estimation goes wrong is the difference between the standard SITL process (make px4_sitl gazebo-classic_tailsitter) and the steps below.

Steps:

  1. Start PX4 instance: PX4_SIM_MODEL=gazebo-classic_tailsitter /home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/bin/px4 /home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/etc -s etc/init.d-posix/rcS -t /home/junwoo/Coding/PX4-Autopilot/test_data -d
  2. Start Gazebo server: GAZEBO_PLUGIN_PATH=/home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/build_gazebo-classic/ GAZEBO_MODEL_PATH=/home/junwoo/Coding/PX4-Autopilot/Tools/simulation/gazebo-classic/sitl_gazebo-classic/models/ stdbuf -o0 -e0 gzserver --verbose /home/junwoo/Coding/PX4-Autopilot/Tools/simulation/gazebo-classic/sitl_gazebo-classic/worlds/empty.world
  3. Spawn tailsitter model: GAZEBO_MODEL_PATH=/home/junwoo/Coding/PX4-Autopilot/Tools/simulation/gazebo-classic/sitl_gazebo-classic/models/ stdbuf -o0 -e0 gz model --verbose --spawn-file /home/junwoo/Coding/PX4-Autopilot/Tools/simulation/gazebo-classic/sitl_gazebo-classic/models/tailsitter/tailsitter.sdf --model-name tailsitter -x 1.01 -y 0.98 -z 0.83
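
The three manual steps above can also be assembled programmatically, which makes it easier to swap their order while narrowing down the cause. This is only a sketch: the Step tuple and both helper functions are hypothetical names, not part of process_helper.py, and px4_dir stands in for the local checkout path.

```python
import posixpath
import shlex
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    """One process to launch: a label, extra environment variables, and argv."""
    name: str
    env: Dict[str, str]
    argv: List[str]


def manual_spawn_steps(px4_dir: str, model: str = "tailsitter") -> List[Step]:
    """Build the three commands from the steps above, in the order listed there."""
    build = posixpath.join(px4_dir, "build", "px4_sitl_default")
    gz = posixpath.join(px4_dir, "Tools", "simulation", "gazebo-classic",
                        "sitl_gazebo-classic")
    models = posixpath.join(gz, "models") + "/"
    return [
        Step("px4",
             {"PX4_SIM_MODEL": f"gazebo-classic_{model}"},
             [posixpath.join(build, "bin", "px4"), posixpath.join(build, "etc"),
              "-s", "etc/init.d-posix/rcS",
              "-t", posixpath.join(px4_dir, "test_data"), "-d"]),
        Step("gzserver",
             {"GAZEBO_PLUGIN_PATH": posixpath.join(build, "build_gazebo-classic") + "/",
              "GAZEBO_MODEL_PATH": models},
             ["stdbuf", "-o0", "-e0", "gzserver", "--verbose",
              posixpath.join(gz, "worlds", "empty.world")]),
        Step("spawn_model",
             {"GAZEBO_MODEL_PATH": models},
             ["stdbuf", "-o0", "-e0", "gz", "model", "--verbose",
              "--spawn-file", posixpath.join(gz, "models", model, f"{model}.sdf"),
              "--model-name", model, "-x", "1.01", "-y", "0.98", "-z", "0.83"]),
    ]


def as_shell_line(step: Step) -> str:
    """Render a step as a copy-pasteable shell command with its env prefixes."""
    env = " ".join(f"{k}={shlex.quote(v)}" for k, v in step.env.items())
    return f"{env} {shlex.join(step.argv)}"
```

Calling as_shell_line on each element of manual_spawn_steps("/home/junwoo/Coding/PX4-Autopilot") reproduces the command lines listed in the steps.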

junwoo091400 commented 1 year ago

Further analysis:

  1. Starting the PX4 instance and the Gazebo server as shown in the manual steps above, then executing make px4_sitl gazebo-classic_tailsitter, results in weird attitude estimation and the vehicle jumping around. This means that either the PX4 instance or the Gazebo server is causing this behavior (as the model spawning was done by the standard script).
  2. Starting the PX4 instance manually, then running make px4_sitl gazebo-classic_tailsitter, also results in erratic state estimation (which does settle after a bit). This requires removing this line: https://github.com/PX4/PX4-Autopilot/blob/bde194fb12342e4c115ce5a88be7ba839509442e/Tools/simulation/gazebo-classic/sitl_run.sh#L173. Overall, this means that the Gazebo server spawning part of the MAVSDK script isn't causing the problem.
  3. This leads to the conclusion that the PX4 instantiation is the problem. However, when I start the PX4 instance exactly as it is done in https://github.com/PX4/PX4-Autopilot/blob/bde194fb12342e4c115ce5a88be7ba839509442e/Tools/simulation/gazebo-classic/sitl_run.sh#L154: PX4_SIM_MODEL=gazebo-classic_tailsitter PX4_SIM_WORLD=none /home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/bin/px4 /home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/etc, the erratic state estimation is still there. So none of the individual steps (PX4 instance / Gazebo server / Gazebo model spawning) seems to cause this behavior on its own; it must be something else.

junwoo091400 commented 1 year ago

Probably not related, but the only difference I could find between the MAVSDK test script and the native SITL build command is that the contents of tmp_mavsdk_tests/rootfs (used as the current working directory of the PX4 instance in the MAVSDK test) and rootfs (used for the standard PX4 SITL build) differ.

Screenshot from 2023-03-01 19-52-39

In particular, the parameter set seems to differ for some reason.

However, this can't be called the cause: I wasn't using any of the tmp_mavsdk_tests/rootfs contents when I ran the PX4 instance manually, and I still got the weird behavior. So it is probably unrelated, but still an interesting difference. @julianoes do you know why the contents of these two rootfs would differ? Resolved in comment below.

junwoo091400 commented 1 year ago

Decoding the bson parameter file showed that the only difference is that 'COM_DL_LOSS_T': 200, 'COM_OBC_LOSS_T': 100.0, 'COM_OF_LOSS_T': 10.0, 'COM_RC_LOSS_T': 10.0 are present only in the MAVSDK rootfs.

These are presumably set via the MAVSDK commands inside the test script, so this rootfs difference isn't the reason why the state estimation fails.
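
A comparison like this can be done with a plain dictionary diff once both parameter files are decoded. The sketch below assumes the decoding has already happened (it is not shown), and EXAMPLE_SHARED is a made-up placeholder for the parameters common to both files. Only the four COM_* values come from the output above.

```python
from typing import Any, Dict, Tuple


def diff_params(a: Dict[str, Any], b: Dict[str, Any]
                ) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Tuple[Any, Any]]]:
    """Return (only in a, only in b, present in both but with different values)."""
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, changed


# Hypothetical stand-in for the parameters that both rootfs folders share.
standard_rootfs = {"EXAMPLE_SHARED": 1}
mavsdk_rootfs = {
    "EXAMPLE_SHARED": 1,
    "COM_DL_LOSS_T": 200,
    "COM_OBC_LOSS_T": 100.0,
    "COM_OF_LOSS_T": 10.0,
    "COM_RC_LOSS_T": 10.0,
}

only_mavsdk, only_standard, changed = diff_params(mavsdk_rootfs, standard_rootfs)
```

With these inputs, only_mavsdk holds the four COM_* timeouts while only_standard and changed are empty, matching the observation that nothing else differs.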

junwoo091400 commented 1 year ago

3. This leads to the conclusion that the PX4 instantiation is the problem.

I can confirm that the PX4 instantiation process is the problem: when I start the Gazebo server and spawn the model manually (which I hadn't tried in any of the steps above), and only let make px4_sitl_default gazebo-classic_tailsitter start the PX4 process, the estimation error doesn't occur.

It is just not clear how starting the PX4 process manually differs from starting it via the sitl_run.sh script.

junwoo091400 commented 1 year ago

I was able to narrow down the problem to being the 'order of PX4 instance & Gazebo server being started'.

I was sure that the command PX4_SIM_MODEL=gazebo-classic_tailsitter /home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/bin/px4 /home/junwoo/Coding/PX4-Autopilot/build/px4_sitl_default/etc was the same as the one used by https://github.com/PX4/PX4-Autopilot/blob/main/Tools/simulation/gazebo-classic/sitl_run.sh. But then it occurred to me that, in order to keep the make command from starting the px4 instance, I was always starting the manual px4 instance FIRST, before the Gazebo server.

So I added an exit line here: https://github.com/PX4/PX4-Autopilot/blob/bde194fb12342e4c115ce5a88be7ba839509442e/Tools/simulation/gazebo-classic/sitl_run.sh#L148, ran make px4_sitl_default gazebo-classic_tailsitter, and then started the PX4 instance manually. And bam, the weird attitude behavior was gone.

Now I am certain that the problem was this startup order:

  1. Starting PX4 instance
  2. Starting Gazebo Server

And probably PX4 was freaking out when it had no data stream & when it received long after it got started, the estimation went crazy.

This is certainly not desirable behavior (we don't want the order in which the simulator and the px4 instance are started to affect estimation), but at least the fix is clear: start the simulation first in the MAVSDK test!
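
One way to make the startup order explicit is to start each process only after the previous one reports ready. The sketch below is an assumption about how that could look, not how mavsdk_test_runner.py or process_helper.py actually do it: the runner tuples, the start callbacks and the readiness checks are all hypothetical placeholders.

```python
import time
from typing import Callable, List, Sequence, Tuple

# (name, start callback, readiness check); all three are supplied by the caller.
Runner = Tuple[str, Callable[[], None], Callable[[], bool]]


def start_in_order(runners: Sequence[Runner],
                   timeout_s: float = 10.0,
                   poll_s: float = 0.05) -> List[str]:
    """Start runners one by one, waiting for each to be ready before the next."""
    started: List[str] = []
    for name, start, is_ready in runners:
        start()
        deadline = time.monotonic() + timeout_s
        while not is_ready():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} was not ready after {timeout_s} s")
            time.sleep(poll_s)
        started.append(name)
    return started
```

Passing the runners as gzserver, model spawn, px4 (in that order) encodes the order that made the erratic estimate disappear, with px4 only launched once the simulator side is up.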

junwoo091400 commented 1 year ago

With the order fix, 3 out of the 4 tailsitter test cases pass!

For the 'VTOL Mission' case, it seems that Failure detector thinks while the VTOL transitions into FW, the maximum pitch (60 degrees) is reached, whereas in reality I think it's just a body attitude conversion hiccup (as the tailsitter needs to rotate its frame of reference by 90 deg as it transitions). @sfuhrer any analysis on this?

Interestingly, the SITL Test action that succeeded also had the pitch failure detector issue. So I am wondering why in my local MAVSDK Test the VTOL entered failsafe & landed :thinking:

This is a separate issue, and isn't related to the MAVSDK test failure case itself.

Log: https://logs.px4.io/plot_app?log=95fcd187-babd-4a9e-9a0f-748c1865d5e1

julianoes commented 1 year ago

And probably PX4 was freaking out when it had no data stream & when it received long after it got started, the estimation went crazy.

That sounds odd. I would suggest we investigate that as well though. It shouldn't depend and freak out otherwise.

junwoo091400 commented 1 year ago

And probably PX4 was freaking out when it had no data stream & when it received long after it got started, the estimation went crazy.

That sounds odd. I would suggest we investigate that as well though. It shouldn't depend and freak out otherwise.

I agree! Also, it's quite weird that this happens more frequently with the tailsitter model :thinking:

junwoo091400 commented 1 year ago

For the 'VTOL Mission' case, it seems that Failure detector thinks while the VTOL transitions into FW, the maximum pitch (60 degrees) is reached

Note, I suspected it may be somehow related to https://github.com/PX4/PX4-Autopilot/pull/20904, but the SITL test failure is resulting from a commit with that PR included, so I don't think that's the cause

junwoo091400 commented 1 year ago

As you asked during the FW call, @dagar: this test from my local environment shows the GNSS error that Jay pointed out. So yes, the local and CI errors have the same cause.

image

Silvan also suggested testing with multi EKF disabled, but I haven't done that yet 😞

junwoo091400 commented 1 year ago

The failure is happening again (e.g. here), so I re-opened the issue.

The failure mode exhibits 2 traits:

  1. Only happens in mission mode (taken from end of the MAVSDK Test Run CI output):
    • 'Takeoff and Land': succeeded
    • 'Fly forward in position control': succeeded
    • 'Fly forward in altitude control': succeeded
    • 'Fly VTOL mission': failed
  2. Preflight fails on vertical velocity / height estimate, plus attitude failure (roll)
    • Note: the "Airspeed selector module down" and "ekf2 missing data" preflight failures also occur in the other test cases (not only mission), so they probably aren't what causes the failure.
[   4.331|px4       ] WARN  [health_and_arming_checks] Preflight Fail: Airspeed selector module down
[   4.331|px4       ] WARN  [health_and_arming_checks] Preflight Fail: ekf2 missing data
...
[   4.832|px4       ] WARN  [health_and_arming_checks] Preflight Fail: vertical velocity unstable
[   4.832|px4       ] WARN  [health_and_arming_checks] Preflight Fail: height estimate error
[   4.832|px4       ] WARN  [health_and_arming_checks] Preflight Fail: Attitude failure (roll)
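
To compare runs, the px4 preflight lines can be reduced to the first time each failure message appears. A minimal sketch, assuming the log line layout shown above: the regular expression and the function name are illustrative, and they only cover the px4 health_and_arming_checks lines, not the mavsdk_tests ones.

```python
import re
from typing import Dict, Iterable

# Matches e.g. "[   4.331|px4       ] WARN  [health_and_arming_checks] Preflight Fail: ..."
PREFLIGHT_RE = re.compile(
    r"^\[\s*(?P<t>\d+\.\d+)\|(?P<src>\w+)\s*\]\s+(?P<level>WARN|INFO)\s+"
    r"\[health_and_arming_checks\]\s+Preflight Fail:\s*(?P<msg>.*?)\s*$"
)


def first_seen(lines: Iterable[str]) -> Dict[str, float]:
    """Map each preflight failure message to the first timestamp it was logged at."""
    seen: Dict[str, float] = {}
    for line in lines:
        match = PREFLIGHT_RE.match(line)
        if match is None:
            continue  # not a px4 preflight line (e.g. the "..." elision)
        seen.setdefault(match["msg"], float(match["t"]))
    return seen


log = """\
[   4.331|px4       ] WARN  [health_and_arming_checks] Preflight Fail: Airspeed selector module down
[   4.331|px4       ] WARN  [health_and_arming_checks] Preflight Fail: ekf2 missing data
...
[   4.832|px4       ] WARN  [health_and_arming_checks] Preflight Fail: vertical velocity unstable
[   4.832|px4       ] WARN  [health_and_arming_checks] Preflight Fail: height estimate error
[   4.832|px4       ] WARN  [health_and_arming_checks] Preflight Fail: Attitude failure (roll)
""".splitlines()

failures = first_seen(log)
```

Running the same function over a passing and a failing run makes it easy to see which messages appear only in the failing one.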

The failure mode is slightly different from before https://github.com/PX4/PX4-Autopilot/pull/21319 was merged, in the sense that before that PR the tailsitter would fail during the transition, but now it fails to even arm.

junwoo091400 commented 1 year ago

image

Seems like there's a huge Z position / velocity estimate error! @bresch any initial guess on what's wrong?

Log is from the failed CI run's output

bresch commented 1 year ago

@junwoo091400 Yes, I'm aware of that bug. It is caused by a bad initial attitude (the tailsitter is like a multirotor in hover mode, roll and pitch should be 0). I cannot reproduce it in replay, even with the CI logs (ekf2 replay was enabled to investigate that bug). It might be due to an initial sample that we don't log but that is used by CI while running the test. The interesting thing is that it only occurs on the tailsitter CI and not every time (if you restart the test, it will most likely not fail again).

junwoo091400 commented 1 year ago

It might be due to an initial sample that we don't log but that is used by CI while running the test

Which sample are you referring to here? Sensor data? Or the initial attitude estimate?

bresch commented 1 year ago

Initial sensor data. We might want to be a bit more careful with the measurements we're using while initializing the EKF