Potential bug for EKF2 - Githubissues

Changliu52 commented 6 years ago

Hi @mhkabir @dagar @priseborough,

Environment: 1.7.3 generic quadrotor x, with the non-standard EKF2 parameters as below: EKF2_AID_MASK: vision position fusion + vision yaw fusion EKF2_EV_DELAY: 0.0ms EKF2_HGT_MODE: Vision

When flying, I was using an onboard SLAM to produce vision pose. at about 5:01:407 (same time as the warning message: [lib_ecl] EKF ev hgt timeout - reset to baro), the drone shooted up and crashed into the ceiling. log here: https://review.px4.io/plot_app?log=cd3735b0-db2a-4ada-b7a1-240b5ef63c37

Several points I don't understand here:

for local_position.z, when it decided to use baro at 5:01:407, estimator_status_0.states[9] is reset, but vehicle_local_position_0.z is not reset, and still stuck with the complementary filter as below capture. This results in hug negative local_position.z value, and high positive local_position.vz value. The drone think itself quickly dropping thus give almost max thrust to climb. Do local_position.z and local_position.vz need to be reset at the same time?
for some reason, vision_pose_estimation was rejected before triggering the failsafe at 4:57:408 and reset local_position.x/y to zero (see the capture below). I don't understand the reason for that, because it doesn't really match the statement in the observation errors section in link: https://dev.px4.io/en/tutorials/tuning_the_ecl_ekf.html, because the innovation_check_flags only showed the rejection of altitude measurement (4) as shown in the capture

Please help, and I believe the SLAM outputs quite decent measurement, maybe could your share some experiance to avoid the mysterious reset of the local_position.x/y?

Thanks in advance. Kind Regards, Chang

mhkabir commented 6 years ago

@priseborough could you please have a look?

mhkabir commented 6 years ago

@Changliu52 Possible fix here : https://github.com/PX4/ecl/pull/418. Could you please test?

Changliu52 commented 6 years ago

Thanks for the update @mhkabir @priseborough I have just done a handheld test using the current master firmeare (1.7.4dev) with the ecl fix. flight review here https://review.px4.io/plot_app?log=e015fdb5-ab7f-471d-90d6-ab8c05d42a47 x local position fusion shown here (updated)

z local position fusion shown here

The improvements are:

in the event of reset, both local position and ekf states are reset correctly and making a lot more sense. So my first point seems to be solved.
innovation_check_flags seems to went to different values, mostly 64 (x1000000) and and sometimes 68 (x1000100), although still doesn't make sense from https://github.com/PX4/Firmware/blob/master/msg/estimator_status.msg#L69.

Still problems are: As shown clearly in both graphs, for some reason, the filter still not happy with the vision pose and rejecting very frequently. I am sure its not the timeout, so I ploted the vel_pos_innov_var[3] and pos_test_ratio in the figure below, but the rejection was triggered regardless of the comparison of the two values. (Updated)

I have also tested with the LPE in 1.7.3, which flies the drone pretty smoothly without any problem.

Let me know if you have any thoughs.

Kind Regards, Chang

Changliu52 commented 6 years ago

Also I have done a test for vertial failsafe. I disabled the vision during 'flight', trying to trigger the failsafe switch from vision to baro. The switch is seamless.

priseborough commented 6 years ago

I will have a look at your last log.

One possibility to consider is time stamping. The ekf2 runs on a delayed tine horizon and uses the time stamp of the observation to retrieve data so that everything gets fused at a consistent time.

However this can be a problem with offboard data because the data buffers are only large enough to store data in a window that extends from current time back to the fusion time horizon which is typically 100 to 200 msec wide depending on the value of the largest time delay parameter. If time stamping is out by more than 50msec or input data is coming in at too fast a rate it can result in observations being overwritten in the buffer by new ones before they are used by the EKF.

mhkabir commented 6 years ago

For such cases where we set EKF2_EV_DELAY to 0 to use device timestamps for the buffer lookup, I think we should introduce a EKF2_DELAY_MAX parameter to set the size of the buffer, since otherwise it will most likely be smaller than needed for vision (usually around 200ms)

@priseborough What do you think?

priseborough commented 6 years ago

The same result can be achieved by setting the time delay for an unused sensor to a larger value. Looking at the parameters saved with the log, the GPS delay parameter is currently the largest at 110 msec. If you want to try a larger buffer length, set EKF2_ASP_DELAY to 200 which will increase the length of all buffers.

We can add a catch all parameter in the future.

@Changliu52 What is the data rate of your SLAM data?

Edit: It looks like it is around 5 Hz but with a lot of timing jitter.

priseborough commented 6 years ago

Buffer overflow due to data rate should not be an issue then, however the time stamping still needs be correct.

Changliu52 commented 6 years ago

Hi @mhkabir @priseborough Thank you for the spot-on suggestions!

I think buffer overwrite could well be the problem, because:

for the timestamp I have the camera-px4imu-slam hardware synchronized. although not yet compensate for shutter time yet (generally about 1-3ms), I am generally happy with it.
I am 100% sure I am streaming the slam at 11Hz, so if the px4 getting about 5Hz data, it has to be overwritten somehow.
I am aware that the slam pose has about 200ms delay, although with relatively accurate timestamp, it does need a big buffer for the delayed time horizon.

I will test again with lower streaming frequency (5Hz) and put the EKF2_ASP_DELAY to 200 to allow the buffer. Is 200ms the maximum value?

Update: I remember before I tested 400ms in EFK2_EV_DELAY, as the maximum shown in the description in QGC, but ekf simply doesn't initialise.

Kind Regards, Chang

priseborough commented 6 years ago

The 5Hz estimate was based on counting the number of logged vehicle_vision_position pose data points over a 3 second interval in the log. I counted 14. It had nothing to do with the EKF. The EKF is getting the data out of the buffers much less frequently than that- see the innovations.

From memory the EKF can compensate for time delays of the data relative to its time stamp time up to 250 msec. The EKF uses the IMU data time stamps as its internal time reference. If the time stamp of the external vision data accurately reflected the time of the EV measurement relative to the IMU data, then the EKF2_EV_DELAY parameter could be set to zero.

Looking at the code here: https://github.com/PX4/Firmware/blob/master/src/modules/ekf2/ekf2_main.cpp#L914-L931 it appears that either the vision position or vision attitude message timestamps can be passed to the EKF via the external vision interface, so that is something else to check.

Changliu52 commented 6 years ago

Awesome! When I set EFK2_ASP_DELAY to 250ms, everything works like a charm! Checkout the log for handhold test here: https://review.px4.io/plot_app?log=e5b1feb2-8f73-49ab-b1b0-6a08e28778de

X local position fusion here: screen shot 2018-04-05 at 21 53 01 2 Z local position fusion here:

Also, I figured, the reason for the rough 5Hz update rate is that the SLAM might only be able to catch up with 5 FPS, due to my recent increase in video resolution. I assume if PX4 receive the same vision pose twice with the same timestamps, it will result in overlapping points in the log, right?

Many thanks for the help!

Kind Regards, Chang

PX4 / PX4-Autopilot

Potential bug for EKF2 #9182