Orientation estimation error and delay investigation

matemat13 commented 9 months ago

This issue documents our findings so far regarding inaccuracies and delays in orientation estimation when using PixHawk with MAVROS and the MRS UAV state estimators. It should also be the place to discuss possible remedies and ideas for improvement.

Context

We have noticed that during agile maneuvers with high angular velocities, the projection of points from a 3D LiDAR to a static world frame is not stable. Upon further investigation it turns out that this is caused by inaccurate estimation of the UAV's own orientation. This problem increases with increasing angular velocities.

This is a problem for basically all situations where flight with high angular velocities is required, which mostly relates to Eagle and agile RL planning.

Methodology

To test the hypothesis that the orientation estimation error correlates with angular velocity around the respective axis, I've set up a Gazebo simulation with a UAV flying using the SE(3) controller and fast constraints along a line segment trajectory oriented with the Y axis with the rtk estimator and with ground truth enabled. The rosbag is available here.

Three topics were used:

/uav1/mavros/imu/data - raw estimate from PixHawk's internal EKF2 algorithm as published by MAVROS.
/uav1/estimation_manager/rtk/odom - output of the MUS estimator which republishes data from the first topic.
/uav1/ground_truth - the ground truth data from Gazebo.

Data from either of the first two estimate topics were compared to the ground truth separately using the following method:

The ground-truth orientations were interpolated to the time stamps from the estimate messages using SciPy's slerp implementation.
The orientations were converted from quaternions to roll-pitch-yaw representation using PlotJuggler's algorithm.
All messages were separated to 19 bins according to the corresponding estimated angular velocity at the time. The bins divide the total range of angular velocities equally.
For each bin, the mean and the standard deviation of the error between the ground-truth pitch angle and the estimated pitch angle were calculated from the corresponding messages.

Optionally, before step 3, one of these two pre-processing steps were applied:

The ground-truth orientations were time-aligned to the estimates by minimizing the mean angle error of the pitch using gradient descent (optimize_time).
- This way, the delay between the two time series was also esimated.
The estimated angles were predicted by the estimated delay using the current velocity estimate to compensate the delay (predict_angle).

The tmux session and evaluation scripts are available in this repo.

What we found out so far

For both estimation topics, three variants were evaluated:

Pitch error with respect to the original (interpolated) ground truth.
Pitch error with respect to the ground truth time-aligned to the estimate using gradient descent (and interpolated).
Error of the pitch angle predicted by the estimated delay using the current pitch velocity estimate with respect to the (interpolated) ground truth.

The resulting graphs are shown below.

MAVROS output:

MAVROS roll error, raw MAVROS roll error, optimized time delay MAVROS roll error, predict angle by 9.6ms

MUS estimator:

MUS estimate roll error, raw MUS estimate roll error, optimized time delay MUS estimate roll error, predict angle by 16ms

Delays

Findings:

Even when using the message's timestamp (and not the time of arrival), the orientation estimates from PixHawk published by MAVROS are delayed cca 10ms (estimated by the pre-processing step a) described above).
The data coming from the MUS estimation manager are delayed by 16ms behind the ground truth (so ~6ms in addition to raw MAVROS data, again using timestamps).
- This is caused by two factors:
  1. MUS estimation manager publishes the output in a timer using the latest received data and with the current time (ros::Time::now()) as the timestamp, not with the timestamp of the data (which comes from multiple sources).
  2. By default in simulation, the PixHawk publishes the orientation (topic /uav1/hw_api/orientation) only on 30Hz instead of 100Hz -- this can be fixed by setting export OLD_PX4_FW=true in ~/.bashrc. This is fixed in ctu-mrs/mrs_uav_gazebo_simulation@721fdd69298c1c8b1fe9141d8d6efe9a06de2ac8.
The delays can be compensated pretty well using simple prediction (see the graphs above) of RPY angles using the corresponding angular velocity estimates.

Questions and ideas:

Where is the MAVROS delay coming from? Shouldn't the messages coming from PixHawk be timestamped?
We could compensate the delays in the MUS estimator, but there may be some problems with gimbal lock and with estimation of the dely with real-world hardware.
We should consider making a real-world rig for testing these delays with rw hardware - consider Kalibr.
The MUS estimator could be rewritten to publish the pose estimates with the timestamp of the orientation, but this may have to be a separate topic for control. Also, what about TFs?

Errors

Findings:

There is a mean error (blue lines in the graphs above) probably caused by bias, which is more or less constant for different angular velocities. This should be possible to compensate using good calibration.
There is also a standard deviation error (orange lines) which increases with the angular velocity in the respective axis.
- For some reason, this error is higher (about double) for the orientations coming from the MUS estimator.
- This error probably comes partially from the PixHawk's EKF linearization and partially from the sensor itself.

Questions and ideas:

How is this possible that the standard deviation error is higher coming from MUS than from MAVROS?? Maybe an evaluation error?
- Hypothesis: This may be caused by delay jitter because of the way the estimator output is published. How to test this?
Although it may not be possible to remove the standard deviation error, we can try to reduce it with better sensors and estimators.
Furthermore, we can measure its dependency on the angular velocity (or possibly also acceleration?) and take it into account in our algorithms.
Again, we should consider measuring the accurate error characteristics using a rw testing rig.

Methodology and testing problems

Findings:

PlotJuggler seems to struggle with displaying data at high real-world rates (@Parakh has more insight on this, we should add him here). To get accurate plots, you need to slow down the rosbag/simulation so that the topic you're displaying is only about ~50Hz max. (probably even lower rate is better).
Simulation runs only at 250Hz by default, which may not correspond well to the real-world IMU rates (which are at 1kHz IIRC). Also, we should make sure that the PX4 EKF2 estimator runs at the same rate in simulation as real-world.

penickar commented 9 months ago

Great overview! I am very much interested in this as well since we noticed the same issue inside control and we compensate for it using feed-forward estimation using UAV dynamic model and history of sent commands.....

matemat13 commented 9 months ago

Updated the original post with data measured on a more complex trajectory and with the fix in simulation to correctly publish data from PixHawk at 100Hz instead of 30Hz, which significantly reduced the delay introduced by the MUS estimator (although not 100%).

penickar commented 9 months ago

Thanks for the work!!! I would also like to ask a question regarding the angular velocity data quality from the estimator compared to the one from px4. It seems like the estimator is too many times publishing the same value or in general is less smooth than the px4 data it uses... Is there some reason for it? omegas

matemat13 commented 9 months ago

Thanks for the work!!! I would also like to ask a question regarding the angular velocity data quality from the estimator compared to the one from px4. It seems like the estimator is too many times publishing the same value or in general is less smooth than the px4 data it uses... Is there some reason for it?

This is probably a question to @petrlmat, but we've also observed this strange behavior. Interestingly, it only seems to manifest for the uav_state topic and not the odom topic, so in the meantime, I recommend using that (unless you need also the accelerations...). I suspect this may possibly be due to some mutexing in the MUS estimation manager.

Also - do we know if PlotJuggler displays the data according to timestamps or to the time of arrival of the messages? Because when I tried generating similar graphs as in the OP for the uav_state message, I didn't see an increase in error (although from the graphs in PlotJuggler as you've shown the error is obviously higher), so it may be that the uav_state is published repeatedly with the same angular velocities but also the same timestamp... which would still be weird, but less "wrong" in some sense. It may also help shine some light on why it is like this.

In any case, a deeper investigation of the difference between how the uav_state message vs the odom message are published is needed.

Edit: Hmm I cannot seem to reproduce the issue with odom vs uav_state difference now. I'll try it on my home PC where I first observed it, but it may be a timestamp vs PlotJuggler bug (see the original post) problem.

petrlmat commented 9 months ago

Thanks for the work!!! I would also like to ask a question regarding the angular velocity data quality from the estimator compared to the one from px4. It seems like the estimator is too many times publishing the same value or in general is less smooth than the px4 data it uses... Is there some reason for it?

Could you please elaborate more on this issue? I don't see the scale of the trajectory and the individual datapoints. It seems that the estimation is just running at a lower rate than the px4 data is published. The rate of estimation depends on the control manager and is not tied to the rate of px4, sensors, etc. You can check the desired estimation rate at the topic /uav1/control_manager/diagnostics. The desired rate depends on the output modality of our control, e.g., if you want to control the UAV using velocities, the estimation rate will be 20 Hz.

Using the current master branches and default tmux session I don't observe this behavior.

petrlmat commented 9 months ago

Thanks for the work!!! I would also like to ask a question regarding the angular velocity data quality from the estimator compared to the one from px4. It seems like the estimator is too many times publishing the same value or in general is less smooth than the px4 data it uses... Is there some reason for it?

This is probably a question to @petrlmat, but we've also observed this strange behavior. Interestingly, it only seems to manifest for the uav_state topic and not the odom topic, so in the meantime, I recommend using that (unless you need also the accelerations...). I suspect this may possibly be due to some mutexing in the MUS estimation manager.

Also - do we know if PlotJuggler displays the data according to timestamps or to the time of arrival of the messages? Because when I tried generating similar graphs as in the OP for the uav_state message, I didn't see an increase in error (although from the graphs in PlotJuggler as you've shown the error is obviously higher), so it may be that the uav_state is published repeatedly with the same angular velocities but also the same timestamp... which would still be weird, but less "wrong" in some sense. It may also help shine some light on why it is like this.

In any case, a deeper investigation of the difference between how the uav_state message vs the odom message are published is needed.

The angular velocities in uav_state and odom_main msgs should be the same. The header and the angular velocities are simply copied during conversion.

  static nav_msgs::Odometry uavStateToOdom(const mrs_msgs::UavState& uav_state) {
    nav_msgs::Odometry odom;
    odom.header              = uav_state.header;
    odom.child_frame_id      = uav_state.child_frame_id;
    odom.pose.pose           = uav_state.pose;
    odom.twist.twist.angular = uav_state.velocity.angular;

    tf2::Quaternion q;
    tf2::fromMsg(odom.pose.pose.orientation, q);
    odom.twist.twist.linear = Support::rotateVector(uav_state.velocity.linear, mrs_lib::AttitudeConverter(q.inverse()));

    return odom;
  }

The odom_main message can get published later than uav_state because it is also reading the covariance from the estimators (which should probably be done sooner when reading the state of the estimator) but the timestamp should be the same.

PlotJuggler displays the data according to the time of arrival by default. If you want to use timestamp, you need to check the box at the top.

petrlmat commented 9 months ago

Updated the original post with data measured on a more complex trajectory and with the fix in simulation to correctly publish data from PixHawk at 100Hz instead of 30Hz, which significantly reduced the delay introduced by the MUS estimator (although not 100%).

@matemat13 can you please check the delay with this branch? The publishing is then controlled by the rate at which orientation msg arrives and also the timestamp is taken from the orientation msg so the delay should disappear.

I don't know if this would be a viable fix as it would make the estimation rate depend on the rate of orientation msg. Maybe it would be possible to use orientation-triggered state publishing for attitude and lower-level output modalities and leave it as it is for the higher-level ones? @klaxalk

klaxalk commented 9 months ago

@petrlmat yes, it would make sense to publish the rates and orientation separately at higher rate and leave the whole uav_state at the rate the control manager requests.

klaxalk commented 9 months ago

Oh, I probably misunderstood your suggestions. So when the lowest possible control modality would be attitude or lower, you have the rate based on the incoming attitude rate/attitude?

matemat13 commented 9 months ago

Because the output of the estimators is not used only as input for the control, I don't think it's a great idea to design it only with control in mind. For e.g. mapping and object localization in a world frame, it is necessary to know the UAV pose as accurately as possible (including a correct timestamp). For agile trajectory planning, it is even important to know the full UAV state (not just pose) and also with as short delay as possible.

So I propose to design the estimation with accuracy in mind first, meaning publishing the data as quickly as possible at the highest rate available and with correct timestamps (probably using prediction to achieve accurate fusion of data from sensors with different rates). If control (or other modules) need some data at a specific rate, it should be published on a specialized topic for this purpose in my opinion. What do you think @penickar, @petrlmat, @klaxalk?

I think that the new orientation_triggered_pub branch Matěj made is a good step in this direction. I'll test it to see if the results correspond to our expectations (there should now be no timestamp delay with respect to messages directly from MAVROS).

parakhm95 commented 9 months ago

@matemat13 Do you know how the timestamps are handled between Mavlink and Mavros? Because I know that the Pixhawk has its own timestamps that come from hrt_absolute_time() and then ros has its time from ros::Time::now(), are they kept independent of each other and timestamps are assigned as the interface is crossed from mavlink to mavros and vice-versa?

matemat13 commented 9 months ago

@matemat13 Do you know how the timestamps are handled between Mavlink and Mavros? Because I know that the Pixhawk has its own timestamps that come from hrt_absolute_time() and then ros has its time from ros::Time::now(), are they kept independent of each other and timestamps are assigned as the interface is crossed from mavlink to mavros and vice-versa?

Good question! I asked Matěj, Dan and Tomáš (I hope I remember correctly, if not, then sorry guys :D) and they all confirmed that there is some kind of time synchronization going on between the onboard PC and PX4.

Looking into MAVROS, the time synchronization seems to be triggered by the time/timesync_mode ROS parameter, which is set to MAVLINK in the HW API. This should enable time synchronization (or rather offset compensation) on MAVROS's side.
If the timesync mode is set to MAVLINK or ONBOARD, then this algorithm is implement the offset calculation which is then used here to correct the timestamp received from PixHawk. It seems to be the classic algorithm based on the assumption that the transport delay is the same both ways (MAVROS -> PixHawk and PixHawk -> MAVROS).
The estimation algorithm has several tunable parameters (loaded from here to here). It is hard to say how well chosen these are...
This timestamp correction is applied to the timestamps received with the messages from PixHawk, such as here.

In summary - yes, I think the timestamps of the messages coming from MAVROS should be fairly accurate assuming that the delay estimation has converged, that the estimation parameters are well chosen, and that the transport delay is the same both ways. We could try playing around with the parameters to see if we can get a better estimate. I don't see another way to verify whether the ~10ms MAVROS delay is internal to PixHawk or communication-induced. Any ideas?

Edit: I wonder if bandwidth imbalance may cause an imbalance in the timesync message transport delays. I would guess that we're sending much less data to PixHawk than what is coming from PixHawk, no? That may cause the messages to be delayed more when coming from PixHawk simply because of a sending queue.

Also I assume that the communication bitrate is the same both ways, is that right @DanHert ?

ctu-mrs / mrs_uav_state_estimators