ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
11.02k stars 17.57k forks source link

Long loops on MatekH743 - 16 bit system timer bug? #18383

Closed tridge closed 3 years ago

tridge commented 3 years ago

We have 3 examples now of logs showing long loops on MatekH743 boards. They look like this:

2021-08-01 13:54:39.96: PM {TimeUS : 235578505, NLon : 1, NLoop : 4000, MaxT : 68030, Mem : 576512, Load : 240, ErrL : 0, IntE : 0, ErrC : 0, SPIC : 2578676, I2CC : 61916, I2CI : 126129, Ex : 0}
2021-08-01 14:01:00.02: PM {TimeUS : 615643327, NLon : 1, NLoop : 4000, MaxT : 68036, Mem : 576512, Load : 316, ErrL : 943, IntE : 1048576, ErrC : 2, SPIC : 6823323, I2CC : 161189, I2CI : 327637, Ex : 0}

the really notable feature of this issue is that the MaxT is always very close to 68036 us. That is 65536 + 2500, where 2500 is the normal loop time at 400Hz. These boards have 16 bit system timers (only a few boards have 16 bit system timers) which means the low level system timer wraps at 65536us, so I think the bug is likely in the handling of that wrap in hrt.c. I can't yet see the bug, but we should look at it carefully

andyp1per commented 3 years ago

Interesting! We could always give it a 32bit system timer by sacrificing something else.

tridge commented 3 years ago

Interesting! We could always give it a 32bit system timer by sacrificing something else.

yes, but I'd rather fix the bug

tridge commented 3 years ago

I think this graph proves that it really is a delay in wall clock time, and not just an error in accounting (ie. not a slip in our time measurment): image a GPS sample happened to be due during the 68ms gap, and that sample was delayed. If we just had an accounting error then we would have seen the gap between the two final GPS points in that graph separated by 200ms. This shows the GPS sample really was delayed (by 40ms)

tridge commented 3 years ago

the bug really has to either be in on of the following:

andyp1per commented 3 years ago

Could this cause a flyaway? There was a report of someone with a flyaway on 4.1 using this board when switching to auto.

rmackay9 commented 3 years ago

@andyp1per, I think that's unlikely. So far I've only heard of it causing an internal error because the position controller is more strict in 4.1 about timing and it logs this internal error if there is more than a 13ms delay.

peterbarker commented 3 years ago

We believe this was fixed with #18416 - strictly by the issue title this is all done. Better fixes are coming.