PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License

[Bug] Time is not tracked correctly after power cycle or reboot #22033

Open dmammolo opened 11 months ago

dmammolo commented 11 months ago

Describe the bug

I observed that on some pixhawks the date and time are not tracked correctly across reboots or power cycles. This can be observed by running the date or system_time get command before and after a reboot. The issue does not occur on all pixhawks. I have already observed it on multiple holybro pix32v5 boards and also on a holybro pixhawk 6x.

Around 30% of the pix32v5 boards seem to suffer from it, and in the worst case the time jumps back significantly. I recently ran a few extensive tests with one pix32v5, where I observed the following:

It looks like the time is tracked around 3 times slower than it should be, although for shorter durations the factor seems smaller.

My current guess is that it is either a hardware issue (i.e. the LSE oscillator is ticking way slower than it should) or a misconfiguration of the RTC or LSE that leads to inconsistent behavior across pixhawks.

Note that while the pixhawk is running the time is tracked correctly; the time only jumps back after a reboot or power cycle.

To Reproduce

To reproduce you just need a pixhawk and an interface to the nsh console. Your pixhawk might be fine, though, since the issue is hardware dependent.

  1. Power on
  2. Wait a few minutes and execute date in the console
  3. Execute reboot
  4. Execute date again and compare the result with the time you recorded in step 2.
  5. If you have a faulty pixhawk you will observe a significant time jump.
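The comparison in step 4 can be scripted once both readings are copied from the console. The following is only a minimal sketch: it assumes date prints in the form seen in the console output elsewhere in this thread (e.g. "Sun, Sep 13 10:00:05 2020") with English day and month names, and the helper names parse_nsh_date and rtc_error are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed output format of the nsh `date` command,
# e.g. "Sun, Sep 13 10:00:05 2020" (English day/month names).
NSH_DATE_FORMAT = "%a, %b %d %H:%M:%S %Y"


def parse_nsh_date(line: str) -> datetime:
    """Parse one line of `date` output copied from the console."""
    return datetime.strptime(line.strip(), NSH_DATE_FORMAT)


def rtc_error(before: str, after: str, real_elapsed: timedelta) -> timedelta:
    """Reported elapsed time minus real (stopwatch) elapsed time.

    A negative result means the board's clock fell behind.
    """
    reported = parse_nsh_date(after) - parse_nsh_date(before)
    return reported - real_elapsed


# Reading right before the reboot, reading right after it, and the
# stopwatch time between the two (hypothetical: the few seconds the
# reboot itself takes are ignored here).
error = rtc_error(
    "Sun, Sep 13 10:03:05 2020",
    "Sun, Sep 13 10:02:13 2020",
    timedelta(seconds=0),
)
print(f"clock error across reboot: {error.total_seconds():+.0f} s")
```

With the two readings used above, the clock comes back 52 s behind where it was before the reboot.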

Expected behavior

The time should be tracked correctly across reboots, power cycles, and longer power-off durations.

Screenshot / Media

No response

Flight Log

-

Software Version

Release 1.13.3

Flight controller

Holybro: pix32v5, pixhawk 6x

Vehicle type

None

How are the different components wired up (including port information)

No response

Additional context

No response

dagar commented 11 months ago

Any thoughts here @davids5?

davids5 commented 11 months ago

@dmammolo Is the FC connected to a GPS or network, and not connected to QGC via USB?

The test I normally do is with no GPS, no QGC, and no network connection:

  1. Boot the system.
  2. Issue the date command: date -s "Sep 13 10:00:00 2020"
  3. Start a stopwatch and unplug the FMU.
  4. Wait 3 min or so.
  5. Power on the FMU and issue the date command: date
  6. Verify the time is within a second or 2 of the expected value.

@dmammolo Do you have access to a frequency meter or an oscilloscope? Can you measure the 32Khz xtal?

@vincentpoont2 Can you please verify this.

vincentpoont2 commented 11 months ago

We tested it on two Pix32 v5 and one Pixhawk 5X, and were unable to reproduce the issue. We will test a few more boards to see if we can find one that reproduces it.

RTC_Test_1 RTC_Test_2 RTC_Test_3

dmammolo commented 11 months ago

@davids5 I ran the following tests again:

  1. Similar to @vincentpoont2, with rebooting, and got the following: (Screenshot from 2023-09-08 09-52-46) I tracked the time, and exactly 3 min elapsed before rebooting. After the reboot you can see that the time jumped back by ~1 min.

  2. I tested by unplugging the USB for ~3 min. Weirdly, on my first try the time got reset to Jan 01 2000. (Note: I hadn't observed this behavior until now.) Nevertheless, I repeated the test after that and timed it. My stopwatch showed 03:08.48 elapsed, but this happened: (Screenshot from 2023-09-08 10-00-11)

  3. Test setup as you suggested (I used a debugger) where ~3:10min elapsed:

    
date -s "Sep 13 10:00:00 2020"

nsh> date
Sun, Sep 13 10:00:05 2020

If instead I let it run and reboot, I observe this:

nsh> date
Sun, Sep 13 10:00:05 2020
nsh> date
Sun, Sep 13 10:03:05 2020
nsh> reboot

NuttShell (NSH) NuttX-11.0.0
nsh> date
Sun, Sep 13 10:02:13 2020



> @dmammolo Do you have access to a frequency meter or an oscilloscope? Can you measure the 32Khz xtal?

Yes, I have access to an oscilloscope, but I would need some instructions on where the 32Khz xtal is located and how I can measure the frequency.

dmammolo commented 11 months ago

Note: I just ran the same tests on another pix32v5 on which I had observed similar issues. With this pixhawk I can't even set the date using date -s "Sep 13 10:00:00 2020": the time is updated before the reboot, but after a reboot or power cycle it is reset to the date it had before. Nevertheless, with the time it had before, I see the following:

nsh> date
Fri, Sep 08 08:26:57 2023
nsh> date
Fri, Sep 08 08:28:50 2023
nsh> reboot

NuttShell (NSH) NuttX-11.0.0
nsh> date
Fri, Sep 08 08:27:41 2023

With a third pix32v5, setting the date works, and I observe this with USB unplugging (03:08 min elapsed):

NuttShell (NSH) NuttX-11.0.0
nsh> date -s "Sep 13 10:00:00 2020"

NuttShell (NSH) NuttX-11.0.0
nsh> date
Sun, Sep 13 10:00:08 2020

davids5 commented 11 months ago

> We tested it on two Pix32 v5 and one Pixhawk 5X, and were unable to reproduce the issue. We will test a few more boards to see if we can find one that reproduces it.
>
> RTC_Test_1 RTC_Test_2 RTC_Test_3

@vincentpoont2 - the test cannot be run with QGC connected. Please use the debug console.

davids5 commented 11 months ago

There are 2 elements being tested: the crystal and the battery. This is why reset is not used.

> Start a stopwatch and unplug the FMU. Wait 3 min or so, power on the FMU and issue the date command: date

If the date is reset to Jan 1, then the battery is bad or dead.

If the xtal is not running then the date will be the last set value+time to boot.
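These two rules can be written down as a small check on the result of the power-cycle test. This is only a sketch of the reasoning above, not code from PX4 or NuttX: the function name, the outcome strings, and the 2 s tolerance are assumptions, and a reading earlier than the set time is used as a stand-in for "reset to Jan 1".

```python
from datetime import datetime, timedelta


def classify_power_cycle(
    set_time: datetime,
    read_time: datetime,
    real_elapsed: timedelta,
    tolerance: timedelta = timedelta(seconds=2),
) -> str:
    """Classify one power-cycle test.

    set_time     -- value passed to `date -s`
    read_time    -- value printed by `date` after powering back on
    real_elapsed -- stopwatch time between the two commands
    """
    if read_time < set_time:
        # The RTC lost its contents while unpowered (e.g. back to Jan 1).
        return "reset: battery bad or dead"
    advanced = read_time - set_time
    if real_elapsed - advanced > tolerance:
        # The RTC kept its value but did not count the unpowered time.
        return "behind: xtal not running while unpowered"
    return "ok"


set_time = datetime(2020, 9, 13, 10, 0, 0)
off = timedelta(minutes=3, seconds=8)

print(classify_power_cycle(set_time, datetime(2020, 9, 13, 10, 0, 8), off))
print(classify_power_cycle(set_time, datetime(2000, 1, 1, 0, 0, 10), off))
print(classify_power_cycle(set_time, datetime(2020, 9, 13, 10, 3, 9), off))
```

The first call uses the reading reported earlier in this thread (10:00:08 after 03:08 min unplugged); the other two readings are made-up examples of a reset and of a healthy board.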

dmammolo commented 11 months ago

The reset to Jan 1 was a one-time thing and yes, it might be related to the battery.

Regarding the xtal, what you describe seems to match the case where it is not running, i.e. after powering on it still has the same time as before plus the time to boot. Would you say that the xtal is broken, or just not set up correctly in SW? During a reboot it behaves differently, though: the time advances, but not as much as it should. Can you explain that behavior?

davids5 commented 11 months ago

The RTC's time is used at boot to set the NuttX time. NuttX time then marches along off the system tick interrupt.

Systems without an RTC xtal use the HSE (high-speed external) oscillator. Time set operations (from GPS, QGC, command line, etc.) set the RTC. When power is removed, the RTC stops counting and retains the last set time if the battery is good.

Systems with an RTC xtal use the LSE (low-speed external) oscillator. Time set operations (from GPS, QGC, command line, etc.) set the RTC. When power is removed, the RTC keeps counting from the last set time if the battery is good.
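A toy model may make the two cases easier to compare. This is a sketch, not NuttX code: the class and its fields are invented for illustration, time is counted in whole seconds, a dead battery is modelled as the RTC falling back to 0, and it assumes the RTC keeps counting while the board is powered in both configurations.

```python
from dataclasses import dataclass


@dataclass
class ToyBoard:
    rtc_on_lse: bool           # True: RTC clocked from the 32 kHz LSE xtal
    battery_good: bool = True  # backup supply for the RTC domain
    rtc: int = 0               # RTC value in seconds
    system_time: int = 0       # NuttX time in seconds

    def boot(self) -> None:
        # The RTC is read once at boot to seed the system time.
        self.system_time = self.rtc

    def set_time(self, now: int) -> None:
        # Time set operations (GPS, QGC, command line) also set the RTC.
        self.system_time = now
        self.rtc = now

    def run(self, seconds: int) -> None:
        # While powered, system time marches off the system tick.
        # Assumption: the RTC counts as well in both configurations.
        self.system_time += seconds
        self.rtc += seconds

    def power_off(self, seconds: int) -> None:
        if not self.battery_good:
            self.rtc = 0            # contents lost
        elif self.rtc_on_lse:
            self.rtc += seconds     # LSE keeps the RTC counting
        # HSE-clocked RTC: value is retained but does not advance


for name, board in [
    ("LSE", ToyBoard(rtc_on_lse=True)),
    ("HSE", ToyBoard(rtc_on_lse=False)),
    ("dead battery", ToyBoard(rtc_on_lse=True, battery_good=False)),
]:
    board.set_time(1000)
    board.run(10)
    board.power_off(180)
    board.boot()
    print(name, board.system_time)
```

In this model only the LSE board with a good battery accounts for the 180 s spent unpowered; the HSE board resumes from the value it had at power-off.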

junwoo091400 commented 11 months ago

Wow, didn't even know this functionality existed. It's kinda mind-blowing that the system can keep track of time without an RTC while being unpowered? How is it even really working?

dmammolo commented 11 months ago

@vincentpoont2 any luck finding a pixhawk?

@davids5 Here is a screenshot from a pix32v5 showing where I think the xtal is located. You can see the two HSE xtals (16 and 24 MHz); on these I can measure the signal with an oscilloscope. I assume the 32kHz LSE xtal is in between, but on that one I can't measure anything, and this is also the case on a pixhawk where the RTC is working. Should there be a measurable signal if the xtal is actually used? If it is not used, which xtal is actually used such that the RTC works with just the small integrated battery?

image

davids5 commented 11 months ago

Ok well looking at the board config

The RTC is not set to use the LSE.

https://github.com/PX4/PX4-Autopilot/blob/main/boards/holybro/pix32v5/nuttx-config/include/board.h#L72
https://github.com/PX4/PX4-Autopilot/blob/main/boards/holybro/pix32v5/nuttx-config/nsh/defconfig#L199

dmammolo commented 11 months ago

I am using px4 fmu v5, which also works on pix32v5. Also, when flashing over QGC it flashes the px4 fmu v5 version: HW arch: PX4_FMU_V5 (output of ver all). I also observed the same issue on a pixhawk 6x, which has the LSE configured.

But you may have pointed to an issue that also exists on px4 fmu v5, see:
https://github.com/PX4/PX4-Autopilot/blob/main/boards/px4/fmu-v5/nuttx-config/include/board.h#L72
https://github.com/PX4/PX4-Autopilot/blob/main/boards/px4/fmu-v5/nuttx-config/nsh/defconfig#L198 (This is configured as LSE)

davids5 commented 11 months ago

The LSE is used to march time while not powered.

Things that can mess this up:

  1. The bootloader may be resetting the RTC - that was fixed at some point.
  2. The LSE is not starting. There is code to step up the drive to try to get the LSE to start. You can disable that and set values. See https://github.com/PX4/NuttX/blob/c23b72dffeb0de0d1a836ab561eb9169c4a9e58e/arch/arm/src/stm32f7/stm32_lse.c#L105C8-L105C56
  3. Bad Xtal or solder joint.

dmammolo commented 11 months ago

@davids5 I think I have figured out the issue and got the RTC working on one of my pix32v5, but I am not sure how to fix it cleanly. Maybe you could help on the NuttX side.

The main issue IMO is that the LSE xtal might not be running (because it failed to start, or stopped working after the battery drained?) and the nuttx implementation does not check that and only tries to start the LSE once. See these lines of code:

IMO the nuttx code is missing a check if the LSE xtal or probably any currently used clock is still running and if not tries to re-start it?

Further, I think the following changes are required on the px4 side:

All of the above changes might need to be applied to all supported STM32 chips and pixhawk boards.

dmammolo commented 11 months ago

I further observed that with CONFIG_STM32F7_RTC_AUTO_LSECLOCK_START_DRV_CAPABILITY it sometimes drives the xtal with RCC_BDCR_LSEDRV_LOW and sometimes with RCC_BDCR_LSEDRV_MEDLO. Whenever it drives it with RCC_BDCR_LSEDRV_LOW only, the RTC/xtal does not work.
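To make the idea of stepping up the drive and re-checking the oscillator concrete, here is a toy sketch of the control flow only. It is not the NuttX implementation: lse_ready is a hypothetical probe standing in for "apply this drive level and wait for the LSE ready flag", the retry count is arbitrary, and the two strongest level names (RCC_BDCR_LSEDRV_MEDHI, RCC_BDCR_LSEDRV_HIGH) are assumed to follow the same naming as the two mentioned above.

```python
from typing import Callable, Optional

# Drive levels in increasing strength. LOW and MEDLO are named in this
# thread; MEDHI and HIGH are assumed names for the two stronger levels.
DRIVE_LEVELS = [
    "RCC_BDCR_LSEDRV_LOW",
    "RCC_BDCR_LSEDRV_MEDLO",
    "RCC_BDCR_LSEDRV_MEDHI",
    "RCC_BDCR_LSEDRV_HIGH",
]


def start_lse(
    lse_ready: Callable[[str], bool],
    attempts_per_level: int = 3,
) -> Optional[str]:
    """Try each drive level in turn until the oscillator reports ready.

    Returns the level that worked, or None if the LSE never started.
    """
    for level in DRIVE_LEVELS:
        for _ in range(attempts_per_level):
            if lse_ready(level):
                return level
    return None


def ensure_lse_running(
    is_running: Callable[[], bool],
    lse_ready: Callable[[str], bool],
) -> bool:
    """Check at every boot instead of trusting an earlier start-up."""
    if is_running():
        return True
    return start_lse(lse_ready) is not None


# Hypothetical crystal that does not start at the lowest drive level.
def weak_xtal(level: str) -> bool:
    return level != "RCC_BDCR_LSEDRV_LOW"


print(start_lse(weak_xtal))
print(ensure_lse_running(lambda: False, weak_xtal))
```

For the hypothetical weak_xtal probe, the search settles on RCC_BDCR_LSEDRV_MEDLO. A real fix would also have to decide whether a level that reports ready is actually enough to keep the crystal running on backup power, which this sketch does not model.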

davids5 commented 11 months ago

> IMO the nuttx code is missing a check if the LSE xtal or probably any currently used clock is still running and if not tries to re-start it?

If the battery dies, the magic should be erased. I can see there is a possibility of the LSE being turned off in some revs of the bootloader. This needs to be checked. I can imagine the LSE not running and needing to be restarted, so that may be a good addition to the rcc code.

As for the fmuv5 changes: we added the xtal in the design but left the HSE being used because that is how the F4 FMUs (fmu-V[1-4]) were.

I am not keen on changing that, as it works and, as you can see, there are other 32khz issues. We do need to ensure V5x and V6x are functioning correctly.

vincentpoont2 commented 11 months ago

@davids5

We tested 10 Pixhawk 6X using this method.

  1. After powering on via a Power Module, enter the date command in the debug console.
  2. Enter the date command again after two minutes.
  3. Enter the date command again after five minutes.
  4. Power off for one minute, power on again via PM and enter the date command.

It seems like only 8 out of the 10 Pixhawk 6X failed to track the time after being powered off for 1 min.

image

We checked the farad capacitor voltage retention time on 3 failed units, and they seem to be fine.

image

davids5 commented 11 months ago

@vincentpoont2 Please retest one of the failed units with the command

date -s "Sep 13 10:00:00 2020"

This ensures we can tell if it is a non-powered RTC.

Can you measure the voltage at the CPU or on a trace that goes to it?

Can you measure the 32Khz xtal?

What bootloader is installed? Can you send me the bin?

If you cannot isolate the failure from HW, you can use a JTAG to look at the RCC/RTC and BK registers and see if that provides more information.

davids5 commented 11 months ago

@vincentpoont2 Also steps 2 and 3 are useless because the RTC is only used at boot to get the time. It is then marched onward by the nuttx system tick.

dmammolo commented 10 months ago

@vincentpoont2 @davids5 any progress here?

vincentpoont2 commented 9 months ago

> @vincentpoont2 Please retest one of the failed units with the command
> Can you measure the voltage at the CPU or on a trace that goes to it?

Does this trace voltage refer to the voltage where the farad capacitor is connected to the CPU pin? We measured about 3.0V there; it holds for about 3 to 4 hours after power is removed, and then the voltage drops below 2.0V.

> Can you measure the 32Khz xtal?

We can't measure the 32.768 kHz XTAL waveform with an oscilloscope, either on boards with a normal RTC or on boards with an abnormal RTC. We suspect that the XTAL amplitude is too small to be measured by our oscilloscope. We are looking for a way to test this.

davids5 commented 9 months ago

@vincentpoont2 If you want me to help resolve this, please send me a production unit that fails and one that passes.

dmammolo commented 5 months ago

@davids5 I just found a recent PR from you: https://github.com/PX4/PX4-Autopilot/pull/22503 I tested it on my pixhawk6x using the LSE clock and it seems to fix this issue (at least on this one I just tested). Could there still be a similar issue on STM32F7 pixhawks?

Correction: on a second pixhawk6x it does not work :(