crust-firmware / crust

SCP (power management) firmware for sunxi SoCs
Other
150 stars 34 forks source link

Pinephone time goes too slow #203

Closed sergei-mironov closed 2 years ago

sergei-mironov commented 2 years ago

Type of issue

Bug report

Description

The hardware is Pine64-Pinephone v1.2; Linux 5.13.7; Crust v0.4 (buit with pinephone_defconfig)

I am expecting a constant hardware clock drift backward, i.e. the time goes too slow on the device.

I've made some tests comparing the Pinephone clocks with a remote source (a time server at time.nist.gov). I also repeated the measurements on my desktop PC ("A typical Host PC").

X axis is running time of the measurement script, in minutes Y axis is the time difference, in seconds

As you can see, the Ppinephone's hardware clock is late (bottom plot, orange line). At the time of suspend (gaps on the bottom plot), the orange line goes up despite the software efforts to adjust it during wake. I have stopped ntpd for the measurement session. The correction between 400 and 500 minute happened probably because of unknown time syncing agent which I am not aware of.

(Note, I also opened a similar topic on a Pine64 forum before I had an idea that the Crust may be related. Please feel free to close this issue if in fact it is not)

smaeul commented 2 years ago

This looks like your RTC is using the wrong clock source for some reason.

The A64 inside the PinePhone has three clock sources:

When the system is awake, the system clock runs from HOSC and is used to correct the RTC every 11 minutes (if you have CONFIG_RTC_SYSTOHC enabled in your kernel). This corresponds to the beginning/end of your graph. When the system goes to sleep, crust turns off HOSC to save power, thus pausing the system clock. Linux knows about this from a property in the device tree. So during resume, Linux uses the RTC to determine how long the system was asleep, and then applies that correction to the system clock. That explains why the system clock offset jumps to match the RTC offset.

Now why is the RTC offset so large?

The RC oscillator (IOSC) is extremely inaccurate. Its nominal frequency is 16 MHz, but in reality the one in the A64 runs at 12-13 MHz, decreasing with temperature about 10 kHz/°C. (Other Allwinner SoCs run closer to 16 MHz, but often with even stronger temperature/power dependence.)

On the other hand, the 32768 Hz external crystal should be quite accurate. The catch is that, to allow board designers to reduce cost, Allwinner made it optional. They added logic to the SoC to fall back to deriving LOSC by dividing from IOSC, instead of by using the crystal. LOSC is the clock source for the RTC, so if LOSC was being derived from IOSC, that would explain the 15-20% slowdown you observed.

However, the PinePhone does have the 32768 Hz external crystal present, and it is hooked up in the device tree. The RTC driver will see that device tree property and force use of the external crystal. So the RTC should be reasonably accurate.

Could you please paste the contents of /sys/kernel/debug/clk/clk_summary? Also, could you please use devmem2, busybox devmem, or similar to read the 32-bit words at 0x01f00000, 0x01f00004, and 0x01f00008? This would tell me both what Linux thinks the RTC clock source is, and what the hardware is actually doing. (Those registers are described in the "RTC" section of the A64 User Manual, if you are curious.)

sergei-mironov commented 2 years ago

@smaeul Thank you for the explanation! Now it is more or less clear to me what is the suspect.

First, I'd like to give some additional details on the measurement: I did stop ntpd on my host for the time of the measurement, but on the pinephone I actually don't have ntpd installed. Instead, time sync is likely done by the systemd-timesyncd daemon. This link should point to the base source of this daemon, as used on my system, minus the patches (but the patches are probably unrelated). Now, I think that I disabled it during the main part of the measurements, but I'm not sure. Probably, these saw-like parts image correspond to the moments where it was in fact enabled. They look like the 11-minute intervals when the kernel updates the RTC as instructed by clock_adjtime. Maybe I should repeat the measurements with more care. Anyway, I agree that something should be wrong with RTC.

Here is the contents of /sys/kernel/debug/clk/clk_summary of the pinephone: clk_summary.txt

Here is the devmem2 session (I assume the 32-bit word is a half-word, since it is arm64):

[root@nixos:~]# devmem2 0x01f00000 half-word
/dev/mem opened.
Memory mapped at address 0xffff8396b000.
Value at address 0x1F00000 (0xffff8396b000): 0x1

[root@nixos:~]# devmem2 0x01f00004 half-word
/dev/mem opened.
Memory mapped at address 0xffffad9a3000.
Value at address 0x1F00004 (0xffffad9a3004): 0x4

[root@nixos:~]# devmem2 0x01f00008 half-word
/dev/mem opened.
Memory mapped at address 0xffffb0721000.
Value at address 0x1F00008 (0xffffb0721008): 0xF

For reference, the below links point to the exact kernel config and the kernel itself, as used in my system:

Edit: From your explanation it seems that during awake, Linux should switch the clock source to HOSC regardless of LOSC/IOSC situation. May the slope of the orange line indicate that it in fact didn't happen (the clock source is left LOSC/IOSC after resume) ?

samueldr commented 2 years ago

(Note that with Mobile NixOS, the kernel build aborts if the config file in input differs from what is used by the build, so this is as good as dumping /proc/config.gz, if there were doubts.)

smaeul commented 2 years ago

Now, I think that I disabled it during the main part of the measurements, but I'm not sure. Probably, these saw-like parts image correspond to the moments where it was in fact enabled. They look like the 11-minute intervals when the kernel updates the RTC as instructed by clock_adjtime.

The 11-minute timer is inside the kernel, and runs as long as userspace told it "the clock is synchronized" at some point. It doesn't depend on userspace continuing to do anything. So it doesn't mean you still had a time synchronization daemon running.

On the other hand, the kernel stops updating the RTC if the system time is modified, and injecting the sleep duration counts as modifying the time. That explains why there is no sawtooth pattern in the middle of the graph after the first suspend/resume cycle.

clk_summary looks correct, with osc32k underneath ext-osc32k. And the first register value confirms that Linux selected the external crystal as the LOSC source. However, the second register value reports that IOSC is still actually being used. I'm not sure how that can happen... I will have to do some more investigation. In the meantime, you can try changing the value at 0x01f00008, which is the divider from IOSC to LOSC. Try setting it from 0xf to 0xc. If that improves the drift, then 1) it's a workaround until Linux is fixed, and 2) it confirms that IOSC is indeed being used, as otherwise the divider has no effect.

Edit: From your explanation it seems that during awake, Linux should switch the clock source to HOSC regardless of LOSC/IOSC situation. May the slope of the orange line indicate that it in fact didn't happen (the clock source is left LOSC/IOSC after resume) ?

No, the A64 RTC clock source can never be set to HOSC. That's why the RTC always drifts even when periodically corrected (the sawtooth). The system time uses an entirely different bit of timer hardware running from HOSC.

sergei-mironov commented 2 years ago

In the meantime, you can try changing the value at 0x01f00008, which is the divider from IOSC to LOSC. Try setting it from 0xf to 0xc. If that improves the drift, then

  1. it's a workaround until Linux is fixed, and
  2. it confirms that IOSC is indeed being used, as otherwise the divider has no effect.

Oh, look at this! I did devmem2 0x01f00008 h 0xc on the 28th minute and I can clearly see the result

image

ghost commented 2 years ago

I tried checking this too on my system, and I have value 1 in RTC register at +0x04 (indicating precise oscillator being used) I tried suspend/resume cycle and it didn't change anything.

Maybe the 32768Hz crystal oscillator is failing and increasing EXT_LOSC_GSM may help. (I believe this sets the gain in the positive loop of the oscillating circuit).

https://megous.com/dl/tmp/d6effd4e7215fbb0.png

bit 2 in +0x04 register is not documented in A64 manual, but it is in H6 one:

https://megous.com/dl/tmp/ef027f9296a9ae10.png

So this may be simply an issue with your particular phone's 32768Hz crystal.

sergei-mironov commented 2 years ago

On the next day the plot looks like this

image

The systemd-timesyncd daemon are running, and the last ntp synchronization happened in the 100..200 interval. (Why didn't the phone go to sleep after that - is another question, it was on usb charger, maybe it is somehow related). The thing is, with 0xC in the register 0x01f00008 the clock is running too fast which is actually better. And it means that the phone indeed uses a low-precision clock. Also, it looks like the orange line has a sharper slope near closer to the end of measurements. I imagine this could be because I took the phone for a walk (and it's pretty cold outside).

Speaking of workarounds, I wonder if it is possible to make the software not believe RTC after resumes or not. Some searching shows this place in the kernel sources, where they hardcode a 2 sec threshold when comparing two differences, but I don't understand this logic well enough.

For now, I am going to experiment with another values for the prescaler. Probably, putting 0xD would make it more suitable. Also I'm going to try forcing the LOSC switch by writing 0x1 to 0x01f00004.

Edit: I found that bit 0 of 0x01f00004 is readonly.

ghost commented 2 years ago

For example BSP RTC driver uses higher gain than the mainline driver, which uses the lowest gain.

BSP writes 0b10 to EXT_LOSC_GSM which is second highest possible gain.

sergei-mironov commented 2 years ago

Maybe the 32768Hz crystal oscillator is failing and increasing EXT_LOSC_GSM may help. (I believe this sets the gain in the positive loop of the oscillating circuit).

And it does!

Writing devmem2 0x01f00000 half-word 0xD switched reg 0x4 from 4 to 1 and the orange line goes flat immediately.

image

Later the daemon synchronized the clock and now the time delta remains in 2+-10 sec area, but this could also be a result of the measurement script's error.

image (an updated version)

So this may be simply an issue with your particular phone's 32768Hz crystal

Well, could be. It's doesn't look as a common problem, given the Pine64 forum reaction. I think I'll go with this solution, thank you very much!

ghost commented 2 years ago

Interesting. :) What's also interesting is that failover is disabled, so the RTC clock should have stopped completely, instead of running at wrong frequency. It looks like the hardware doesn't respect that, and fails over to RC oscillator anyway.

sergei-mironov commented 2 years ago

Interesting. :) What's also interesting is that failover is disabled, so the RTC clock should have stopped completely, instead of running at wrong frequency. It looks like the hardware doesn't respect that, and fails over to RC oscillator anyway.

I've updated the plot in my previous answer - now it covers interval 500..1400 minustes of testing. Hmm, I though that large failovers was a network delays - because they happened right after resume and both orange (RTC) and blue(system clock) lines are involved. The measurement script is dump and has no protection against e.g. slow wifi re-establishing.

But I do admit that in this new high-GSM mode both lines look noisier than they did in the common mode. Could also be a scale issue (not a hundreds of seconds as before), but who knows..

I am currently in the process of testing the system with systemd-timesyncd disabled, but it seems that the pattern is the same: noisy dots around 2 sec +- 10 (without larger failovers yet, and imo it still could be due to network delays). Still much better then before, since the clock runs fine in the common sence.

sergei-mironov commented 2 years ago

Should I consider a visible increase in power consumption in this high-GSM mode?

ghost commented 2 years ago

Use the lowest gain that works. No, you should not expect higher power consumption (than other users).

sergei-mironov commented 2 years ago

OK. Thanks again. I'm closing this for now.