bird-sanctuary / bluejay

:bird: Digital ESC firmware for controlling brushless motors in multirotors
GNU General Public License v3.0

Potential shot-through condition #187

Open tobbeanton opened 8 months ago

tobbeanton commented 8 months ago

Describe the issue

We are testing Bluejay for our upcoming Crazyflie 2.1 - brushless. As part of this we run autonomous flight tests over and over again; we jokingly call it the infinite flight test. What we have noticed is that sometimes it just resets in mid-air, and we immediately suspected the ESC. We built a small test rig where we can measure the MOSFET signals as well as the battery voltage, and cycle the PWM between 10% and 100% every 300 ms. This way we managed to capture the voltage dip and found an H-bridge shoot-through condition. This usually happens within a minute using this test setup.

[image]

As can be seen in the image, the Bc and Bp MOSFET signals are both on for a short period of time, causing the shoot-through. This happens in the transition from braking to accelerating, where it looks like Bp is one PWM cycle late (or Bc early). I'm pretty sure this happens for the other phases as well, but I don't have a capture of it.

We also tried BLHeli_S 16.7, on which we could not detect the shoot-through condition.

The full capture is attached and can be viewed using the Saleae Logic software: Shot-through-mosfet-channels.zip
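
A capture like this can also be screened for overlaps automatically. The following is a minimal Python sketch, not part of the original report. It assumes the capture has been converted to a list of (time, Bc, Bp) state-change rows in which 1 means the MOSFET is driven on; the function name `find_overlaps` and the row layout are hypothetical, and the real export format and signal polarity of the test rig may differ.

```python
from typing import List, Optional, Tuple

# One state-change row: (time in seconds, Bc state, Bp state).
# ASSUMPTION: 1 means "MOSFET driven on"; check the polarity of the real hardware.
Sample = Tuple[float, int, int]


def find_overlaps(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) intervals during which both signals are on."""
    overlaps: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for t, comp, power in samples:
        both_on = comp == 1 and power == 1
        if both_on and start is None:
            start = t                      # overlap begins at this row
        elif not both_on and start is not None:
            overlaps.append((start, t))    # overlap ends at this row
            start = None
    if start is not None:                  # capture ended while both were on
        overlaps.append((start, samples[-1][0]))
    return overlaps


if __name__ == "__main__":
    # Made-up example rows, not taken from the attached capture.
    demo = [(0.0, 0, 1), (1.0, 0, 0), (2.0, 1, 0), (3.0, 1, 1), (3.5, 0, 1)]
    for begin, end in find_overlaps(demo):
        print(f"both on from {begin} s to {end} s")
```

Any interval this reports is a candidate shoot-through window and is worth checking against the battery voltage channel.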

Bluejay version

0.19.2 & 0.20.1-RC2

ESC variant

O_H_10

PWM frequency

48

DShot bitrate

300

Bidirectional DShot

Off

FC firmware

Crazyflie 2024.2

Motor size

08028

Configurator debug log

No response

stylesuxx commented 8 months ago

Interesting, thank you for the detailed report. Did you by any chance try the 24 kHz PWM setting too? Also have you tried increasing dead-time to 15?

cc @damosvil - do you have any input on this, or other things you want to see tested?

tobbeanton commented 8 months ago

We recently tried 24 kHz PWM and found the same issue, but now on phase C.

[image]

And here are the settings we used for this capture:

[image]

stylesuxx commented 8 months ago

Also have you tried increasing dead-time to 15?

Just in case you missed it.

tobbeanton commented 8 months ago

Due to the rarity of this appearing, I'm guessing it is a timing issue: probably some interrupt that triggers at just the right time, causing the PWM register updates to be slightly separated in time.

tobbeanton commented 8 months ago

Just in case you missed it.

I did not try dead-time 15. It would be strange if this was the cause, but better safe than sorry. Will try tomorrow.

damosvil commented 8 months ago

It seems to be a problem related to setting the PCA registers, but checking the BLHeli_S and Bluejay source code, it seems they both set those registers in the same place:

https://github.com/bitdump/BLHeli/blob/ef8c1a0b644c228f07a82f3d25e6d581492eaacf/BLHeli_S%20SiLabs/BLHeli_S.asm#L1492

https://github.com/bird-sanctuary/bluejay/blob/b803b8de71a0f8f12c30b8c105d1ab9a3b287d77/Bluejay.asm#L1110

It seems that on some occasions Xp starts working a PWM cycle before Xc (the updates of Xp and Xc are not synchronized to the PWM cycle), which agrees with the code. What I don't understand is why you cannot reproduce the same issue in Blheli_S, because both codebases do the same.

Have you found any pattern to reproduce this issue? How frequent is it on your hardware? Could you toggle one of the LED GPIOs before updating the PCA registers and also scope it? If you need a customized firmware to do this, let us know. Could you check if you can also reproduce this issue with Bluejay 0.16?

tobbeanton commented 8 months ago

What I don't understand is why you cannot reproduce the same issue in Blheli_S, because both codebases do the same.

Let us try longer and perhaps we can replicate it in Blheli_S too.

could you check if you can also reproduce this issue with Bluejay 0.16

Yes, we could reproduce it, and this time it happened in the middle of braking, not in the change from braking to accelerating.

[image]

could you alternate one of the led GPIOs before updating the PCA registers and also scope it? - If you need a customized fw to do this let us know.

We will give it a try

damosvil commented 8 months ago

I have been talking with Alka (the creator of AM32) and he suggests that this might be a problem related to not using a gate driver like the FD6288, which implements shoot-through prevention. He also said that ARM MCUs generate the complementary PWM in hardware, so it seems to be an issue related to tinywhoop hardware in general that uses EFM8BBx MCUs.

[image]

damosvil commented 8 months ago

What we can do for the next version is not update the PCA registers if the PCA counter is about to expire. This way Xp and Xc will be updated in sync with the PCA cycle. This would fix the issue, and I think it would not hit performance noticeably. Another option, though only a mitigation, would be to first update the low parts of the power and damp registers and then update the high parts together, so the issue would probably happen about 50% less often, without hitting performance.
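
The race and the proposed guard can be illustrated with a small model. This is a hedged Python sketch, not Bluejay code and not an accurate model of the EFM8 PCA: it assumes a value written to an auto-reload register takes effect at the next counter overflow, measures time in counter ticks, and uses hypothetical names (`effective_cycle`, `writes_split`, `safe_to_write`) and a hypothetical `write_ticks` cost for the write sequence.

```python
def effective_cycle(write_time: int, period: int) -> int:
    """Index of the first PWM cycle that uses a value written at write_time.

    ASSUMPTION: auto-reload values are latched at the next counter overflow.
    """
    return write_time // period + 1


def writes_split(t_first: int, t_second: int, period: int) -> bool:
    """True if an overflow falls between the two writes, so the two
    registers become active in different PWM cycles."""
    return effective_cycle(t_first, period) != effective_cycle(t_second, period)


def safe_to_write(counter: int, period: int, write_ticks: int) -> bool:
    """Guard: start the write sequence only if it completes before overflow."""
    return counter + write_ticks < period


if __name__ == "__main__":
    period = 256       # hypothetical counter period in ticks
    write_ticks = 10   # hypothetical delay between the first and second write

    # Writes at ticks 250 and 260 straddle the overflow at 256.
    print(writes_split(250, 260, period))            # True
    # The guard rejects starting a write at counter value 250.
    print(safe_to_write(250, period, write_ticks))   # False
```

In this model, whenever `safe_to_write` returns True the two writes land in the same cycle. The cost is that an update arriving in the last `write_ticks` ticks of a cycle has to wait for the next one.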

tobbeanton commented 8 months ago

To me, checking for the PCA counter to expire sounds like the right way to do it. Since the auto-reload registers are used, there is already a "performance" hit, since it can take almost a full cycle before the PCA registers are updated. And I don't think there is any other safe way to do it.

It sounds a bit challenging to implement, but we are happy to test it if you know how to do it @damosvil

tobbeanton commented 8 months ago

Another thing I was thinking about is why we are not able to replicate it in BLHeli_S 16.7. It could just be a coincidence that we have not managed to catch it, but we have tested for ~20 min, and for Bluejay it usually happens within 1 min. Could this be related to interrupts rather than the auto-reload registers?
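
As a rough plausibility check on the coincidence question, the following Python sketch estimates how likely a clean 20-minute run would be if BLHeli_S had the same failure rate as Bluejay. It rests on an assumption not stated in the thread: that occurrences behave like a Poisson process, with "usually within 1 min" read as a mean of about one event per minute. The function name is hypothetical.

```python
import math


def prob_no_event(rate_per_min: float, minutes: float) -> float:
    """Probability of zero events in the given time for a Poisson process."""
    return math.exp(-rate_per_min * minutes)


if __name__ == "__main__":
    # ASSUMPTION: about one event per minute, as observed with Bluejay.
    print(prob_no_event(1.0, 20.0))
    # A deliberately pessimistic rate of one event per five minutes.
    print(prob_no_event(0.2, 20.0))
```

Under these assumptions the first probability is on the order of 1e-9, and even the pessimistic case stays below 2%. So a 20-minute run without an event is unlikely to be pure coincidence if the rates were really the same.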

damosvil commented 8 months ago

To me, checking for the PCA counter to expire sounds like the right way to do it. Since the auto-reload registers are used, there is already a "performance" hit, since it can take almost a full cycle before the PCA registers are updated. And I don't think there is any other safe way to do it.

It sounds a bit challenging to implement, but we are happy to test it if you know how to do it @damosvil

OK, I will try a modification and let you know.

damosvil commented 8 months ago

Another thing I was thinking about is why we are not able to replicate it in BLHeli_S 16.7. It could just be a coincidence that we have not managed to catch it, but we have tested for ~20 min, and for Bluejay it usually happens within 1 min. Could this be related to interrupts rather than the auto-reload registers?

I have checked the BLHeli_S code again and I think they do something to avoid the issue in the pca_int ISR: https://github.com/bitdump/BLHeli/blob/ef8c1a0b644c228f07a82f3d25e6d581492eaacf/BLHeli_S%20SiLabs/BLHeli_S.asm#L1567

But I think that ISRs add additional latency, so it would be better not to update the PWM registers if the PCA counter is about to expire, and to reorder the PCA register writes.

damosvil commented 8 months ago

I have been checking the EFM8BB2 reference manual, and it seems it may not be so easy to control when the Xc and Xp registers are loaded:

[image]

I will check the BLHeli_S solution again.

damosvil commented 8 months ago

I think a valid solution would be, when a new DShot frame arrives, to store the power and damp values and enable the PCA interrupt (generated when the PCA counter is 0). In the interrupt we should set Xp and then Xc, so that both auto-reload values are loaded in the same cycle when the rising edges happen, and then disable the interrupt again. I will try to code this solution next week.
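
The control flow of this idea can be sketched as a small state machine. This is an illustrative Python model only, not the planned firmware: the class `DeferredPwmUpdate` and its method names are hypothetical, and it assumes the interrupt handler finishes both writes before the next reload point.

```python
from typing import Optional, Tuple


class DeferredPwmUpdate:
    """Stores new PWM values and applies them together at the counter wrap."""

    def __init__(self) -> None:
        self.power = 0    # value currently applied to Xp
        self.damp = 0     # value currently applied to Xc
        self.irq_enabled = False
        self._pending: Optional[Tuple[int, int]] = None

    def on_dshot_frame(self, power: int, damp: int) -> None:
        # Only remember the new values; do not touch the PWM registers here.
        self._pending = (power, damp)
        self.irq_enabled = True

    def on_pca_overflow(self) -> None:
        # Models the PCA interrupt that fires when the counter wraps to 0.
        if not self.irq_enabled or self._pending is None:
            return
        # Apply both values back to back, Xp first and then Xc.
        self.power, self.damp = self._pending
        self._pending = None
        self.irq_enabled = False   # one-shot: disable until the next frame


if __name__ == "__main__":
    pwm = DeferredPwmUpdate()
    pwm.on_dshot_frame(100, 120)
    print(pwm.power, pwm.damp)   # still the old values
    pwm.on_pca_overflow()
    print(pwm.power, pwm.damp)   # both updated at the same point
```

If two frames arrive before the counter wraps, this model keeps only the latest values. That matches the intent of always applying a consistent pair.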

tobbeanton commented 8 months ago

I think a valid solution would be, when a new DShot frame arrives, to store the power and damp values and enable the PCA interrupt (generated when the PCA counter is 0). In the interrupt we should set Xp and then Xc, so that both auto-reload values are loaded in the same cycle when the rising edges happen, and then disable the interrupt again. I will try to code this solution next week.

Sounds good, I think this is a common way to handle it.

tobbeanton commented 7 months ago

Just checking how things are going. Is there anything we can do to help (though doing the actual fix might be above our skill level)?

stylesuxx commented 7 months ago

Hey, just a heads-up: we have not forgotten you. Unfortunately, we are currently a bit swamped with private life/work, so things will take some time.

tobbeanton commented 7 months ago

Thanks for letting us know! It might not be the easiest fix either! Meanwhile we might try this:

Another option, though only a mitigation, would be to first update the low parts of the power and damp registers and then update the high parts together, so the issue would probably happen about 50% less often, without hitting performance.

This we could probably manage ourselves.

stylesuxx commented 7 months ago

@tobbeanton thank you, please let us know how it goes - if it works, we would appreciate a PR.

hyp0dermik-code commented 6 months ago

What I don't understand is why you cannot reproduce the same issue in Blheli_S, because both codebases do the same.

Let us try longer and perhaps we can replicate it in Blheli_S too.

could you check if you can also reproduce this issue with Bluejay 0.16

Yes, we could reproduce it, and this time it happened in the middle of braking, not in the change from braking to accelerating.

[image]

could you alternate one of the led GPIOs before updating the PCA registers and also scope it? - If you need a customized fw to do this let us know.

We will give it a try

This capture was from 0.16, correct?

Can you confirm what version(s) the previous 2 captures were? https://github.com/bird-sanctuary/bluejay/issues/187#issue-2181610116 https://github.com/bird-sanctuary/bluejay/issues/187#issuecomment-1991762707

Is the timing of the bug exactly the same for each occurrence on the same version, or is there some variation? How many samples?

What variation did you see between 0.19.2 and 0.21RC (0.20.1?) Are you able to provide some instances from the missing version please?

tobbeanton commented 6 months ago

Is there a fix in 0.21RC? Else the bug has more or less been fully identified...?

hyp0dermik-code commented 6 months ago

Is there a fix in 0.21RC? Else the bug has more or less been fully identified...?

No, I was more curious as to what the difference in timing was between 0.21 and 0.19.2 (if any) at the same PWM frequency.

tobbeanton commented 6 months ago

I think the bug has been there for a long time, since the PCA switching code was changed.

alinneacsu commented 3 weeks ago

Hello! Was the bug fixed in the latest version?

stylesuxx commented 3 weeks ago

@alinneacsu No, otherwise we would have closed the issue and mentioned it in the release notes. Are you experiencing the same issues?

alinneacsu commented 3 weeks ago

@stylesuxx

I think the issue could be similar: my setup includes an FC based on the H743 running ArduCopter (bidirectional DShot enabled) and a 4-in-1 ESC running the latest Bluejay version. Rarely (so far about 1 in 50 flights), one of the motors simply turns off in flight, but based on the logs it does not look like demag/desync. I tried both the 24 kHz and 48 kHz versions, with no difference.

I have not identified any way to replicate the issue in a controlled environment.

I have many logs indicating this situation. I'm attaching a simple screenshot for now; the constant RPM at the end of the log indicates the moment when the motor stopped:

[image: Screenshot 2024-11-01 at 23 28 44]

I'm also logging the following EDT fields:
.SS -> EDT Stress Level (120 constantly)
.SA -> EDT Status (193, rarely goes to 1)

stylesuxx commented 3 weeks ago

Please attach full logs, so people can look through them.

alinneacsu commented 3 weeks ago

Check this out: https://drive.google.com/drive/folders/1dsq6q2YpsevknT9BYpLhBDjFDhUS9mEA

stylesuxx commented 3 weeks ago

@alinneacsu can you provide some timestamps of interest for those logs, please?

Also, what else have you done to troubleshoot this issue? Does it always happen with the same motor? Have you tried changing the timings?

The initial issue seems to be reproducible pretty consistently, at least on this one setup. So I am not sure if we are looking at the same issue here.