betaflight / betaflight

Open Source Flight Controller Firmware
GNU General Public License v3.0
8.06k stars 2.87k forks source link

F4 ESC passthrough is flaky. #481

Closed tannewt closed 7 years ago

tannewt commented 7 years ago

I'm running a custom build of the development branch on my own F4FC and passthrough is flaky. It reads the ESCs (Aikon SEMF 30A) ok but the flash usually dies midway. Its possible its a hardware issue but I doubt it.

My next step is to use a similarly designed F3FC to test passthrough with the same base code (awesome!). Pass through worked on that hardware with older versions of Betaflight.

tannewt commented 7 years ago

Passthrough works fine on the F3FC with the new code. I'll take a look into the F4.

tannewt commented 7 years ago

@4712 any guesses or pointers? BLHeliSuite log is:

            cmd_DeviceWrite OK
                >$[2F 3B 15 00 00 FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF 4C 24]
                    263 Bytes = "/;
                    ESC = cmd_Local_Escape
                    CMD = cmd_DeviceWrite
                    ADRESS = 5376[$1500]
                    PARAM_LEN = 0[$00]
                    PARAM = $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    CRC = 19492[$4C24]
                    Time elapsed (ms): 1
                <$[2E 3B 15 00 01 00 00 A0 6D]
                    9 Bytes = ".;
                    ESC = cmd_Remote_Escape
                    CMD = cmd_DeviceWrite
                    ADRESS = 5376[$1500]
                    PARAM_LEN = 1[$01]
                    PARAM = $[00]
                    ACK = OK
                    CRC = 41069[$A06D]
                    Time elapsed (ms): 149
            cmd_DeviceWrite COM Timeout Error
                >$[2F 3B 16 00 00 FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                $[FF FF FF FF FF 54 0A]
                    263 Bytes = "/;
                    ESC = cmd_Local_Escape
                    CMD = cmd_DeviceWrite
                    ADRESS = 5632[$1600]
                    PARAM_LEN = 0[$00]
                    PARAM = $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    $[FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF]
                    CRC = 21514[$540A]
                    Time elapsed (ms): 1
                <[]
                    0 Bytes = ""
                    ESC = UNKNOWN
                    CMD = UNKNOWN
                    ADRESS = 0[$0000]
                    PARAM_LEN = 0[$00]
                    PARAM = []
                    ACK = COM Timeout Error
                    CRC = 0[$0000]
                    Time elapsed (ms): 5647
            Flash writing Block failed
        Progress 12633 of 23273
        Send_cmd_DeviceRead_Gen[Flash]: (0 Bytes) FAILED
            cmd_DeviceRead CMD_ID match failed OK
                >$[2F 3A 1A 00 01 70 84 E1]
                    8 Bytes = "/:
                    ESC = cmd_Local_Escape
                    CMD = cmd_DeviceRead
                    ADRESS = 6656[$1A00]
                    PARAM_LEN = 1[$01]
                    PARAM = $[70]
                    CRC = 34017[$84E1]
                    Time elapsed (ms): 0
                <$[2E 3B 16 00 01 00 00 4E BF]
                    9 Bytes = ".;
                    ESC = cmd_Remote_Escape
                    CMD = cmd_DeviceWrite
                    ADRESS = 5632[$1600]
                    PARAM_LEN = 1[$01]
                    PARAM = $[00]
                    ACK = OK
                    CRC = 20159[$4EBF]
                    Time elapsed (ms): 148
        Send_cmd_DeviceRead_Gen[Flash]: (112 Bytes) OK
            cmd_DeviceRead OK
                >$[2F 3A 1A 00 01 70 84 E1]
                    8 Bytes = "/:
                    ESC = cmd_Local_Escape
                    CMD = cmd_DeviceRead
                    ADRESS = 6656[$1A00]
                    PARAM_LEN = 1[$01]
                    PARAM = $[70]
                    CRC = 34017[$84E1]
                    Time elapsed (ms): 0
                <$[2E 3A 1A 00 70 10 01 20 FF FF FF FF FF FF 09 FF]
                    120 Bytes = ".:
                    ESC = cmd_Remote_Escape
                    CMD = cmd_DeviceRead
                    ADRESS = 6656[$1A00]
                    PARAM_LEN = 112[$70]
                    PARAM = $[10 01 20 FF FF FF FF FF FF 09 FF 02 FF 55 AA 01]
                    ACK = OK
                    CRC = 59862[$E9D6]
                    Time elapsed (ms): 56
    Flash ESC#2 FAILED
        New target ESC#2:
            New target ESC#2:
        cmd_ProtocolGetVersion silBLB:General Error
            >$[2F 31 00 00 01 00 65 85]
                8 Bytes = "/1
                ESC = cmd_Local_Escape
                CMD = cmd_ProtocolGetVersion
                ADRESS = 0[$0000]
                PARAM_LEN = 1[$01]
                PARAM = $[00]
                CRC = 25989[$6585]
                Time elapsed (ms): 1
            <$[2E 3A 1A 00 01 00 0F 71 DB]
                9 Bytes = ".:
                ESC = cmd_Remote_Escape
                CMD = cmd_DeviceRead
                ADRESS = 6656[$1A00]
                PARAM_LEN = 1[$01]
                PARAM = $[00]
                ACK = silBLB:General Error
                CRC = 29147[$71DB]
                Time elapsed (ms): 0
        Disconnect FlightCtrl: OK
            MSP 4Way deactivate FAILED
                elapsed time ms:0
            Close COM7 OK
    Flash ESC#3 FAILED
        New target ESC#3:
            New target ESC#3:
    Flash ESC#4 FAILED
        New target ESC#4:
            New target ESC#4:
tannewt commented 7 years ago

One last thing. If I type into the CLI reasonably fast it seems I can hang it. My guess is a fault somewhere in the VCP code that pass through tickles pretty easily.

4712 commented 7 years ago

Can I try with cc3d Revo?

blckmn commented 7 years ago

@tannewt the vcp4 code currently passes one byte per packet - seriously inefficient - so it needs to be updated to use a full packet.

@4712 yes - there is a revo target now, though I can't test it yet as I don't have one.

4712betaflight commented 7 years ago

Can confirm the issue with the REVO.

tannewt commented 7 years ago

Ok, I played around with the VCP code but am still having the reliability issue. I'll polish it up this week but here is an intermediate commit (d0a21b0d5f6c0d808c1791c71088e3428107dd9b) in case someone wants to pick it up overnight (I'm in Seattle, WA USA).

blckmn commented 7 years ago

@4712betaflight do you have any thoughts on this one? If not I can look at the difference between VCP and VCP4 code, and see if VCP4 can be moved to using either similar or same code.

4712 commented 7 years ago

I would think it is a VCP issue.

To be sure, one could cross test using main port.

blckmn commented 7 years ago

@4712 totally agree its almost certainly a VCP issue :) it works fine on VCP (F1 and F3 targets), just not VCP4 (F4 targets).

blckmn commented 7 years ago

I have a temp fix for this (I don't like it - but will need to spend some more time on it). Seems the VCP on F4 floods the PC with packets. Similar problem as fixed in the CLI.

https://github.com/betaflight/betaflight/pull/560

I've also moved it to the new IO.

blckmn commented 7 years ago

@tannewt the temp fix is now merged, can you please let me know if it solves it for you?

tannewt commented 7 years ago

@blckmn Unfortunately it still doesn't work for me. Reading and writing settings are OK but flashing new firmware hangs the board with the progress bar around 30%. I'm using Windows 10, BLHeliSuite 16114600 and Aikon SEMF ESCs.

Its the cmd_DeviceWrite_Gen[Flash] stages thats 5632 bytes that fails with "COM Timeout Error".

blckmn commented 7 years ago

@tannewt could you increase the delay to 100 and see if that works?

tannewt commented 7 years ago

Yup, will do. My email is scott@chickadee.tech if thats easier to coordinate over (its hosted by Google so hangouts works too.)

On Thu, Jun 23, 2016 at 7:04 PM J Blackman notifications@github.com wrote:

@tannewt https://github.com/tannewt could you increase the delay to 100 and see if that works?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/betaflight/betaflight/issues/481#issuecomment-228238086, or mute the thread https://github.com/notifications/unsubscribe/AADNqcOHyZNn0Fpk1LlBmXBYOY-v5US1ks5qOzs4gaJpZM4IztRH .

tannewt commented 7 years ago

Still no, hangs in the same spot. Could be something with my setup.

blckmn commented 7 years ago

If it is hanging in the same spot every time then I would suspect a problem with an ESC perhaps. Can you try flashing them individually?

I did notice that occasionally (not very often) the BlHeliSuite would have a not responding on the progress dialog, but the led indicator on the FC was still flashing showing data xferring. After it finishes flashing the dialog box came back to life and all was good.

When it was flaky for me, it was stopping in random spots.

tannewt commented 7 years ago

I had a single one selected when it didn't work. I'll try later tonight with my F3 board to make sure the ESCs all work. I think they all have at one point or another.

On Thu, Jun 23, 2016 at 7:57 PM J Blackman notifications@github.com wrote:

If it is hanging in the same spot every time then I would suspect a problem with an ESC perhaps. Can you try flashing them individually?

I did notice that occasionally (not very often) the BlHeliSuite would have a not responding on the progress dialog, but the led indicator on the FC was still flashing showing data xferring. After it finishes flashing the dialog box came back to life and all was good.

When it was flaky for me, it was stopping in random spots.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/betaflight/betaflight/issues/481#issuecomment-228246061, or mute the thread https://github.com/notifications/unsubscribe/AADNqV2K7oglKPluTe-l-92uTQSx3EJPks5qO0eogaJpZM4IztRH .

tannewt commented 7 years ago

All of the same ESCs flashed just fine through an F3 FC.

tannewt commented 7 years ago

Well, mine is more random now. Still haven't pinpointed one thing. I believe I fixed one bug where the rx buffer size of 256 is being encoded into a uint8_t as 0 and hanging a read.

It may also be that we're ACKing faster than the host is listening for the ACK. That results in us continuing through the loop thinking we've ACKed and trying to read.

@blckmn the delay you added is at the end of the loop when it should be between the read and the write of the ACK I think. I added the delay(100) there though and its still flaky (but seems to succeed more.)

@blckmn and @4712 What ESCs did you test with? Is it possible the BLHeli_S ESCs flash faster and therefore reduce the time to ACK?

@4712 could you provide a suite that logs the time between the serial send and the serial receive on your side? I wonder how long that gap is. I think between faster ESCs and faster F4 we may be acking too quickly.

4712 commented 7 years ago

@tannewt I tested BLHeli and BLHeli_S and SimonK bootloader as well.. The response of the ESCs is always fast. If there is any lag in the FC, the communication will fail. The timing on the BLHeliSuite side is not that important. There is always the io buffer of the system involved. Also the failure mostly occur while writing to the flash not while reading (byte flow to the ESC).

Today or tomorrow, I will check with a logic analyzer on the ESC side.

tannewt commented 7 years ago

@4712 Ah OK. That must not be it then. Why would delays in the MCU loop help then? The timeout for the ACK read is five seconds, right? That seems like an awful lot of lag from the MCU.

I submitted #585 to remove the possibility of buffering ourselves into an infinite loop.

4712 commented 7 years ago

@tannewt I guess #585 could be one part of the solution. I will check...

tannewt commented 7 years ago

I hope it helps. The f3 would suffer from it too though. I figured it couldn't hurt either way.

You were right that it wasn't the timing. I confirmed today that the acknowledge packet never reached the windows box with a sniffer.

I'm still stumped as to why it's not working. It seems it hangs in the read byte calls. My next line of investigation was going to be interrupt safety of the receive length variable.

However, I'm headed to Houston on Monday to see my boards manufactured and then Michigan for a wedding. So I'll be quad less until July 6th. I'll follow along but won't be able to test anything there. I hope you all figure it out! On Sun, Jun 26, 2016 at 1:28 AM 4712 notifications@github.com wrote:

@tannewt https://github.com/tannewt I guess #585 https://github.com/betaflight/betaflight/pull/585 could be one part of the solution. I will check...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/betaflight/betaflight/issues/481#issuecomment-228590286, or mute the thread https://github.com/notifications/unsubscribe/AADNqXx0cqMPpegGVkfpMx6wYsy01NqKks5qPjgRgaJpZM4IztRH .

4712betaflight commented 7 years ago

@tannewt Bingo! #585 together with #594(#595) seems to do the job for me here with the REVO. Why didn't we have issues on F3 targets?... or did we from time to time?

4712betaflight commented 7 years ago

@blckmn Why do we need `#ifdef STM32F4 delay(50);

endif`?

It does not work without for me, somethings else seems to be still wrong with the VCP4 IO buffers...

tannewt commented 7 years ago

No way! I can't believe I missed that! Great find! I'll test it shortly on my stuff. On Sun, Jun 26, 2016 at 5:14 AM 4712betaflight notifications@github.com wrote:

@blckmn https://github.com/blckmn Why do we need

ifdef STM32F4

delay(50);

endif?

It does not work without for me, somethings else seems to be still wrong with the VCP4 IO buffers...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/betaflight/betaflight/issues/481#issuecomment-228598382, or mute the thread https://github.com/notifications/unsubscribe/AADNqVsBZrVpQkF0sK7nI76I7Uzpc61yks5qPm0MgaJpZM4IztRH .

4712 commented 7 years ago

But, I'm afraid it is not the whole solution...

tannewt commented 7 years ago

Oh no! I hope you know the rest. :-) On Sun, Jun 26, 2016 at 9:00 AM 4712 notifications@github.com wrote:

But, I'm afraid it is not the whole solution...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/betaflight/betaflight/issues/481#issuecomment-228608234, or mute the thread https://github.com/notifications/unsubscribe/AADNqQlaykm4Iq6usfIT1-j9XMvhGS2cks5qPqIegaJpZM4IztRH .

blckmn commented 7 years ago

@4712betaflight yes - i needed it to be there for it to work for me also. There are delays in the serial_cli also to prevent dropped packets when sending large quantities of info. It seems that these packets are dropped by the configurator, as they are transmitted by the FC. The configurator needs fixing to process inbound packets faster (the only speed limit is USB - as it is VCP).

Could it be a similar issue with BLHeliSuite?

4712 commented 7 years ago

@blckmn I don't think so... It is a handshaking controlled by the host. So, 263 bytes per package max. And BLHeliSuite timeouts with empty package.

blckmn commented 7 years ago

Hmmm.... USB packet size is restricted to 64 for USB full speed - 12mb/s (63 usable as one byte is reserved).

Are you saying the return packet from ESC/FC to HOST is 263 bytes? The VCP4 has a buffer of only 20 bytes, before a flush - so 13-14 packets for the 263 - this could be easily increased. It doesn't explain why the delay allows it to work though - unless packet loss is occurring - but where if not the host???.

The configurator needs the FC to send slower, otherwise it gets inundated with packets - and they get dropped. Which is why the delay works here (effectively slowing the FC down - an arbitrary max baudrate if you will).

4712 commented 7 years ago

Sorry, 264 bytes max exactly... And it does not work reliable with the delay.

tannewt commented 7 years ago

@4712betaflight We don't see this on the F3 because the lower level buffer is limited to 64 bytes. (https://github.com/betaflight/betaflight/blob/betaflight/src/main/vcp/hw_config.c#L54) The F4 one is 1024 bytes.

blckmn commented 7 years ago

@tannewt I'll test it by sending 64 bytes at a time max tonight and see if it works ok without the delay. Without the delay for me it randomly dies, with the delay it works for me - so perhaps if I can get it working without the delay we are closer.

tannewt commented 7 years ago

Was it flaky for you even after the fix from @4712betaflight? My code (https://github.com/chickadee-tech/betaflight) also has some delays in it but I'll take them out once I'm back from my trip. I flashed four ESCs today without a hiccup.

The delay calls may actually be useful because its reducing the likelihood of concurrency issues. The receiveLength variable is updated in complex ways rather than ++ and -- that could be more susceptible to concurrency problems.

4712 commented 7 years ago

@tannewt Yes it stays flaky, mostly it stops when reading back to verify. @blckmn I reduced the read/write buffer to 16 and it failed while writing.

Debugging with ST-Linkv2 suspended showed it hanging at GPIO_TypeDef* IO_GPIO(IO_t io) { ioRec_t *ioRec = IO_Rec(io); **return ioRec->gpio;** } in io.c.

Screenshot shows log in BLHeliSuite..

screenshot_83 EDIT: this is without the delay...

tannewt commented 7 years ago

Thats the new IO code. Do you have a full backtrace? I'm surprised that it would fully hang there.

4712 commented 7 years ago

No idea how to get the backtrace...

tannewt commented 7 years ago

Are you in gdb? Its "bt".

4712 commented 7 years ago

I use ChibiStudio and gdb... where to execute "bt"?

tannewt commented 7 years ago

Ah, I use gdb from the command line so "bt" short for "backtrace" in the gdb prompt. The backtrace should show all the functions that called it.

4712 commented 7 years ago

Ok found it, need to type simply in the console window... will try

tannewt commented 7 years ago

I've gotta go to sleep. Its late here and I'm getting up early. Good luck @4712! Thanks for looking into it!

4712 commented 7 years ago

542-interpreter-exec console bt ~"#0 0x08012456 in IO_GPIO (io=0x109920) at ./src/main/drivers/io.c:64\n"

0 0x08012456 in IO_GPIO (io=0x109920) at ./src/main/drivers/io.c:64

~"#1 0x00111108 in ?? ()\n"

1 0x00111108 in ?? ()

~"Backtrace stopped: previous frame identical to this frame (corrupt stack?)\n" Backtrace stopped: previous frame identical to this frame (corrupt stack?) 542^done

4712 commented 7 years ago

@tannewt good night!

blckmn commented 7 years ago

I'll be the second shift once I'm home :) I'm keen to see this working now that I'm soldering allESCs to the FC ;)

4712 commented 7 years ago

Can we turn off any other task in scheduler for testing?

blckmn commented 7 years ago

Hey @4712 we definitely have a buffer overrun or similar crash. seems the IO_t is some bogus value... but it may just be a symptom not the cause.

blckmn commented 7 years ago

I've confirmed that nothing else appears to be running whilst in the 4way loop.