candle-usb / candleLight_fw

gs_usb compatible firmware for candleLight, cantact and canable
Other
632 stars 282 forks source link

Canable device hangs after desktop application crash (= stop without disonnecting) #165

Open anicolle opened 1 year ago

anicolle commented 1 year ago

Hi all, First, thank you for this great repository. We are facing a weird problem with our Canable-based tools using candleLight_fw. We have built a small desktop application using python canopen that connect to our custom product over can with a Canable. Everything works fine except if our application is not terminated properly (killing it with linux "kill" command or stopping the vscode python debugger) thus bypassing the canopen disconnect step. In this case the Canable seems to still be running as it is still visible on USB but no CAN messages seems to reach our device. Canopen library debug shows : "Transfer aborted by client with code 0x05040000" Rebooting our Canable (by disconnecting / reconnecting it from USB) fix the problem. Any idea what could be the cause of this ? Any workaround we could use to avoid needing to disconnect / reconnect manually ? Thanks,

anicolle commented 1 year ago

Forgot to say that this happens with firmware flashed from canable website updater (commit 68df7d5 for canable v1, commit ba6b1dd for canable V2). But also with locally build firmware with latest repo commit (commit ac71cb412).

marckleinebudde commented 1 year ago
marckleinebudde commented 1 year ago

Everything works fine except if our application is not terminated properly (killing it with linux "kill" command or stopping the vscode python debugger) thus bypassing the canopen disconnect step. In this case the Canable seems to still be running as it is still visible on USB but no CAN messages seems to reach our device.

You mean you first start the app, kill the app, and restart the app, then it's not receiving data?

anicolle commented 1 year ago
  • What's your operating system?

Linux (Ubuntu)

  • Can you share an python example to reproduce the problem?

I created a simple example (cf below) that connect, read an sdo and stop without disconnecting. Only the first execution works, after that the canopen library raise a "No SDO response received" exception at each execution. Disconnecting / reconnecting allow another execution to suceed.

import canopen
import usb.core

eds_path = "youredsfilepath.eds"
node_id = 1
sdo_index = 0x1018
sdo_subindex = 1
PRODUCT_ID_CANDLE_LIGHT: int = 0x606F
VENDOR_ID_CANDLE_LIGHT: int = 0x1D50

if __name__ == "__main__":
    network = canopen.Network()
    node = canopen.RemoteNode(node_id, eds_path)

    dev = usb.core.find(
        idVendor=VENDOR_ID_CANDLE_LIGHT, idProduct=PRODUCT_ID_CANDLE_LIGHT
    )

    network.connect(
        bustype="gs_usb",
        channel=dev.product,
        bus=dev.bus,
        address=dev.address,
        bitrate=1000000,
    )

    network.add_node(node)

    print(node.sdo[sdo_index][sdo_subindex].raw)
  • Can you create and share a USB dump with wireshark?

I can't but if needed, some of my colleagues have a setup that could do that

anicolle commented 1 year ago

Everything works fine except if our application is not terminated properly (killing it with linux "kill" command or stopping the vscode python debugger) thus bypassing the canopen disconnect step. In this case the Canable seems to still be running as it is still visible on USB but no CAN messages seems to reach our device.

You mean you first start the app, kill the app, and restart the app, then it's not receiving data?

Exactly !

marckleinebudde commented 1 year ago
  • Can you share an python example to reproduce the problem?

I created a simple example (cf below) that connect, read an sdo and stop without disconnecting. Only the first execution works, after that the canopen library raise a "No SDO response received" exception at each execution. Disconnecting / reconnecting allow another execution to suceed.

Thanks. I don't have a canopen device here, can you create the matching other side, too?

anicolle commented 1 year ago

Thanks for your help. Sadly the python canopen package doesn't support slave mode. I am testing a fork that add this feature but it is a work in progress and I am not sure I will suceed to make it work.

marckleinebudde commented 1 year ago

If you are on Linux, using the socketcan interface not the USB one might be an option....

fenugrec commented 1 year ago

with socketan you can create a 'vcan0' interface, and use https://github.com/CANopenNode/CANopenLinux to create a node that listens on that interface. I only do this about once every 2 years, don't remember the finer details

marckleinebudde commented 1 year ago

My working theory is there's a problem with re-init of python-can when using the gs-usb interface. So a vcan0 wouldn't help here.

fenugrec commented 1 year ago

So a vcan0 wouldn't help here.

No indeed, that was just a suggestion to create a test setup without any actual canopen device

anicolle commented 1 year ago

Thanks to both of you. I tried to isolate the problem from canopen by running the script below (inspired from Canable Getting Started page) I can run it three times without problems, but it fails at 4th attempt image

import can
import usb.core

# Candlelight firmware on Linux
#bus = can.interface.Bus(bustype='socketcan', channel='can0', bitrate=500000)
PRODUCT_ID_CANDLE_LIGHT: int = 0x606F
VENDOR_ID_CANDLE_LIGHT: int = 0x1D50

dev = usb.core.find(
            idVendor=VENDOR_ID_CANDLE_LIGHT, idProduct=PRODUCT_ID_CANDLE_LIGHT
        )

# # Stock slcan firmware on Linux
bus = can.Bus(bustype="gs_usb",
            channel=dev.product,
            bus=dev.bus,
            address=dev.address,
            bitrate=1000000)

msg = can.Message(arbitration_id=0xc0ffee,
                  data=[0, 25, 0, 1, 3, 1, 4, 1],
                  is_extended_id=True)

try:
    bus.send(msg)
    print("Message sent on {}".format(bus.channel_info))
except can.CanError:
    print("Message NOT sent")
marckleinebudde commented 1 year ago

I've setup your new test script, and it's running in a while ./bug.py; do sleep 1; done loop. And while writing this it failed, not after 4, but ~40 loops.

fenugrec commented 1 year ago

I've setup your new test script, and it's running in a while ./bug.py; do sleep 1; done loop. And while writing this it failed, not after 4, but ~40 loops.

Interesting, before your comment I was thinking it looked like when you send 3 frames that fill up the hardware FIFO but there is no other device sending an ACK.

Tempted to blame python-can , but the errors you got may point to the device getting into a broken state. Thanks for confirming you are testing the latest build from master.

I would +1 on marc's request for a wireshark trace - wonder what's going on here

marckleinebudde commented 1 year ago

Every run of the test script starts with a GET DESCRIPTOR Request STRING:

Frame 2179: 64 bytes on wire (512 bits), 64 bytes captured (512 bits) on interface usbmon1, id 0
USB URB
    [Source: host]
    [Destination: 1.23.0]
    URB id: 0xffff9bba09d09180
    URB type: URB_SUBMIT ('S')
    URB transfer type: URB_CONTROL (0x02)
    Endpoint: 0x80, Direction: IN
    Device: 23
    URB bus id: 1
    Device setup request: relevant ('\0')
    Data: not present ('<')
    URB sec: 1683055257
    URB usec: 690974
    URB status: Operation now in progress (-EINPROGRESS) (-115)
    URB length [bytes]: 254
    Data length [bytes]: 0
    [Response in: 2180]
    Interval: 0
    Start frame: 0
    Copy of Transfer Flags: 0x00000200, Dir IN
    Number of ISO descriptors: 0
Setup Data
    bmRequestType: 0x80
    bRequest: GET DESCRIPTOR (6)
    Descriptor Index: 0x00
    bDescriptorType: STRING (0x03)
    Language Id: no language specified (0x0000)
    wLength: 254

and is answered with a GET DESCRIPTOR Response STRING within ~1ms:

Frame 2180: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface usbmon1, id 0
USB URB
    [Source: 1.23.0]
    [Destination: host]
    URB id: 0xffff9bba09d09180
    URB type: URB_COMPLETE ('C')
    URB transfer type: URB_CONTROL (0x02)
    Endpoint: 0x80, Direction: IN
    Device: 23
    URB bus id: 1
    Device setup request: not relevant ('-')
    Data: present ('\0')
    URB sec: 1683055257
    URB usec: 691097
    URB status: Success (0)
    URB length [bytes]: 4
    Data length [bytes]: 4
    [Request in: 2179]
    [Time from request: 0.000123000 seconds]
    Unused Setup Header
    Interval: 0
    Start frame: 0
    Copy of Transfer Flags: 0x00000200, Dir IN
    Number of ISO descriptors: 0
STRING DESCRIPTOR
    bLength: 4
    bDescriptorType: 0x03 (STRING)
    wLANGID: English (United States) (0x0409)

in the bad case, the response times out after ~1s:

Frame 2214: 64 bytes on wire (512 bits), 64 bytes captured (512 bits) on interface usbmon1, id 0
USB URB
    [Source: 1.23.0]
    [Destination: host]
    URB id: 0xffff9bb90d1789c0
    URB type: URB_COMPLETE ('C')
    URB transfer type: URB_CONTROL (0x02)
    Endpoint: 0x80, Direction: IN
    Device: 23
    URB bus id: 1
    Device setup request: not relevant ('-')
    Data: present ('\0')
    URB sec: 1683055262
    URB usec: 253151
    URB status: No such file or directory (-ENOENT) (-2)
    URB length [bytes]: 0
    Data length [bytes]: 0
    [Request in: 2213]
    [Time from request: 1.000194000 seconds]
    Unused Setup Header
    Interval: 0
    Start frame: 0
    Copy of Transfer Flags: 0x00000200, Dir IN
    Number of ISO descriptors: 0

Here's a wireshark capture. I'm not sure who triggers the first 3 request/response cycles, it may be wireshark. In the beginning I played with different sleeps, after 34s the test runs with a 3s sleep between the runs. The capture ends with the timed out GET DESCRIPTOR Response STRING.

python-bad.zip

marckleinebudde commented 1 year ago

Here a cleaner capture, again with 3s sleep between the test program calls. The candlelight is running current master firmware. It fails with the same problem.

python-bad-mainline.zip

anicolle commented 1 year ago

I did also notice that running lsusb -vin "working" (left ) and "failed"(right) states doesn't return the same results.

image

marckleinebudde commented 1 year ago

I can reproduce the lsusb problem here, too.

fenugrec commented 1 year ago

dumb question, what dependency am I missing ?

can.exceptions.CanInterfaceNotImplementedError: Cannot import module can.interfaces.gs_usb for CAN interface 'gs_usb': No module named 'gs_usb

Switching instead to the commented line bus = can.interface.Bus(bustype='socketcan', channel='can0', bitrate=1000000) works with no issues, so whatever python module provides 'gs_usb' is doing something unusual

marckleinebudde commented 1 year ago

pip install gs_usb

marckleinebudde commented 1 year ago

Switching instead to the commented line bus = can.interface.Bus(bustype='socketcan', channel='can0', bitrate=1000000) works with no issues

ACK - I suggested to use socketcan as a workaround unter Linux, too: https://github.com/candle-usb/candleLight_fw/issues/165#issuecomment-1531411613

so whatever python module provides 'gs_usb' is doing something unusual

The gs_usb user space driver is requesting GET DESCRIPTOR Request STRING on every startup, so maybe some resource exhaustion on the firmware side?

fenugrec commented 1 year ago

gs_usb

thanks. I had only looked for it in my distro's packages.

so maybe some resource exhaustion on the firmware side?

Skimming throught the ST USB stack, I don't see how that could be possible ? there's no dynamic allocation, and the string descriptor requests are handled pretty much like all other control requests. Their handling of the "len" argument is a bit dubious at times (casting to/from *uint16) but I imagine we would be seeing more problems if it was flawed.

cvaliere commented 1 year ago

hi guys, did you crack the case eventually? :)

marckleinebudde commented 1 year ago

No, I currently don't have time to chase that bug.

fenugrec commented 1 year ago

No, I currently don't have time to chase that bug.

Same here unfortunately.

the next steps would be to hook up a debugger, reproduce the problem, then examine the device state (USB periph registers, etc), and possibly set breakpoints on the get_descriptor functions.