foostan / crkbd

Corne keyboard, a split keyboard with 3x6 column staggered keys and 3 thumb keys.

usb connection issues #265

Open xkonni opened 1 month ago

xkonni commented 1 month ago

got 2 crkbd rev 4.1, love them, typing on my old 60% is a pain now.

but for some reason the usb connections on both devices are rather unstable on my machines (linux pc, 2 dell laptops with linux). first I thought it was a hw issue, but the second (one from a diy store in germany, one from aliexpress) shows the exact same issues.

using your firmware with the vial keymap. tried some options (remove USB_SUSPEND_WAKEUP_DELAY, increase it, ...) but the devices remain unstable. sometimes they run for hours, then they fail every few seconds.

Could this be related to https://github.com/foostan/crkbd/issues/229 ?

any help is highly appreciated!

logs:

Sep 29 21:15:06 annoyance kernel: input: foostan Corne v4 as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.0/0003:4653:0004.0058/input/input158
Sep 29 21:15:06 annoyance kernel: hid-generic 0003:4653:0004.0058: input,hidraw6: USB HID v1.11 Keyboard [foostan Corne v4] on usb-0000:2a:00.1-3/input0
Sep 29 21:15:06 annoyance kernel: hid-generic 0003:4653:0004.0059: hiddev99,hidraw7: USB HID v1.11 Device [foostan Corne v4] on usb-0000:2a:00.1-3/input1
Sep 29 21:15:06 annoyance kernel: input: foostan Corne v4 Mouse as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005A/input/input159
Sep 29 21:15:06 annoyance kernel: input: foostan Corne v4 System Control as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005A/input/input160
Sep 29 21:15:06 annoyance kernel: input: foostan Corne v4 Consumer Control as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005A/input/input161
Sep 29 21:15:06 annoyance kernel: input: foostan Corne v4 Keyboard as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005A/input/input162
Sep 29 21:15:06 annoyance kernel: hid-generic 0003:4653:0004.005A: input,hidraw8: USB HID v1.11 Mouse [foostan Corne v4] on usb-0000:2a:00.1-3/input2
Sep 29 21:15:09 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:09 annoyance kernel: usb 1-3: device descriptor read/all, error -71
Sep 29 21:15:09 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:09 annoyance kernel: usb 1-3: device descriptor read/64, error -71
Sep 29 21:15:09 annoyance kernel: usb 1-3: device descriptor read/64, error -71
Sep 29 21:15:10 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:10 annoyance kernel: usbhid 1-3:1.0: can't add hid device: -71
Sep 29 21:15:10 annoyance kernel: usbhid 1-3:1.0: probe with driver usbhid failed with error -71
Sep 29 21:15:11 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:11 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:13 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:13 annoyance kernel: usb 1-3: device descriptor read/64, error -71
Sep 29 21:15:13 annoyance kernel: usb 1-3: device descriptor read/64, error -71
Sep 29 21:15:13 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:13 annoyance kernel: usb 1-3: device descriptor read/64, error -71
Sep 29 21:15:14 annoyance kernel: usb 1-3: device descriptor read/64, error -71
Sep 29 21:15:14 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:15 annoyance kernel: usb 1-3: reset full-speed USB device number 50 using xhci_hcd
Sep 29 21:15:15 annoyance kernel: usb 1-3: device firmware changed
Sep 29 21:15:15 annoyance kernel: usb 1-3: USB disconnect, device number 50
Sep 29 21:15:16 annoyance kernel: usb 1-3: new full-speed USB device number 51 using xhci_hcd
Sep 29 21:15:16 annoyance kernel: usb 1-3: unable to read config index 0 descriptor/all
Sep 29 21:15:16 annoyance kernel: usb 1-3: can't read configurations, error -71
Sep 29 21:15:16 annoyance kernel: usb 1-3: new full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:16 annoyance kernel: usb 1-3: New USB device found, idVendor=4653, idProduct=0004, bcdDevice= 4.10
Sep 29 21:15:16 annoyance kernel: usb 1-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Sep 29 21:15:16 annoyance kernel: usb 1-3: Product: Corne v4
Sep 29 21:15:16 annoyance kernel: usb 1-3: Manufacturer: foostan
Sep 29 21:15:16 annoyance kernel: usb 1-3: SerialNumber: vial:f64c2b3c
Sep 29 21:15:16 annoyance kernel: input: foostan Corne v4 as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.0/0003:4653:0004.005B/input/input163
Sep 29 21:15:16 annoyance kernel: hid-generic 0003:4653:0004.005B: input,hidraw6: USB HID v1.11 Keyboard [foostan Corne v4] on usb-0000:2a:00.1-3/input0
Sep 29 21:15:16 annoyance kernel: hid-generic 0003:4653:0004.005C: hiddev99,hidraw8: USB HID v1.11 Device [foostan Corne v4] on usb-0000:2a:00.1-3/input1
Sep 29 21:15:16 annoyance kernel: input: foostan Corne v4 Mouse as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005D/input/input164
Sep 29 21:15:16 annoyance kernel: input: foostan Corne v4 System Control as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005D/input/input165
Sep 29 21:15:16 annoyance kernel: input: foostan Corne v4 Consumer Control as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005D/input/input166
Sep 29 21:15:16 annoyance kernel: input: foostan Corne v4 Keyboard as /devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:08.0/0000:2a:00.1/usb1/1-3/1-3:1.2/0003:4653:0004.005D/input/input167
Sep 29 21:15:16 annoyance kernel: hid-generic 0003:4653:0004.005D: input,hidraw9: USB HID v1.11 Mouse [foostan Corne v4] on usb-0000:2a:00.1-3/input2
Sep 29 21:15:20 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:21 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:23 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:23 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:24 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:27 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:27 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:28 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:31 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:31 annoyance kernel: usb 1-3: Device not responding to setup address.
Sep 29 21:15:31 annoyance kernel: usb 1-3: Device not responding to setup address.
Sep 29 21:15:31 annoyance kernel: usb 1-3: device not accepting address 52, error -71
Sep 29 21:15:31 annoyance kernel: usb 1-3: WARN: invalid context state for evaluate context command.
Sep 29 21:15:31 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:31 annoyance kernel: xhci_hcd 0000:2a:00.1: ERROR: unexpected setup context command completion code 0x11.
Sep 29 21:15:31 annoyance kernel: usb 1-3: hub failed to enable device, error -22
Sep 29 21:15:31 annoyance kernel: usb 1-3: WARN: invalid context state for evaluate context command.
Sep 29 21:15:31 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:31 annoyance kernel: xhci_hcd 0000:2a:00.1: ERROR: unexpected setup address command completion code 0x11.
Sep 29 21:15:32 annoyance kernel: xhci_hcd 0000:2a:00.1: ERROR: unexpected setup address command completion code 0x11.
Sep 29 21:15:32 annoyance kernel: usb 1-3: device not accepting address 52, error -22
Sep 29 21:15:32 annoyance kernel: usb 1-3: WARN: invalid context state for evaluate context command.
Sep 29 21:15:32 annoyance kernel: usb 1-3: reset full-speed USB device number 52 using xhci_hcd
Sep 29 21:15:32 annoyance kernel: xhci_hcd 0000:2a:00.1: ERROR: unexpected setup address command completion code 0x11.
Sep 29 21:15:32 annoyance kernel: xhci_hcd 0000:2a:00.1: ERROR: unexpected setup address command completion code 0x11.
Sep 29 21:15:32 annoyance kernel: usb 1-3: device not accepting address 52, error -22
Sep 29 21:15:32 annoyance kernel: usb 1-3: USB disconnect, device number 52
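
As a reading aid for a log like the one above: on Linux, error -71 is EPROTO (protocol error) and -22 is EINVAL (invalid argument). A minimal Python sketch for tallying these events, assuming the log lines are available as text; the category names and the `summarize` helper are made up for illustration and are not part of any tool used in this thread:

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted above.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = {
    "reset": re.compile(r"reset full-speed USB device"),
    "eproto": re.compile(r"(?:error|device:) -71\b"),   # -71 == EPROTO
    "einval": re.compile(r"error -22\b"),               # -22 == EINVAL
    "disconnect": re.compile(r"USB disconnect"),
    "enumerated": re.compile(r"New USB device found"),
}


def summarize(lines):
    """Count how many log lines fall into each category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, the output of `journalctl -k` could be split into lines and passed to `summarize` to compare how often resets and -71 errors occur with and without a suspected trigger nearby.
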

foostan commented 1 month ago

Thank you for the information. I have received some reports, but I have not yet been able to identify what the cause is. I will review some of the design policies and try to improve them.

xkonni commented 1 month ago

if you need any further information or have an idea how to fix existing pcbs I'm all ears!

l4u commented 1 month ago

@xkonni can you let us know the distro and kernel versions please?

xkonni commented 1 month ago

sure, here are my 3 test computers

foostan commented 1 month ago

Is yours cherry or chocolate? How about the communication between the left and right sides? Is only the USB connection unstable?

xkonni commented 1 month ago

I got two 4.1 here, one cherry, one choc. they behave exactly the same. On a usual day they work. Does not matter which one I use.

Then after a while the usb issues appear. Changing usb from left to right does not help, switching cherry to choc does not help. My left side is normally plugged in via usb, right via trrs. Sometimes the left side still works, right does not. But then replugging the left or switching to the right just leads to more usb errors in the kernel log.

A cold boot sometimes helps.

foostan commented 1 month ago

Thank you for sharing the details!

PaulRopel commented 2 weeks ago

I’m also experiencing some disconnect issues with my Corne v4.1. When I plug it in and use it to practice on keybr.com, it works well. However, if I set it aside (while it’s still plugged in), switch to browsing, or use my Mac keyboard, the Corne v4.1 stops responding when I try to use it again, even though the LED remains on. Let me know if I can help with troubleshooting.

dahmwern commented 2 weeks ago

I'm having similar issues as well. It seems that the non-plugged side disconnects most often, but sometimes I'll get disconnects on the plugged-in side as well. I saw the LEDs flicker when this happened, which isn't surprising, but it was a series of very short bursts of flickers, which makes me think it's a power-related issue.

foostan commented 2 weeks ago

This is just a guess, but from some reports I've heard it seems to be a power supply issue. There are some parts that are not very well designed, and some of them may be defective.

I'd like to isolate some of the root causes, and I'd appreciate any information you could give me.

dahmwern commented 2 weeks ago

I'd like to isolate some of the root causes, and I'd appreciate any information you could give me.

* Does the problem happen on a different PC?

* Does the problem happen on another Corne v4 (if you have one)?

I've had the issue on a Mac. I do have a spare PC that I can test with later this weekend and report back.

No spare Corne v4.1 keyboards assembled to test with easily.

dahmwern commented 2 weeks ago

Update:

Set up:

  1. Connected via USB C to MacBook Pro directly and with USB C hub
  2. USB C to right half of keyboard
  3. TRS between halves

I used the Corne V4.1 all day today, about 10 hours of use during work. I experienced a total of about 10 losses of function, some back to back, with varying amount of time between them.

Left hand (slave side) had about 6-7 losses of function. Right hand (master side) had about 3-4 losses of function. On one occasion, losses of function occurred every 30 seconds and required the keyboard to be reflashed.

Each time there was loss of function, it was preceded by LED flickering.

Hope this helps! I'm happy to set up more Corne V4.1s to test in varying conditions.

dahmwern commented 1 week ago

Another update:

I swapped my keyboard out with another Corne v4.1 PCB this evening. I did this to verify that there were no hardware issues with the first PCB. I also used the same firmware to avoid SW variation.

I confirmed the same behavior with keyboard lockup on one side resulting in requiring a power cycle to recover.

This is a big issue! Right now I can't use my (5) Corne v4.1's nor can I use a v4.1 as my daily driver with these reliability issues.

@foostan have you looked into this any further?

foostan commented 1 week ago

Thank you for your confirmation. Unfortunately, this problem does not occur in my environment, so I cannot investigate further.

alessiocurri commented 1 week ago

Hi, i can report i have the same issue. The keyboard locks up so much it's impossible to use. I tested the keyboard with two different sets of pcbs (both chocolate), with multiple computers (mostly linux, a windows one out of desperation). I also tried flashing a custom KMKfw one-side-only setup and, later, a custom QMK firmware. Multiple USB cables, hubs, no hubs, hid-remapper in front of the keyboard. Same result. The two sets of pcbs were sourced from different vendors in Europe, i tested both.

How can we help you further investigate this issue?

edit: i forgot to add, the keyboard seems to lock up less with QMK.

chadhakala commented 1 week ago

FYI the second USB port (on the opposite hand) will work, although your special keybinds may behave differently from your custom layout. I found this to be a pleasant surprise, considering a USB port joint was damaged on mine and the opposite hand allows me to work around the one broken USB jack. Not sure if this will solve your problem, but it's worth a shot, and worth knowing that it appears to differ from older branches in that way.

foostan commented 1 week ago

Another possibility is that the PCB is simply damaged. Please also contact your supplier for further information.

alessiocurri commented 1 week ago

@foostan 4 different PCBs from 2 different vendors show the exact same issue, both used as a pair and as a single unit (with a custom firmware). The same issue is reported by other users in this thread. I assembled the keyboard myself, and inspected the second set of PCBs very carefully when i received them: the only reason I bought a second set was to test if my unit was the issue. The custom software was tested on a generic RP2040 to check its stability: no issues for days, while the same (KMKfw) software running on the corne has usb issues after a few minutes. I can reproduce this with all my 4 units (2 left and 2 right ones) and it works fine on any other RP2040 i tested.

I have spent quite a lot of time trying to debug and i'm 100% positive it's not a single unit, it's not my computer, the usb cable or similar.

What i'm hoping to get here is some help in further debugging what is an issue with the USB on the keyboard, and hopefully find a solution/workaround to help the other user that may have the same issue.

So, in that light, is there any other info i can provide?

@chadhakala no, the usb port is not damaged at all.

foostan commented 1 week ago

Thank you for sharing the details. I'm glad you're being helpful.

So what you're reporting means that the issue is more likely to occur with KMKfw than with QMK? I'll give KMKfw a try. Thanks again.

dahmwern commented 1 week ago

@foostan I don't think he's saying the KMKfw is worse, but rather by testing the same firmware on a generic RP2040 and on the Corne v4.1 board, the issue is only present on the Corne v4.1. This eliminates as many noise factors as conveniently possible.

The help needed is some debugging on the Corne v4.1 USB HW design to understand what's unique to the design that's causing the issue.

Please let us know if you need data. I am fully willing to support as needed. I would love to help solve this.

foostan commented 1 week ago

I'm sorry, of course. I didn't mean KMKfw is worse. I would like to isolate the problem and investigate the cause in detail.

Thank you for your cooperation. Let's share information on this issue.

ChadHacksaLot commented 1 week ago

@alessiocurri My apologies, I didn't realize this thread was all about the lockup. I have faced this issue, along with other unique issues that I have no systematic evidence to attribute to a USB fault.

The last time I used the corne I did face this exact lockup issue and stopped using it completely for that reason. I am following all these threads, so my apologies; I'm a little embarrassed for chiming in without even reading the full thread. I'm pretty sure I meant to respond to a different comment and was unaware there was even an issue for the lockup.

alessiocurri commented 1 week ago

@dahmwern exactly what i meant ;)

@foostan here https://gist.github.com/alessiocurri/18e6b0c48a74c37dee766a71a22ac62a you can find my config for a left-only corne 4.1, no TRRS cable nor right side necessary. This script will run fine on any circuitpython, i tried on versions 8.x, 9.1 and 9.2 (no changes).

To install KMKfw I just copied the KMKfw files from the github repo, added the neopixel.py library (you can also use the .pyc, it should be the same) and my code.py and boot.py (the latter is not strictly necessary). Please note the default layer is empty. To test you need to switch to another layer, the leds will highlight the active keys.

In this config i can replicate the usb lockup in 20 minutes on average, using all 4 boards (tested with and without the switches, no difference). The easiest way to check the status is to use a serial console on the virtual COM port exposed. There you can find the python REPL. That virtual serial port will disappear when the usb issue presents itself.
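
The check described above (the virtual serial port disappearing when the USB issue appears) can be automated. A minimal sketch, assuming a Linux or macOS host where the port shows up as a device node; the path `/dev/ttyACM0` and the helper name are assumptions, not taken from the gist:

```python
import os
import time

# Hypothetical path: use whatever name your OS assigns to the board's serial port.
PORT = "/dev/ttyACM0"


def polls_until_missing(path=PORT, interval_s=1.0, max_polls=60, sleep=time.sleep):
    """Poll for the serial device node.

    Returns the index of the first poll at which the node was missing,
    or None if it was still present after max_polls checks.
    """
    for poll in range(max_polls):
        if not os.path.exists(path):
            return poll
        sleep(interval_s)
    return None
```

With `interval_s=1.0`, the returned index is roughly the number of seconds the port stayed up, which makes it easy to compare runs with and without a suspected trigger.
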

@ChadHacksaLot no prob at all, probably it's me owing an apology... in my reply i have been very blunt, probably a tad much :)

viscount-monty commented 1 week ago

Just wanted to add that I'm experiencing what sounds like the exact same issue with my corne v4.1.

Same behaviour everyone above is describing - sometimes one side becomes unresponsive, sometimes both, sometimes it works nearly all day, sometimes it's only seconds or minutes until the next lockup after disconnecting and reconnecting the USB cable.

One time, the right side even changed colour to the pattern pictured below: image

Same behaviour when plugged into

I absolutely adore this keyboard when it works, I would love to assist in some way. I have career experience in PCB design and experience in micro-controller firmware programming - let me know what I can do to help or point me in a direction please :)

dahmwern commented 1 week ago

@viscount-monty I've experienced that same LED pattern during lockup. Your description is consistent with my experience.

What information do you need to analyze the PCB design for potential USB related comms issues?

foostan commented 1 week ago

It seems that it may or may not occur depending on the environment. I don't know under what conditions it occurs, but has anyone noticed any electrical abnormalities when the problem occurs, such as a short interruption or a significant drop in voltage or current?

dahmwern commented 1 week ago

@foostan I have not noticed any abnormalities. I have experienced this at both work and at home. All my other peripherals continue to function normally (USB C mouse, USB C dongles, Bluetooth connected devices).

Thinking back, the occurrences seem completely random, there's no consistent environmental trigger.

That makes me think the USB connection itself is on the border of instability always and when slight fluctuations on the bus occur, the connection drops.

alessiocurri commented 1 week ago

@foostan i've checked with both a cheap usb analyzer in front of the keyboard and by adding in series an usb-c power injector to make sure this is not an issue with the 5v rail. I think it's environmental tho, and EM: I think the issue happens more often when my mobile is close to the keyboard (which is most of the time). But i have no hard data to show.

To test if this is an issue with the USB traces, as hinted by @fabianmuehlberger in #229, I bought a usb isolator to see if going usb full-speed instead of high speed makes any difference. I went the isolator way because i have not found a config to force a USB port to go lower than high speed. I'm also planning to use a usb-c power injector fed from a battery to overcome the 200 mA limit of the (very old...) isolator. It will arrive on Monday, i will test then if this changes anything.

viscount-monty commented 6 days ago

@dahmwern I've got the PCB design files open in KiCad and was looking through them when I had a minor breakthrough - hopefully others can confirm/replicate. As I'm sure @foostan can attest, debugging a PCB for noise issues can be a potentially time, labour, cost, and equipment intensive process, so ideally the scope is narrowed as much as possible prior to beginning the process.

Thanks in part to @alessiocurri for giving me the idea of EM interference from a phone which I was able to investigate more thoroughly.

I put together a quick and dirty bash script (gist) to monitor the connection state of my corne keyboard, logging and notifying me of every time it disconnects.
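
For anyone who would rather not run bash, the same idea can be sketched in Python. This is not the script from the gist, only an illustration under assumptions: it polls `lsusb` (so it is Linux-only), matches the `4653:0004` vendor/product ID shown in the kernel log at the top of this issue, and the function names are hypothetical.

```python
import subprocess
import time
from datetime import datetime

# idVendor:idProduct as printed in the kernel log earlier in this thread.
DEVICE_ID = "4653:0004"


def is_present(lsusb_output, device_id=DEVICE_ID):
    """Return True if any line of lsusb output mentions the device ID."""
    return any(device_id in line for line in lsusb_output.splitlines())


def read_lsusb():
    """Run lsusb and return its text output (requires usbutils on Linux)."""
    result = subprocess.run(["lsusb"], capture_output=True, text=True, check=False)
    return result.stdout


def monitor(read=read_lsusb, interval_s=1.0, max_polls=None,
            sleep=time.sleep, log=print):
    """Poll the bus and record every connect/disconnect transition."""
    previous = None
    events = []
    polls = 0
    while max_polls is None or polls < max_polls:
        present = is_present(read())
        if previous is not None and present != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            event = (stamp, "connected" if present else "disconnected")
            events.append(event)
            log(*event)
        previous = present
        polls += 1
        sleep(interval_s)
    return events
```

Calling `monitor()` with no arguments runs until interrupted and prints a timestamped line on every change. Like the bash script, it can only see the USB-connected half drop off the bus; a lock-up of the other half never shows up in `lsusb`.
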

I found that the keyboard would remain connected without issues all night while I was sleeping (in another room) with my phone also out of the room.

I then found that I could reliably cause a disconnection by placing my phone on or near either side of the keyboard. Note that placing it near the non-USB-connected side causes a lock-up of that side that is undetectable, well, undetectable to the bash script, because the other side (USB connected, not in close proximity to the phone) stays connected. When you're using the keyboard you certainly notice one side stop working!

At this stage it seems very much confirmed that the disconnects are caused by EMI, a common source being a smartphone, but that is not a great sample size!

First step: could everyone please try to replicate and confirm the issue? We need to confirm the following to move forward with this theory:

  1. That the corne v4.1 does NOT disconnect for a significant period of time (12 - 24 hours) when it is not close to
    • smartphones
    • I don't currently suspect this particular bandwidth/protocol/power level, but if moving these items away fixes someone else's issue, it would be important to know:
      • wireless routers
      • bluetooth devices

Thanks to everyone's input so far, and thanks in advance for any attempts to replicate the issue and gather the data/logs to back it up!

foostan commented 5 days ago

Thank you for sharing! That script looks good 👍 I'll try to replicate and confirm the issue.

dahmwern commented 5 days ago

@viscount-monty great work! I'll test this today as well.

viscount-monty commented 5 days ago

Thanks! And my apologies - I forgot to change the specific device ID in the lsusb command and just grep for corne.

Let me know if anyone needs a hand getting the script up and running 🤔

alessiocurri commented 5 days ago

"Good" news/bad news situation:

"Good" news: Now that we "know" EMI is the probable issue, i can reproduce it in 20 seconds by moving my mobile closer and switching off the wifi. This (i assume) makes the 4/5g modem talk, and the keyboard is in a non-functional state in seconds. By closer i mean around 20cm from the usb connected side (the left, in my case).

Bad news: The isolator makes no difference at all, probably because it's an USB2.0 device, and not 1.1 as described in the amazon page :rage: .

Btw, this sheds a bit of light on why (in my experience) QMK was more "stable": i was testing it on another machine during the night, while not there.

I still would like to test forcing USB1.1 (mostly to find a use for the keyboard and not have to just recycle the switches in another project), but have no idea how. If anybody has a suggestion, or has an idea on how to try and shield the USB traces, it would be more than welcome.

dahmwern commented 5 days ago

@alessiocurri you hit the nail on the head! I can trigger the failure by following the identical procedure!

Phone with WiFi on: No impact
Turn WiFi off: Immediately BOTH my test Corne v4.1s locked up at the same time.

I'm going to run the script over night with both Corne v4.1's connected, far away from any connected devices and see if they are stable for a 24 hour period.

dahmwern commented 5 days ago

@viscount-monty can you modify the script to run on Windows? My work machine is a mac but I'm running this test on my SurfacePro spare computer. I'm not really a linux user.

foostan commented 5 days ago

I turned off the Wi-Fi on my iPhone and moved it closer to the Corne while it kept transferring data, and I reproduced the situation where the closer half locks up. I have not yet been able to reproduce the problem reliably, so I will keep looking for a way to do so for a while.

dahmwern commented 5 days ago

@foostan I have a theory there... That could be down to 5G band variation internationally. Google doesn't seem to bring up much easy consensus on bands in Japan quickly.

I will need to research 5G band variations globally to understand if that's a reasonable limitation on testing repeatability. At the very least, it should be considered an environmental factor that may impact individual testing.

viscount-monty commented 5 days ago

@alessiocurri nice work - disabling the WiFi on my phone allowed me to more reliably replicate the issue by placing my phone near the corne.

To everyone attempting to replicate - does switching off the WiFi AND starting a data-intensive process (like playing a YouTube video) replicate the issue more reliably? This should trigger a whole bunch of cellular transmissions. Without this they would be potentially very brief and sporadic.

And, inversely, when placing the phone in 'aeroplane mode' so that wireless features are turned off, does this allow you to place the phone on or near the corne for an extended period without experiencing a disconnect? While initial brief tests seem to indicate that this is the case, unfortunately I can't test this for an extended period until later in the evening as I am expecting SMSs and a phone call 🤣️

@foostan just to confirm, the computer cannot detect a lockup of the non-USB connected side, only the user can notice either no input at all, or a key getting 'stuck' and continuing to repeat without user input (I'm guessing the last command received from the non-USB connected side must have been a 'key down').

@dahmwern unfortunately I know very little about windows powershell/batch scripts. Would you be able to run a python script? I could probably throw something together quickly. Also, excellent point about the variation in 5G bands: the wikipedia article lists 5G bands from ~450 MHz through to multiple GHz... Wild.

themrb commented 4 days ago

I did some testing with airplane mode -

Having my phone next to the keyboard and turning on airplane mode immediately causes the secondary half to become unresponsive (to be clear, my airplane mode does not affect wireless or bluetooth). N=2

Having my phone on airplane mode next to the keyboard does not appear to have any impact, at least over a short duration

Turning airplane mode off does not immediately impact the keyboard, but makes it unresponsive after a few seconds. N=2

If you are typing at the time the unresponsiveness is triggered, the key gets stuck eg you type ttttttttttttttttttttttttttttt
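
One possible explanation for the stuck key, as a toy model only: assume the primary half keeps reporting the last state it received from the secondary half when updates stop arriving. This is an assumption for illustration, not a description of what QMK actually does, and `host_view` is a made-up helper.

```python
def host_view(reports, link_fails_at):
    """Return the key state the host would see at each tick.

    reports: list of bools (True = key held) produced by the secondary
             half on each tick.
    link_fails_at: tick index from which reports no longer arrive.
    """
    seen = []
    last_known = False
    for tick, pressed in enumerate(reports):
        if tick < link_fails_at:
            last_known = pressed      # fresh report received
        seen.append(last_known)       # after the failure, the stale state is reused
    return seen
```

If the link dies while the key is down, the key-up never arrives and the host keeps seeing a held key, which its auto-repeat would turn into a run of repeated characters.
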

Location/ 5G bands are Australia

alessiocurri commented 3 days ago

To find a workaround, yesterday i tried with some aluminium foil (the food safe one i got from the kitchen). I wrapped the keyboard completely (including the keys...) in 2-3 layers and ran a simple python code (print i++, in a nutshell) on the rp2040. The keyboard did not get stuck while trying my worst with my mobile. While this works, it's not the most practical solution (plus your partner may think you're a bit crazy...). I ordered some conductive filament and i will try to reprint the case and the top plate with that. I'm not sure the filament (with a resistance of 1.5 ohm per cm) will work as well as the foil, nor if the (very necessary) gaps for the switches are an issue. The alternative is to order a full metal case, but that looks expensive for a test, around 70 euros per side. Hopefully i will have some results with the conductive filament, good or bad, in around a week.

fabianmuehlberger commented 3 days ago

Hey @alessiocurri since you mentioned me and the issue I brought up. This really looks like an EMI problem with the USB differential pair. To be clear, I am not a high-speed signal or EMI engineer, I am just a hobbyist. So I could not test your hypothesis, since I am missing expensive testing equipment and knowledge. Not sure how to debug this hardware issue without the proper tools. Usually a test signal would be sent through the lines and an eye pattern would be measured; a high-speed oscilloscope would be needed, as well as an isolated environment. Maybe the easier option is to make a test PCB: move the IC to a suitable place with proper USB routing, and build it.

alessiocurri commented 3 days ago

@fabianmuehlberger you are absolutely right, the right thing to do would be to move the IC closer to the USB port, or in general re-route to maintain signal integrity all the way to the micro. The tests (with the filament) are more for the people (like me) who will not get a new board and want to try and salvage the situation.

ps Thank you for bringing up the issue in the first place, it helped a lot when trying to find the issue.

george-norton commented 3 days ago

Just wild speculation on my part. I don't own a corne-v4. But @viscount-monty suggested that the slave half was similarly affected, which suggests maybe the USB thing is a red herring.

The crystal used is not a part I have seen in an RP2040 design before (although perhaps it is well used in the Japanese keyboard scene?). The RP2040 hardware design guide is very explicit that deviating from the recommended crystal may lead to instability. The selected part has quite a different load capacitance from the Abracon ABM8-272-T3, so perhaps the values of the load capacitors and series resistor need tweaking? Maybe shielding just the crystal might be an interesting experiment?

foostan commented 3 days ago

Thank you for all your various suggestions, they are accommodating and informative. It seems like a good idea to check the basics, such as moving the IC closer to the USB, paying attention to the length and thickness of the wires, and reviewing the crystal parts. I'm open to any suggestions here and will make a prototype. Thank you again for your cooperation.

foostan commented 3 days ago

Considering the current case, the distance between the IC and the USB doesn't seem to matter much, because this issue also occurs at one side which is not connected USB.

george-norton commented 3 days ago

If you are familiar with the SWD interface and can reproduce the failure, you may be able to power the board via the OLED header, connect to SWD and see if you can get the MCU to fail without any USB connection present at all.

alessiocurri commented 3 days ago

@george-norton i tested this with circuitpython and a simple program that lights the leds in sequence: when the USB goes down, the mcu is still running fine. I think the processor itself is not affected, but the software is: when KMKfw tries to send data via USB, the python core crashes after a few seconds. I assume QMK has a similar behavior, but i have not tested it.

@foostan I'm a hobbyist in this field (electronics), so this is a bit of guesswork. I think the secondary unit is stuck because it cannot sync with the primary, which in turn is stuck trying to write to the USB (at least with the default qmk firmware). I have verified that the MCUs of both the primary and the secondary unit are still alive (with the aforementioned script) and will talk via the interconnecting cable (in my example i'm flipping the gpio connected to the TRS cable up and down at ~1Hz on one side, and turning on a led on the other). This is to say that i would consider the other side being stuck only a side effect of not being able to talk to the primary side.

fabianmuehlberger commented 3 days ago

@george-norton You are right. It would be beneficial to rule out other causes. I assume this could be tested by running the RP2040 with some testing C or Python code, just logging the behavior via the debug serial header. For this test, the USB lines should be fully deactivated (disabled GPIOs) to mitigate interference.

I also assume the error codes could indicate the problem we are facing here. The question is: What USB connection issues and logs are present in the following cases? And are there other signs indicating the root cause?

  1. In case of EMI affecting the Crystal.
  2. Insufficient Power for the MCU.
  3. EMI problem in USB differential pair.

Considering the current case, the distance between the IC and the USB doesn't seem to matter much, because this issue also occurs at one side which is not connected USB.

The main reason I mentioned this was, that the design is not within the specification for routing USB lines. This does not say it has to be a problem, but is certainly one of the first areas to look at when problems like this occur.

alessiocurri commented 2 days ago

I tested the case printed with conductive filament but, as expected tbh, there was no change. At this point I'm out of tests I can do.

@foostan while you investigate, i think you should add a warning in the readme.md file about this issue.

foostan commented 2 days ago

Added a notice about this issue to the README: https://github.com/foostan/crkbd?tab=readme-ov-file#notice