BetzDrive / bldc-controller-hardware

Hardware design files for BLDC servo controller

Significantly reduced communication reliability between v2.0 and v2.2 #9

Closed gbalke closed 4 years ago

gbalke commented 4 years ago

When running some communication frequency tests, I observed unusually high communication error rates with v2.2 boards. I thought this was just a fluke but I decided to swap out for a v2.0 board to see if the same issues would exist. To my surprise, the v2.0 board had far superior performance.

The setup I used is shown below. (Note: I hot-plugged the 5-pin output, which broke the RS485 conversion circuitry on this interface board; it worked just fine prior to that.) 119176382_685176595415906_7781205890179192678_n

Here are the results of each board while running a 1000 packet frequency test:

v2.0
[ INFO] [1600023829.823944194]: comm time dt: 0.001020, freq: 980.789576, error rate: 0.00%
[ INFO] [1600023830.885020693]: comm time dt: 0.001061, freq: 942.431438, error rate: 0.00%
[ INFO] [1600023831.949970126]: comm time dt: 0.001065, freq: 939.006689, error rate: 0.00%
[ INFO] [1600023833.017809807]: comm time dt: 0.001068, freq: 936.479761, error rate: 0.00%
[ INFO] [1600023834.207183691]: comm time dt: 0.001189, freq: 840.779238, error rate: 0.00%
[ INFO] [1600023835.278650344]: comm time dt: 0.001071, freq: 933.292096, error rate: 0.00%
[ INFO] [1600023836.526089476]: comm time dt: 0.001247, freq: 801.641464, error rate: 0.10%
[ INFO] [1600023837.562513888]: comm time dt: 0.001036, freq: 964.855147, error rate: 0.10%
[ INFO] [1600023838.660987897]: comm time dt: 0.001098, freq: 910.355054, error rate: 0.00%
[ INFO] [1600023839.769512690]: comm time dt: 0.001109, freq: 902.098902, error rate: 0.00%
[ INFO] [1600023840.896899233]: comm time dt: 0.001127, freq: 887.011400, error rate: 0.10%
[ INFO] [1600023841.988356426]: comm time dt: 0.001091, freq: 916.205157, error rate: 0.10%
[ INFO] [1600023843.113874297]: comm time dt: 0.001126, freq: 888.479058, error rate: 0.00%
[ INFO] [1600023844.206556070]: comm time dt: 0.001093, freq: 915.178266, error rate: 0.00%
v2.2
[ INFO] [1600023744.000228761]: comm time dt: 0.001210, freq: 826.165763, error rate: 1.00%
[ INFO] [1600023745.120665772]: comm time dt: 0.001120, freq: 892.501351, error rate: 0.50%
[ INFO] [1600023746.273147391]: comm time dt: 0.001152, freq: 867.689618, error rate: 1.10%
[ INFO] [1600023747.410961965]: comm time dt: 0.001138, freq: 878.880324, error rate: 0.50%
[ INFO] [1600023748.614074245]: comm time dt: 0.001203, freq: 831.176878, error rate: 1.20%
[ INFO] [1600023749.759499979]: comm time dt: 0.001145, freq: 873.039530, error rate: 0.60%
[ INFO] [1600023750.959934337]: comm time dt: 0.001200, freq: 833.032582, error rate: 0.50%
[ INFO] [1600023752.165648419]: comm time dt: 0.001206, freq: 829.379070, error rate: 1.20%
[ INFO] [1600023753.342919607]: comm time dt: 0.001177, freq: 849.433090, error rate: 1.00%
[ INFO] [1600023754.485081835]: comm time dt: 0.001142, freq: 875.524857, error rate: 0.50%
[ INFO] [1600023755.651654054]: comm time dt: 0.001167, freq: 857.213028, error rate: 0.80%
[ INFO] [1600023756.850065793]: comm time dt: 0.001198, freq: 834.437304, error rate: 1.50%
[ INFO] [1600023757.963656416]: comm time dt: 0.001114, freq: 897.995186, error rate: 0.80%
[ INFO] [1600023759.199031016]: comm time dt: 0.001235, freq: 809.472216, error rate: 1.10%
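To compare the two runs at a glance, the per-line error rates can be averaged. The following is a minimal sketch, assuming the log format shown above; `mean_error_rate` is a hypothetical helper and not part of the actual test tooling.

```python
import re

# Matches the trailing "error rate: X.XX%" field of each log line above.
ERROR_RATE_RE = re.compile(r"error rate:\s*([0-9.]+)%")


def mean_error_rate(log_text):
    """Return the mean per-line error rate in percent, or None if no line matches."""
    rates = [float(m.group(1)) for m in ERROR_RATE_RE.finditer(log_text)]
    if not rates:
        return None
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Two lines copied from the v2.2 run above.
    sample = (
        "[ INFO] [1600023744.000228761]: comm time dt: 0.001210, freq: 826.165763, error rate: 1.00%\n"
        "[ INFO] [1600023745.120665772]: comm time dt: 0.001120, freq: 892.501351, error rate: 0.50%\n"
    )
    print(mean_error_rate(sample))
```

Applied to the fourteen lines of each run above, this gives roughly 0.03% for v2.0 and 0.88% for v2.2.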

I reviewed both schematics and noticed that the circuits are identical

v2.2: image

v2.0: image

This leaves me thinking that the issue can only come from a discrepancy in layout. When I browsed the layout, I noticed that, if anything, the v2.2 boards have a more ideal path than the v2.0 boards (much less distance and fewer vias between the RS485 chip and the connector). Because of this, I looked more generally at the board design. In v2.0, the majority of the pours are GND. This is not the case for v2.2, where In1 is primarily 3.3 V, In2 is GND, Back is GND, and Front is a mixture of 48 V and GND over the sections through which comms travel. My main concern here is interference, as I know signals can be finicky. Another idea is that the RS485 chip was moved closer to the connectors (which means RS485 will be better off), but the on-board UART now has farther to travel and, consequently, suffers more from on-board noise. This may be a bad trade-off, as RS485 is a much more robust signal.

I've attached some oscilloscope readings to show what a missed response looks like when talking with two boards (using the math operation to show the differential pair). Two boards are used because the error rate increases quadratically (N^2) with the number of boards, making it easier to catch these on the scope. The error rate when using two boards is about 4 ± 2%. While obviously not precise enough to determine signal integrity, this is mostly meant to show the observed behavior in the interest of reproducing the issue. The close-ups that I viewed have sharp signal edges (to be expected, as this is only ~1 MHz).
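The quadratic growth can be made concrete with a rough model. This sketch assumes every reception (a board hearing the host packet or an earlier board's reply) fails independently with the same probability `p_miss`; the independence assumption and the function names are illustrative and not taken from the firmware. Under the reply-chain protocol described further down in this comment, board k (counting from 1) must hear the host packet plus k-1 earlier replies, so a fully successful exchange with N boards needs N(N+1)/2 receptions.

```python
def receptions_needed(n_boards):
    """Receptions that must all succeed: board k (1-based) hears the host plus k-1 replies."""
    return n_boards * (n_boards + 1) // 2


def exchange_failure_probability(n_boards, p_miss):
    """Probability that at least one board fails to reply, assuming independent misses."""
    return 1.0 - (1.0 - p_miss) ** receptions_needed(n_boards)


if __name__ == "__main__":
    # p_miss = 1% is an assumed value, chosen near the single-board rate above.
    for n in (1, 2, 4, 8):
        print(n, receptions_needed(n), exchange_failure_probability(n, 0.01))
```

With `p_miss` near 1%, the model gives about 3% for two boards, which falls inside the 4 ± 2% range quoted above. It ignores errors on the host's own receiver, so it should be read as a lower-bound style estimate only.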

This is a normal, successful transmission. two_replies

Based on the protocol, one communication error occurs when either the second board misses the host message or misses the reply of the first board. one_reply

If the first board misses the host message, regardless of what happens downstream, no boards will respond. This is due to a counter on each board which is set by the board's message index in the host packet. When a board observes a reply from any board, it decrements this counter. When its counter reaches zero (or 1, I can't remember), it sends its reply! no_reply
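The three cases above (two replies, one reply, no reply) can be reproduced with a small model of the counter scheme. This is a sketch under stated assumptions, not the firmware's implementation: it assumes zero-based message indices, that a board replies when its counter reaches zero, and that a board which missed the host packet never arms its counter. `simulate_exchange` and its arguments are hypothetical names.

```python
def simulate_exchange(heard_host, heard_reply):
    """Return the indices of the boards that reply, in order.

    heard_host[i]:     True if board i received the host packet.
    heard_reply[i][j]: True if board i would receive board j's reply.
    """
    n_boards = len(heard_host)
    # Counter starts at the board's index in the host packet (assumed zero-based).
    counters = [i if heard_host[i] else None for i in range(n_boards)]
    replied = []
    # Board 0 starts with a counter of zero, so it replies straight away.
    ready = [0] if n_boards and counters[0] == 0 else []
    while ready:
        sender = ready.pop(0)
        replied.append(sender)
        counters[sender] = None  # this board is done for the exchange
        for i in range(n_boards):
            if counters[i] is None or not heard_reply[i][sender]:
                continue
            counters[i] -= 1
            if counters[i] == 0:
                ready.append(i)
    return replied


if __name__ == "__main__":
    all_heard = [[True, True], [True, True]]
    print(simulate_exchange([True, True], all_heard))   # both boards heard everything
    print(simulate_exchange([True, False], all_heard))  # second board missed the host
    print(simulate_exchange([False, True], all_heard))  # first board missed the host
```

Because a missed reply leaves every later counter one short, a single lost reception stalls the rest of the chain, which is why the first board is the most sensitive position.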

Any help on identifying what could be the problem would be greatly appreciated! I'll require that this be the primary driver of the v2.3 release.

Edit: Note that both the v2.0 and v2.2 boards I used in testing are from the same manufacturer. The same errors are found in another batch of v2.2 boards from a different manufacturer.

Edit 2: Here are some printouts of the schematics. v2.0 schematic v2.2 schematic

gbalke commented 4 years ago

@codebot @luca-della-vedova any thoughts?

gbalke commented 4 years ago

I removed U7 and terminated the RS485 directly by connecting A and B. Does not seem to have improved error rates:

[ INFO] [1600044100.806813912]: comm time dt: 0.001232, freq: 811.909066, error rate: 1.00%
[ INFO] [1600044101.984372867]: comm time dt: 0.001178, freq: 849.200731, error rate: 0.70%
[ INFO] [1600044103.207265780]: comm time dt: 0.001223, freq: 817.740651, error rate: 0.60%
[ INFO] [1600044104.280283827]: comm time dt: 0.001073, freq: 931.944490, error rate: 0.50%
[ INFO] [1600044105.442910183]: comm time dt: 0.001163, freq: 860.123145, error rate: 1.20%
[ INFO] [1600044106.644414673]: comm time dt: 0.001202, freq: 832.290339, error rate: 0.80%
[ INFO] [1600044107.799858538]: comm time dt: 0.001155, freq: 865.461834, error rate: 1.20%
[ INFO] [1600044109.011259536]: comm time dt: 0.001211, freq: 825.497812, error rate: 1.60%

IMG_20200913_185701

codebot commented 4 years ago

Can you also post scope traces of the RS485 lines (A and B), in addition to the subtract operator? It seems odd that the motor-board RS485 is showing half the amplitude of the interface board.

Single-ended UART signal integrity shouldn't be an issue across a small PCB, regardless of reference plane (GND vs 3v3, etc). It's still low-frequency, only a few MHz.

Can you also probe the UART lines on the motor board? I'm wondering if a trailing "blip" gets triggered in the UART after the STOP pulse. I saw that earlier on my interface boards, but adding a bias network to the interface board RS485 lines (pulling A one direction, pulling B one direction) fixed it. It looks like the interface board in the photo doesn't have this mod; it requires a SMT resistor straddling U4 and another sneaking between two pins of U3. Do you have an interface board with those resistors added to test out?

gbalke commented 4 years ago

I don't have that mod on my interface board, although I've bypassed the FTDI. I can attempt to create some form of biasing using ground and 5V on my dongle board. I'm not sure if this fully explains the issue though, as I'd imagine that's something that would affect both controllers equally...

Here's a scope of RS485 with both channels included with the math operation. I agree that it's strange to see them reading only half amplitude... This doesn't explain why they would miss the host packet in the first place though (which is shown to happen above).

RS485_A_B_MATH

codebot commented 4 years ago

Ahh ok, this is starting to make sense. I believe that without the bias-resistor mod on the interface board, "sometimes" you get a super-fast blip at the end of a transmission. If you probe the UART RX line on a motor board (TP9), I think you'll see it "sometimes." In my testing, this "sometimes" causes a reception of an 0x00 or 0xff byte, which puts the packet parser in a weird state and makes it miss the next packet. Because it's a marginal thing, it doesn't happen all the time, and due to IC process variation, it affects some chips more than others. To provide some margin, we need the bias resistors on the interface board to pull the RS485 lines to a defined state when nobody is transmitting. This was my bad. It's the second bullet in the FIXME for the interface-board repo: https://github.com/BetzDrive/interface-board/blob/master/hardware/FIXME

I'll send a photo of the board mod required to do this. It's a bit tricky, but not totally crazy.

codebot commented 4 years ago

Two 486-ohm 0603 resistors are added in the photo below. The one between C24 and U4 is straightforward. The one on top of U4 is more tricky, but with a few solder blobs and some patience, it's not too bad. The goal of these resistors is to ensure that when nobody is driving the bus, the RS485 lines float apart enough to always ensure that A>B by enough margin to guarantee a "0" state to all RS485 receivers on the bus. The most important time for this is immediately after a transmitter stops transmitting, since the lines want to "bounce" a bit at that instant, which sometimes produces a trailing "blip" if these resistors aren't there to fight it.

IMG-2737
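A quick estimate of the idle differential voltage shows how much margin a bias network like this provides. The sketch below assumes one pull-up on A, an equal pull-down on B, a 5 V supply, and a single termination resistor across the pair; the 120-ohm termination value and the function name are assumptions, not read from the schematic, and receiver input loading is ignored. RS-485 receivers are specified to resolve a differential input of 200 mV or more, so the idle voltage should sit comfortably above that.

```python
def idle_differential_voltage(v_supply, r_bias, r_term):
    """Voltage across the termination while no transmitter drives the bus.

    The pull-up, termination, and pull-down form a series divider:
    v_supply -> r_bias -> A -> r_term -> B -> r_bias -> GND.
    """
    return v_supply * r_term / (2.0 * r_bias + r_term)


if __name__ == "__main__":
    R_TERM = 120.0  # assumed termination value, not taken from the schematic
    for r_bias in (486.0, 560.0):  # bias values mentioned in this thread
        v_ab = idle_differential_voltage(5.0, r_bias, R_TERM)
        print(f"{r_bias:.0f} ohm bias -> {v_ab:.3f} V idle differential")
```

A second termination at the far end of the bus would halve the effective `r_term` and reduce the idle voltage accordingly, so the margin is worth rechecking for the full bus.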

gbalke commented 4 years ago

I tried out this mod with some bread-boarded 560-ohm resistors. Still seeing approximately the same error rate.

[ INFO] [1600048272.762296843]: comm time dt: 0.001238, freq: 807.928272, error rate: 1.00%
[ INFO] [1600048273.895032248]: comm time dt: 0.001133, freq: 882.810631, error rate: 0.90%
[ INFO] [1600048275.029416019]: comm time dt: 0.001134, freq: 881.551299, error rate: 0.30%
[ INFO] [1600048276.301062311]: comm time dt: 0.001272, freq: 786.370781, error rate: 0.90%
[ INFO] [1600048277.426637157]: comm time dt: 0.001126, freq: 888.428334, error rate: 0.50%
[ INFO] [1600048278.699957152]: comm time dt: 0.001273, freq: 785.350944, error rate: 0.70%
[ INFO] [1600048279.861565059]: comm time dt: 0.001162, freq: 860.872917, error rate: 0.60%
[ INFO] [1600048281.062153366]: comm time dt: 0.001201, freq: 832.929351, error rate: 1.20%
[ INFO] [1600048282.204798020]: comm time dt: 0.001143, freq: 875.159404, error rate: 0.30%
[ INFO] [1600048283.341738256]: comm time dt: 0.001137, freq: 879.557334, error rate: 0.70%
[ INFO] [1600048284.468888321]: comm time dt: 0.001127, freq: 887.192702, error rate: 1.10%
[ INFO] [1600048285.535561758]: comm time dt: 0.001067, freq: 937.495283, error rate: 0.60%
[ INFO] [1600048286.757202306]: comm time dt: 0.001222, freq: 818.569938, error rate: 1.10%

IMG_20200913_185309

B is pulled to GND and A is pulled to +5V.

codebot commented 4 years ago

OK. I'd suggest looking at the raw A and B lines again on the scope, trying to really dig into any weirdness. To confirm the end-of-TX issue is gone, try to zoom the scope so that you can observe any parasitics at the instant it stops transmitting. When I was debugging this, I found it helpful to have a script send single characters from the USB host, like the letter a at 10 Hz, so that it was easier to observe the issue. Otherwise you have to try to get the timing offset correct across a full packet, and it's a lot of scope zooming and it's hard to make it repeatable.
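A minimal sketch of such a single-character sender follows. It is not the script used in the original debugging; `send_repeated` and its parameters are hypothetical, and the port object only needs a `write` method, which a pyserial `serial.Serial` instance provides.

```python
import time


def send_repeated(port, payload=b"a", rate_hz=10.0, count=100, sleep=time.sleep):
    """Write `payload` to `port` `count` times, pausing 1/rate_hz seconds after each write."""
    period = 1.0 / rate_hz
    for _ in range(count):
        port.write(payload)
        sleep(period)
    return count


# Example with pyserial (the port name and baud rate are assumptions):
#   import serial
#   send_repeated(serial.Serial(port='/dev/ttyUSB0', baudrate=1000000), b"a", rate_hz=10.0)
```

The actual rate is slightly below `rate_hz` because the write time is not subtracted from the pause, which is fine for triggering a scope on isolated characters.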

Alternatively, you could try to look at the UART RX test point on the motor board. If you're sending single-characters of the ASCII alphabet at 10 Hz or 100 Hz or whatever, you could write a mockup parser for the motor board that prints loud errors if it ever sees anything beyond A-Z.
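The suggestion above is for a parser running on the motor board itself. As a host-side stand-in, the same check can be applied to bytes captured from the UART RX test point through a USB-to-UART adapter. This is a sketch with hypothetical function names, assuming the sender only ever transmits uppercase ASCII letters.

```python
def find_unexpected_bytes(data, allowed=b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Return (offset, byte_value) pairs for every byte in `data` that is not in `allowed`."""
    allowed_set = set(allowed)
    return [(i, b) for i, b in enumerate(data) if b not in allowed_set]


def report(data):
    """Print a loud error for each stray byte and return how many were found."""
    problems = find_unexpected_bytes(data)
    for offset, value in problems:
        print(f"!!! UNEXPECTED BYTE 0x{value:02x} AT OFFSET {offset} !!!")
    return len(problems)


if __name__ == "__main__":
    # A stray 0x00 or 0xff, like the ones mentioned earlier in the thread, is flagged.
    report(b"ABC\x00DEF\xff")
```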

The general idea is to be very very very very sure that the baseline RS485 channel is working at 100% reliability before trying to add any protocols on top of it, and now that gremlins have crept in, I'd suggest going back to single-character reliability checks and digging all the way down, before trying to look at end-to-end performance again.

gbalke commented 4 years ago

I've run a variety of tests to check performance. The first I did was write a single character out on a USB to RS485 adapter and read it back in through a USB to UART adapter after the on-board conversion. This proved to work very well and I had extremely low error rates. To increase the chance of error, I created a script to generate random 256 character byte strings which I then transmitted/received over the same setup. This resulted in, once again, quite low error rates (although more than 0%).

import random

import serial


def get_random_string(length):
    # Build a random lowercase ASCII string and return it as bytes.
    sample_letters = 'abcdefghijklmnopqrstuvwxyz'
    result_str = ''.join(random.choice(sample_letters) for _ in range(length))
    return result_str.encode('utf-8')


if __name__ == '__main__':
    # Per the description above, the sender is the USB to RS485 adapter and
    # the receiver is the USB to UART adapter after the on-board conversion.
    send_ser = serial.Serial(port='/dev/ttyUSB0', baudrate=1000000, timeout=0.004)
    recv_ser = serial.Serial(port='/dev/ttyUSB1', baudrate=1000000, timeout=0.004)

    count = 0
    errors = 0
    report_count = 1000
    while True:
        send_msg = get_random_string(256)
        send_ser.write(send_msg)
        # read() returns fewer bytes than requested if the 4 ms timeout expires.
        recv_msg = recv_ser.read(len(send_msg))

        count += 1
        if recv_msg != send_msg:
            errors += 1
            print('Errored on msg {}, got {}'.format(send_msg, recv_msg))

        # Print a running error rate every report_count messages.
        if count % report_count == 0:
            print(
                'Sent {} random messages with {}% error rate.'.format(
                    count, 1.0 * errors / count * 100
                )
            )

After some time, this resulted in:

Sent 81000 random messages with 0.008641975308641974% error rate.

119951942_1268317426856274_4722307079683863953_n

To confirm the issue is loss of data in RX, I adjusted my frequency testing script to record each type of error that occurs. I added a variety of test point errors to make this easier to track. All errors that occurred were purely from a receive timeout when searching for the sync flag save for a few rare errors of another variety.

[ INFO] [1600847880.961632216]: Comm time dt: 0.002165, freq: 461.873359, errors: 31, error rate: 3.10% 
[ INFO] [1600847880.961701804]: Expected 1 Byte(s) got 0, occurred 31 times                      
[ INFO] [1600847883.186686186]: Comm time dt: 0.002225, freq: 449.423258, errors: 41, error rate: 4.10% 
[ INFO] [1600847883.186746532]: Expected 1 Byte(s) got 0, occurred 41 times                                     
[ INFO] [1600847885.290739501]: Comm time dt: 0.002104, freq: 475.273400, errors: 22, error rate: 2.20% 
[ INFO] [1600847885.290795231]: Expected 1 Byte(s) got 0, occurred 22 times                                      
[ INFO] [1600847887.454893881]: Comm time dt: 0.002164, freq: 462.073911, errors: 33, error rate: 3.30% 
[ INFO] [1600847887.454972813]: Expected 1 Byte(s) got 0, occurred 33 times                    
[ INFO] [1600847889.597321340]: Comm time dt: 0.002142, freq: 466.761117, errors: 25, error rate: 2.50% 
[ INFO] [1600847889.597402923]: Expected 1 Byte(s) got 0, occurred 25 times                             
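Per-type error counting of this kind can be sketched with a `collections.Counter`. This is only an illustration of the bookkeeping, with a hypothetical `ErrorTally` class; it is not the actual frequency test script, and the message format simply mirrors the log lines above.

```python
from collections import Counter


class ErrorTally:
    """Count attempts and tally failures by a short description string."""

    def __init__(self):
        self.attempts = 0
        self.errors = Counter()

    def record(self, error=None):
        """Record one exchange; pass a description string if it failed."""
        self.attempts += 1
        if error is not None:
            self.errors[error] += 1

    def summary(self):
        """Return report lines: the overall rate first, then one line per error type."""
        total = sum(self.errors.values())
        rate = 100.0 * total / self.attempts if self.attempts else 0.0
        lines = [f"errors: {total}, error rate: {rate:.2f}%"]
        lines += [f"{desc}, occurred {n} times" for desc, n in self.errors.most_common()]
        return lines


if __name__ == "__main__":
    tally = ErrorTally()
    for i in range(1000):
        # Pretend every 40th exchange timed out while waiting for the sync flag.
        tally.record("Expected 1 Byte(s) got 0" if i % 40 == 0 else None)
    print("\n".join(tally.summary()))
```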

Why the STM32 is not picking up these messages is beyond me... As shown with the USB to UART adapter, we have very, very low error rates for large messages. Keep in mind the maximum packet size we send with 8 boards is around 150 bytes (completely guessing here, it's been a while). Based on this, we should expect lower error rates than with a 256-byte message. I honestly don't see what the difference is here. Something is causing an issue for the STM32's communication. I think I'll solder some wire directly to the legs where the UART hooks in and try again. Maybe it's an issue upstream from the test points?
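The expectation that shorter packets should fail less often can be put into numbers. This sketch assumes byte errors are independent and that a message counts as failed if any byte is corrupted or lost; the function name is illustrative, and the 150-byte packet size is the rough guess given above.

```python
def scaled_message_error_rate(message_error_rate, from_len, to_len):
    """Rescale a message error rate measured at from_len bytes to to_len bytes.

    Assumes independent per-byte errors, so the per-message success
    probability scales as a power of the length ratio.
    """
    success = (1.0 - message_error_rate) ** (to_len / from_len)
    return 1.0 - success


if __name__ == "__main__":
    measured = 0.008641975308641974 / 100.0  # loopback result above, as a fraction
    estimate = scaled_message_error_rate(measured, 256, 150)
    print(f"estimated 150-byte error rate: {estimate * 100:.4f}%")
```

Under these assumptions the loopback figure corresponds to roughly 0.005% for a 150-byte packet, hundreds of times lower than the 2% to 4% seen with the STM32. That is consistent with the suspicion that the loss happens at the STM32 rather than on the wire.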

@codebot any ideas? Also please feel free to ask for clarification. I'm writing this at 1 AM local haha.

Note: lighter used to burn off coating on wires. Apologies for awkward placement in photo.

gbalke commented 4 years ago

After testing a set of v2.1 boards, I've confirmed that they are not experiencing this issue. I realize now that an issue with the hardware is rather unlikely, so I've instead turned my attention towards the firmware. Though minor, the budgeting of CPU allocation has, in the past, affected communication rate. To test this, I commented out all threads save for the communication thread and observed almost no missed packets. With this in mind, I can treat this as a firmware issue instead of a hardware issue.

gbalke commented 4 years ago

I've found that commit https://github.com/BetzDrive/bldc-controller/pull/27/commits/18629af495c4920166dc24ebda9c619d19ce959e in https://github.com/BetzDrive/bldc-controller/pull/27 performs with very little to no errors. This confirms my suspicion that this is a firmware issue and I'll be closing this issue. I'll be debugging this on the firmware side and open a corresponding issue.

codebot commented 4 years ago

Interesting. I'll be really interested to hear how it goes as you continue to zero in on the problem. So the v2.2 boards seem to be innocent of causing this problem?

gbalke commented 4 years ago

> Interesting. I'll be really interested to hear how it goes as you continue to zero in on the problem. So the v2.2 boards seem to be innocent of causing this problem?

Yeah, it's really funky. Please refer to the issue I just created! I'll be poking at it throughout the week.