LinuxCNC / linuxcnc

LinuxCNC controls CNC machines. It can drive milling machines, lathes, 3d printers, laser cutters, plasma cutters, robot arms, hexapods, and more.
http://linuxcnc.org/
GNU General Public License v2.0
1.78k stars 1.14k forks source link

sserial race bug #412

Closed pcw-mesa closed 6 years ago

pcw-mesa commented 6 years ago

Here are the steps I follow to reproduce the issue:

  1. Run LinuxCNC on a Ethernet FPGA card with a sserial remote
  2. Dump packets
  3. Packets show that on the write function the sserial DoIt command is written before all data registers have been written.

This is what I expected to happen:

DoIt issued after all sserial user data is written

This is what happened instead:

DoIt issued before all data registers are written

It worked properly before this:

Not sure, problem exists in 2.7 and 2.8, and may exist in PCI /EPP drivers as well

andypugh commented 6 years ago

I think I see the problem. But need to test to see if it is as simple as it looks.

rene-dev commented 6 years ago

I think I see what you see. That seems to be the problem. look at the dump, 0x1cc is written before 0x1d4 I think swapping those lines will fix it.

CS: 0x1cc
U0: 0x1d4
U1: 0x1dc
U2: 0x1e4
################
start of packet len 60
lbp16 cmd: 0xc282
write
addr: 0x1ec len: 8
lbp16 cmd: 0xc281
write
addr: 0x1bc len: 4
lbp16 cmd: 0xc281
write
addr: 0x1cc len: 4
lbp16 cmd: 0xc281
write
addr: 0x1d4 len: 4
lbp16 cmd: 0xc281
write
addr: 0x1d0 len: 4
lbp16 cmd: 0xc281
write
addr: 0x1d8 len: 4
lbp16 cmd: 0xd182
write
addr: 0x14 len: 4
end of packet
################
start of packet len 44
lbp16 cmd: 0x4282
read
addr: 0x1ec len: 8
lbp16 cmd: 0x4281
read
addr: 0x1bc len: 4
lbp16 cmd: 0x4281
read
addr: 0x1c4 len: 4
lbp16 cmd: 0x4281
read
addr: 0x1cc len: 4
lbp16 cmd: 0x4281
read
addr: 0x1d4 len: 4
lbp16 cmd: 0x4281
read
addr: 0x1d0 len: 4
lbp16 cmd: 0x4281
read
addr: 0x1d8 len: 4
lbp16 cmd: 0x5901
read
addr: 0x8 len: 2
lbp16 cmd: 0xd182
write
addr: 0x10 len: 4
lbp16 cmd: 0x5184
read
addr: 0x10 len: 8
sending 42 bytes
end of packet
################
rmu75 commented 6 years ago

with this patch, sserial via hm2_rpspi stops working.

andypugh commented 6 years ago

I will have to see if I can find a Pi and a 7i90 then. Can you suggest a simple test that demonstrates that it is broken?

rmu75 commented 6 years ago

Sorry I haven't been more specific. hostmot2 errors out with "DoIt not cleared". My "simple" scenario from the actual machine config:

> halrun
halcmd: loadrt trivkins
Note: Using POSIX realtime
halcmd: loadrt motmod servo_period_nsec=1000000 num_joints=3
halcmd: loadrt hostmot2
hm2: loading Mesa HostMot2 driver version 0.15
halcmd: loadrt hm2_rpspi spiclk_rate=25000
hm2/hm2_7i90.0: Smart Serial Firmware Version 43
hm2/hm2_7i90.0: 72 I/O Pins used:
hm2/hm2_7i90.0:     IO Pin 000 (P1-01): PWMGen #0, pin Not-Enable (Output)
hm2/hm2_7i90.0:     IO Pin 001 (P1-03): Muxed Encoder #0, pin Muxed A (Input)
hm2/hm2_7i90.0:     IO Pin 002 (P1-05): Muxed Encoder #0, pin Muxed B (Input)
hm2/hm2_7i90.0:     IO Pin 003 (P1-07): Muxed Encoder #0, pin Muxed Index (Input)
hm2/hm2_7i90.0:     IO Pin 004 (P1-09): Muxed Encoder #1, pin Muxed A (Input)
hm2/hm2_7i90.0:     IO Pin 005 (P1-11): Muxed Encoder #1, pin Muxed B (Input)
hm2/hm2_7i90.0:     IO Pin 006 (P1-13): Muxed Encoder #1, pin Muxed Index (Input)
hm2/hm2_7i90.0:     IO Pin 007 (P1-15): Muxed Encoder #2, pin Muxed A (Input)
hm2/hm2_7i90.0:     IO Pin 008 (P1-17): Muxed Encoder #2, pin Muxed B (Input)
hm2/hm2_7i90.0:     IO Pin 009 (P1-19): Muxed Encoder #2, pin Muxed Index (Input)
hm2/hm2_7i90.0:     IO Pin 010 (P1-21): Muxed Encoder Select #0, pin Mux Select 0 (Output)
hm2/hm2_7i90.0:     IO Pin 011 (P1-23): PWMGen #0, pin Out0 (PWM or Up) (Output)
hm2/hm2_7i90.0:     IO Pin 012 (P1-25): PWMGen #0, pin Out1 (Dir or Down) (Output)
hm2/hm2_7i90.0:     IO Pin 013 (P1-27): PWMGen #1, pin Out0 (PWM or Up) (Output)
hm2/hm2_7i90.0:     IO Pin 014 (P1-29): PWMGen #1, pin Out1 (Dir or Down) (Output)
hm2/hm2_7i90.0:     IO Pin 015 (P1-31): PWMGen #2, pin Out0 (PWM or Up) (Output)
hm2/hm2_7i90.0:     IO Pin 016 (P1-33): PWMGen #2, pin Out1 (Dir or Down) (Output)
hm2/hm2_7i90.0:     IO Pin 017 (P1-35): PWMGen #3, pin Out0 (PWM or Up) (Output)
hm2/hm2_7i90.0:     IO Pin 018 (P1-37): PWMGen #3, pin Out1 (Dir or Down) (Output)
hm2/hm2_7i90.0:     IO Pin 019 (P1-39): PWMGen #4, pin Out0 (PWM or Up) (Output)
hm2/hm2_7i90.0:     IO Pin 020 (P1-41): PWMGen #4, pin Out1 (Dir or Down) (Output)
hm2/hm2_7i90.0:     IO Pin 021 (P1-43): PWMGen #5, pin Out0 (PWM or Up) (Output)
hm2/hm2_7i90.0:     IO Pin 022 (P1-45): PWMGen #5, pin Out1 (Dir or Down) (Output)
hm2/hm2_7i90.0:     IO Pin 023 (P1-47): PWMGen #0, pin Not-Enable (Output)
hm2/hm2_7i90.0:     IO Pin 024 (P2-01): IOPort
hm2/hm2_7i90.0:     IO Pin 025 (P2-03): IOPort
hm2/hm2_7i90.0:     IO Pin 026 (P2-05): IOPort
hm2/hm2_7i90.0:     IO Pin 027 (P2-07): IOPort
hm2/hm2_7i90.0:     IO Pin 028 (P2-09): IOPort
hm2/hm2_7i90.0:     IO Pin 029 (P2-11): IOPort
hm2/hm2_7i90.0:     IO Pin 030 (P2-13): IOPort
hm2/hm2_7i90.0:     IO Pin 031 (P2-15): IOPort
hm2/hm2_7i90.0:     IO Pin 032 (P2-17): IOPort
hm2/hm2_7i90.0:     IO Pin 033 (P2-19): IOPort
hm2/hm2_7i90.0:     IO Pin 034 (P2-21): IOPort
hm2/hm2_7i90.0:     IO Pin 035 (P2-23): IOPort
hm2/hm2_7i90.0:     IO Pin 036 (P2-25): IOPort
hm2/hm2_7i90.0:     IO Pin 037 (P2-27): IOPort
hm2/hm2_7i90.0:     IO Pin 038 (P2-29): IOPort
hm2/hm2_7i90.0:     IO Pin 039 (P2-31): IOPort
hm2/hm2_7i90.0:     IO Pin 040 (P2-33): IOPort
hm2/hm2_7i90.0:     IO Pin 041 (P2-35): IOPort
hm2/hm2_7i90.0:     IO Pin 042 (P2-37): IOPort
hm2/hm2_7i90.0:     IO Pin 043 (P2-39): IOPort
hm2/hm2_7i90.0:     IO Pin 044 (P2-41): IOPort
hm2/hm2_7i90.0:     IO Pin 045 (P2-43): IOPort
hm2/hm2_7i90.0:     IO Pin 046 (P2-45): IOPort
hm2/hm2_7i90.0:     IO Pin 047 (P2-47): IOPort
hm2/hm2_7i90.0:     IO Pin 048 (P3-01): Smart Serial Interface #0, pin RxData0 (Input)
hm2/hm2_7i90.0:     IO Pin 049 (P3-03): Smart Serial Interface #0, pin RxData1 (Input)
hm2/hm2_7i90.0:     IO Pin 050 (P3-05): Smart Serial Interface #0, pin RxData2 (Input)
hm2/hm2_7i90.0:     IO Pin 051 (P3-07): IOPort
hm2/hm2_7i90.0:     IO Pin 052 (P3-09): Smart Serial Interface #0, pin TxData0 (Output)
hm2/hm2_7i90.0:     IO Pin 053 (P3-11): Smart Serial Interface #0, pin TxEn0   (Output)
hm2/hm2_7i90.0:     IO Pin 054 (P3-13): Smart Serial Interface #0, pin TxData1 (Output)
hm2/hm2_7i90.0:     IO Pin 055 (P3-15): Smart Serial Interface #0, pin TxEn1   (Output)
hm2/hm2_7i90.0:     IO Pin 056 (P3-17): Smart Serial Interface #0, pin TxData2 (Output)
hm2/hm2_7i90.0:     IO Pin 057 (P3-19): Smart Serial Interface #0, pin TxEn2   (Output)
hm2/hm2_7i90.0:     IO Pin 058 (P3-21): IOPort
hm2/hm2_7i90.0:     IO Pin 059 (P3-23): IOPort
hm2/hm2_7i90.0:     IO Pin 060 (P3-25): IOPort
hm2/hm2_7i90.0:     IO Pin 061 (P3-27): IOPort
hm2/hm2_7i90.0:     IO Pin 062 (P3-29): IOPort
hm2/hm2_7i90.0:     IO Pin 063 (P3-31): IOPort
hm2/hm2_7i90.0:     IO Pin 064 (P3-33): IOPort
hm2/hm2_7i90.0:     IO Pin 065 (P3-35): IOPort
hm2/hm2_7i90.0:     IO Pin 066 (P3-37): IOPort
hm2/hm2_7i90.0:     IO Pin 067 (P3-39): IOPort
hm2/hm2_7i90.0:     IO Pin 068 (P3-41): IOPort
hm2/hm2_7i90.0:     IO Pin 069 (P3-43): IOPort
hm2/hm2_7i90.0:     IO Pin 070 (P3-45): IOPort
hm2/hm2_7i90.0:     IO Pin 071 (P3-47): IOPort
hm2/hm2_7i90.0: registered
halcmd: loadrt classicladder_rt  numRungs=200 numTimersIec=20 numPhysInputs=100 numPhysOutputs=100 numSections=10
creating ladder-state
halcmd: addf hm2_7i90.0.read          servo-thread
halcmd: addf motion-command-handler   servo-thread
halcmd: addf motion-controller        servo-thread
halcmd: addf pid.x.do-pid-calcs       servo-thread
halcmd: addf pid.y.do-pid-calcs       servo-thread
halcmd: addf pid.z.do-pid-calcs       servo-thread
halcmd: addf pid.s.do-pid-calcs       servo-thread
halcmd: addf classicladder.0.refresh  servo-thread
halcmd: addf hm2_7i90.0.write         servo-thread
halcmd: start
halcmd: hm2/hm2_7i90.0: Smart Serial port 0: DoIt not cleared from previous servo thread. Servo thread rate probably too fast. This message will not be repeated, but the hm2_7i90.0.sserial.0.fault-count pin will indicate if this is happening frequently.
hm2/hm2_7i90.0: Smart Serial Comms Error: There have been more than 1 errors in 10 thread executions at least 200 times. See other error messages for details.
hm2/hm2_7i90.0: ***Smart Serial Port 0 will be stopped***
halcmd:

The message "DoIt not cleared" appears instantly. Reverting this patch seems to fix this.

andypugh commented 6 years ago

Is this a precompiled RPi image, or did you compile it yourself? (Just trying to save some effort on creating a system to replicate your problem)

pcw-mesa commented 6 years ago

Have you tried running the servo thread slower? the race fix moves the DoIt to after the outgoing data is written to the user registers, this is needed to avoid a race condition, but it does reduce the thread timing margins. Another thing to try is to turn on debugging and see if the DoIt command is actually issued as this change may have exposed a RSPI bug

rmu75 commented 6 years ago

This is a stock raspbian 8 image with kernel version 4.9.65 with raspberry pi patch and rt-preempt rt56-v7 patch, cross compiled. Hardware is a Raspberry Pi 3 with cmdline 8250.nr_uarts=0 bcm2708_fb.fbwidth=1600 bcm2708_fb.fbheight=900 bcm2708_fb.fbswap=1 vc_mem.mem_base=0x3dc00000 vc_mem.mem_size=0x3f000000 sdhci_bcm2708.enable_llm=0 dwc_otg.lpm_enable=0 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait isolcpus=3

I am running with RTAPI_CPU_NUMBER=3

I will try with slower servo-thread, higher SPI frequency and try enabling debug and report back.

andypugh commented 6 years ago

It is possible that the hm2 read and hm2 write function ordering in the HAL file might matter too.

rmu75 commented 6 years ago

increasing servo period to 2000000 does not change anything. running with spi_debug=4 shows

INFO----REALTIME allocations for classicladder:
Sizes: rungs- 200 bits- 20 words- 20 timers- 10 mono- 10 count- 10 IEC timers- 20
 HAL Bin- 100 HAL Bout- 100 expressions- 100 sections- 10 symbols - 200
  s32in - 10 s32out- 10 Error bits-10
halcmd: addf hm2_7i90.0.read          servo-thread
halcmd: addf motion-command-handler   servo-thread
halcmd: addf motion-controller        servo-thread
halcmd: addf classicladder.0.refresh  servo-thread
halcmd: addf hm2_7i90.0.write         servo-thread
halcmd: start
halcmd: MOTION: setting Traj cycle time to 2000000 nsecs
MOTION: setting Servo cycle time to 2000000 nsecs
hm2/hm2_7i90.0: Num Auto = 3
hm2/hm2_7i90.0: Enabled Remotes tag = = 7
hm2/hm2_7i90.0: Smart Serial port 0: DoIt not cleared from previous servo thread. Servo thread rate probably too fast. This message will not be repeated, but the hm2_7i90.0.sserial.0.fault-count pin will indicate if this is happening frequently.
hm2/hm2_7i90.0: Smart Serial Comms Error: There have been more than 1 errors in 10 thread executions at least 200 times. See other error messages for details.
hm2/hm2_7i90.0: ***Smart Serial Port 0 will be stopped***
rmu75 commented 6 years ago

It is possible that the hm2 read and hm2 write function ordering in the HAL file might matter too.

reordering read and write doesn't make any difference, neither does removing the other calls.

rmu75 commented 6 years ago

rebuilt hm2_rpspi with RPSPI_DEBUG. halrunning this

loadrt trivkins
loadrt motmod servo_period_nsec=2000000 num_joints=3
loadrt hostmot2
loadrt hm2_rpspi spiclk_rate=25000 spi_debug=5
loadrt classicladder_rt  numRungs=200 numTimersIec=20 numPhysInputs=100 numPhysOutputs=100 numSections=10
addf hm2_7i90.0.read          servo-thread
addf motion-command-handler   servo-thread
addf motion-controller        servo-thread
addf classicladder.0.refresh  servo-thread
addf hm2_7i90.0.write         servo-thread
start

yields debug.log on stderr.

andypugh commented 6 years ago

The debug log isn't adding much info. It seems that the DoIt fault isn't causing the system to stop, what is causing the problem is a comms error. It might be interesting to mask that fault (set the dec > inc in the sserial params. See fault-inc, fault-dec and fault-lim here: http://linuxcnc.org/docs/2.7/html/man/man9/sserial.9.html I am not suggesting this as a solution, just wondering how bad the problem is.

pcw-mesa commented 6 years ago

Yeah you probably need a log of all HM2 read and writes (data and addresses). I thought there was a way to do this but I dont see it now

jasenk2 commented 6 years ago

Hi all, we have similar problem kernel is not full preemptive. That's all folks.

2018-03-29 22:14 GMT+03:00 andypugh notifications@github.com:

The debug log isn't adding much info. It seems that the DoIt fault isn't causing the system to stop, what is causing the problem is a comms error. It might be interesting to mask that fault (set the dec > inc in the sserial params. See fault-inc, fault-dec and fault-lim here: http://linuxcnc.org/docs/2.7/html/man/man9/sserial.9.html I am not suggesting this as a solution, just wondering how bad the problem is.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/LinuxCNC/linuxcnc/issues/412#issuecomment-377342362, or mute the thread https://github.com/notifications/unsubscribe-auth/ACATkQMw3X_YAMKwotqSo8AyefOl0noZks5tjTKQgaJpZM4SXfxz .

andypugh commented 6 years ago

What bitfile are you using in the 7i92? (Just in case it is bitfile-specific)

rmu75 commented 6 years ago

The debug log isn't adding much info. It seems that the DoIt fault isn't causing the system to stop, what is causing the problem is a comms error.

I don't think communication per se does not work, as the whole thing is working when i back out the patch, i.e. I/O via the sserial-connected 7i90s is possible and working.

The sserial slaves do have stock bitfile, did not modify anything on them. Not sure about the bitfile in the spi-connected 7i90, I had to replace the stock firmware as that did not work. IIRC I built a bitfile, but ultimately flashed an updated version. Unfortunately the machine that I used to build the bitfile and to flash the card is gone, so I can't have a look now (will have to dig out some backups).

rmu75 commented 6 years ago

Yeah you probably need a log of all HM2 read and writes (data and addresses). I thought there was a way to do this but I dont see it now

In theory I could attach an oscilloscope or a logic analyzer, but that would take some time.

My current theory is that hm2_rpspi possibly truncates the transfer, so the DoIt doesn't make it when it is the last thing to be transferred.

rmu75 commented 6 years ago

I hm2_register_tram_write_region-ed a dummy transfer after the DoIt, problem gone. So either hm2_rpspi is doing something strange (truncating transfer) or I have SPI issues or something else.

Will try to hook up oscilloscope tomorrow.

andypugh commented 6 years ago

It has to be hm2_rpspi as all the code above the llio driver is shared. There are two spi_write functions, I am not sure which one is used under which circumstances. But I would be very interested to see what happens if this is changed to a post-decrement: https://github.com/LinuxCNC/linuxcnc/blob/master/src/hal/drivers/mesa-hostmot2/hm2_rpspi.c#L389

rmu75 commented 6 years ago

Here, it doesn't change anything. I'm using SPI0. Comment suggests that hm2_rpspi_write_spi0 is used for SPI0 and ..._spi1 for SPI1.

Is there some register that is safe to dummy-write on the 7i90?

rmu75 commented 6 years ago

I logged read/write calls to hm2_rpspi_read|write_spi0 (see attached file), format is address, size and first word of buffer. debug2.log

andypugh commented 6 years ago

Thanks for being so willing to help diagnose the problem. Hopefully PCW can work out if that log looks right.

pcw-mesa commented 6 years ago

It is suspicious that moving the sserial command write to the end of the write function causes problems. But I dont think the command is truncated since the DoIt command write does seem to succeed: 4973 hm2_rpspi: hm2_rpspi_write_spi0: 5a00 4: 1007 because when the sserial command register is read back it contains 0x1007: 4979 hm2_rpspi: hm2_rpspi_read_spi0: 5a00 4: 1007 You would expect sserial errors if the normal read/process/write/wait order were not followed and I'm assuming that the hal file was returned to normal if any changes to this order were done So it sort of looks like a thread order issue to me (not enough delay between write and read) (there should be more than 1 ms between the command register write and read with the 2 ms servo thread)

andypugh commented 6 years ago

Is it possible that the SPI hardware is disabled a bit too soon, and does not have time to finish sending the data? I am looking at the spi0_reset call here: https://github.com/LinuxCNC/linuxcnc/blob/master/src/hal/drivers/mesa-hostmot2/hm2_rpspi.c#L673

rmu75 commented 6 years ago

I didn't have time to hook up the oscilloscope to the machine today, will try to replicate the setup in the lab and check for SPI transfer trunaction issues.

Is it possible that the SPI hardware is disabled a bit too soon, and does not have time to finish sending the data? I am looking at the spi0_reset call here:

That would be weird. Each line in the logfile is a call to hm2_rpspi_read_spi0 resp. write, spi0_reset is called every time.

andypugh commented 6 years ago

I am unclear why spi_reset needs to be called at all. And I am wondering if, possibly, it is called before the SPI buffer is clear. But I don't even know if there is an SPI buffer.

rmu75 commented 6 years ago

It is more or less the same code as in the driver of the linux kernel, see https://github.com/torvalds/linux/blob/master/drivers/spi/spi-bcm2835.c#L129

I just replaced the body of spi0_reset with a return, no changes. Works if I transfer dummy after DoIt, does not work if DoIt is last write.

pcw-mesa commented 6 years ago

The fact that the read of the sserial command register at 0x5A00 returns the last data written suggests that the write took place but perhaps the thread order is not as it should be (a pre-existing problem masked by the old code writing the sserial command register early).

andypugh commented 6 years ago

The thread order in the HAL file looked correct. I would have considered the possibility that the thread time was such that read followed close upon write, except that the report was that removing functions from threads made no difference.

pcw-mesa commented 6 years ago

Thats true but that is assuming the function invocations are doing what they say they are...

BsAtHome commented 6 years ago

Hi all,

I'm the author who changed all that RPI SPI stuff into its current form. I'll try to help if I can.

The SPI is operated synchronously as seen from the CPU (busy-wait operation). A read operation needs to write as many words as need to be read (will send the 32bit command followed by zeroes). A write operation is actually a transfer (write/read) operation sending the 32bit command and then sending the data. The data read in a write operation is discarded. Each word written results in a word read. Therefore, when the last word is read, then it is ensured that the whole SPI operation is finished and the bus will be reset (CS deassertion is the only thing you see externally from this operation).

The SPI channels have fifos, which are used. This improves transfer-rates to keep the SPI-bus occupied if the CPU is interrupted by another task by allowing TX/RX operation to continue while the CPU is busy elsewhere.

There are some small differences between ports SPI0 and SPI1. SPI0 is operated in 8-bit mode whereas SPI1 operates in 16-bit mode. This difference has no consequence for the actual operation on the bus, other than that the hardware may choose to insert a small dead-time between words. My tests show that this has no real consequences.

There are several issues that I have seen in the operation. The write-speed may be higher than the read-speed. This is due to the round-trip of the data in a read-operation, where you will lose synchronization between clock and data if you try to do it too fast. A write operation will discard any incoming data, so there is no issue there. You should always start slow and work your way up until it fails...

The RPI SPI bus is not buffered. This means that long cables are a no-go due to cable-capacitance. You also must ensure to do impedance matching to prevent reflections, meaning a (good and) small series resistor (usually about 75..120 Ohm) on the outgoing lines and it must be placed as close to the RPI 40-pin header as possible. If you implement physical (very) high-speed TX/RX buffers, then you will suffer in read-speed because you introduce at least 2 times 2..4ns delay on top of the round-trip of the rest of the hardware setup.

The 7i90 is where the rest of the magic happens. From what I read in this thread, there may be a timing issue in the 7i90 at these high speeds (maybe due to the slave's internal statemachine). From the RPI's perspective, there may be read/write operations that follow very fast. If the 7i90 requires time to process or to recover from CS deassertion, then there may be a problem. Also, CS assertion and deassertion timing is determined by the RPI hardware and assumes that the 7i90 will process the data properly.

If there is a data-transfer timing issue in the 7i90, then this should be discoverable by reducing the SPI clock significantly (lets say 500kHz or 1MHz). However, this will not change the CS deassertion timing (hardware determined).

As I read the thread, there is a problem in reading the 7i90. If the 7i90 can tolerate a read beyond the internal buffer, then you can add an extra word to be read, just to see if that changes the timing.

-- Greetings Bertho

pcw-mesa commented 6 years ago

The 7I90 SPI interface requires no time between cycles (all internal read,write operations are overlapped with the serial data transfer) CS deassertion between commands is not typically required unless the 7I90 internal state is unknown. Short CS deassertions (~10 ns to 50 usec) will clear the data bit counter, long CS deassertions (>50 usec) will do a complete 7I90 SPI state machine reset, cancelling any multiword transfers in process

BsAtHome commented 6 years ago

The SPI driver guarantees that all transfers are completed before CS is deasserted. Therefore, a statemachine reset can only occur if no transfer is active.

So, since it is the /last/ instruction in a sequence that causes problems, there is one other possibility. It has been said over email that appending a NOP in the transfer after the Do-It instruction makes things work. The added command word, leading the transfer, includes the length of the transfer. This length field is the length of the data /following/ the command-word. However, if the last command is ignored, should the command-word include the command-word itself in the length field?

rmu75 commented 6 years ago

The fact that the read of the sserial command register at 0x5A00 returns the last data written suggests that the write took place but perhaps the thread order is not as it should be (a pre-existing problem masked by the old code writing the sserial command register early).

A dummy transfer after DoIt makes things work, I think that rules out any ordering issues.

I don't really understand the VHDL code, but perhaps there is an issue with triggering the DoIt in the FPGA being done on somehow on the wrong clock? So that is it triggered by the start of the next command?

pcw-mesa commented 6 years ago

No, because the proper data is in the sserial command register

pcw-mesa commented 6 years ago

I don't really understand the VHDL code, but perhaps there is an issue with triggering the DoIt in the FPGA being done on somehow on the wrong clock? So that is it triggered by the start of the next command?

Actually that would have this type of symptom (writes not completed till the next SPI cycle) I'll test this in the hm2spi hardware sometime this coming week

pcw-mesa commented 6 years ago

rmu75 was correct, there was a bug in the SPI FPGA firmware. That will teach me to test by looping back data (since a read of loopback data following a write will work because the read finishes the write before it reads back the updated data) Simplest current workaound is to issue a dummy read after the last write (say read the hm2 cookie) I have fixed the FPGA source and am currently rebuilding all the 7i90_spi bitfiles...

andypugh commented 6 years ago

So, how do we fix this? Should we issue an error message if card=7i90 and firmware version < a number?

Or is adding a dummy write for every card the way to go? (It doesn't sound ideal)

pcw-mesa commented 6 years ago

A dummy read will work also. Probably best to just update firmware, I dont think there are very many users of hm2_rspi

andypugh commented 6 years ago

I will need to see if the sserial driver can tell that it is driving a 7i90. I would much prefer not to add a dummy anything that affects all the other drivers.

rene-dev commented 6 years ago

the sserial driver probably cant, I think that is a job for the spi driver to check the firmware version...

pcw-mesa commented 6 years ago

I would not do this at all and have the very few people using hm2_rspi update their 7i90 firmware

pcw-mesa commented 6 years ago

All 7I90_SPI bitfiles and VHDL source have been updated to fix this issue

rmu75 commented 6 years ago

I would not do this at all and have the very few people using hm2_rspi update their 7i90 firmware

I second this, especially as firmware flashing is really easy via spi.

All 7I90_SPI bitfiles and VHDL source have been updated to fix this issue

thanks, will try ASAP.