MEGA65 / mega65-core

MEGA65 FPGA core
Other
237 stars 84 forks source link

SDRAM controller intermittant problems #802

Open gardners opened 1 month ago

gardners commented 1 month ago

SDRAM fails with test program sometimes. Suspect that command to set SDRAM latencies doesn't always work, so it has latency set one cycle wrong, and timing requirements are then not met at full speed.

Attached test program fails sometimes attictest.prg.gz

Failing test runs can look like this: mega65-screen-000013

gardners commented 1 month ago

The two failing outputs are either:

  1. a shift by 2 bytes, which since the SDRAM is DDR means 1 cycle. This is the latency difference between the default and requested mode of operation.
  2. First byte of an 8-byte block is wrong. This is likely when the SDRAM is not yet set to the higher clock speed, but is driven at the higher speed, and thus the first byte isn't ready yet (first byte has tighter timing than the subsequent bytes, if I remember the data sheet correctly):

mega65-screen-000129

Another form the failure can take is that some bits of first byte in a row of 8 are stuck high or low. Again, this looks like a variation of (2) above. Actually, it is reading the image of the same bit from another byte, in some cases byte 2 of the row of 8, suggesting that it is reading a bit too late in the DDR if the SDRAM has been put into the correct latency mode (which can't be assumed).

In terms of things that trigger changes between PASS and FAIL behaviour:

  1. Soft resets don't change the status of passing or failing: 10 in a row failing case. 10 in a row passing case.
  2. Power cycles can toggle pass/fail status.
  3. Pushing a bitstream via JTAG can toggle status.

The above is consistent with the SDRAM controller attempting to set the latency mode on bitstream start up, but not on reset button press. This makes sense, because SDRAM controller does not have a /RESET input. Maybe it should?

gardners commented 1 month ago

Trying software induced reset of SDRAM chip by writing to $C000000 doesn't work (although SDRAM controller does report that it is trying to reset when this occurs).

gardners commented 1 month ago

Note that the "slow mode" for the SDRAM controller only changes the read latency -- it doesn't change the SDRAM clock from 162MHz to 81MHz -- which is why if the Mode Register Set operation fails, that the test continues to fail. In principle, if we dropped the clock speed of the SDRAM in half as well, it should just magically work.

gardners commented 1 month ago

SDRAM is directly clocked from 162MHz clock via an ODDR primitive, not that it should matter.

gardners commented 1 month ago

Things that might be worth trying:

  1. Send Mode Register Set command multiple times in a row, to make sure there are no set up time problems with the bits of the command.
  2. Allow switching between latching read data on +ve or -ve edge of 162MHz clock.
  3. Check that we output signal updates on falling edge of 162MHz clock (or that the ODDR is inverting it), so that setup and hold times for the SDRAM lines are met.
  4. Try adjusting phase of SDRAM clock (clock162m) from 207 degrees to something else. Ideally make it run-time adjustable. Actually its currently 180 degrees according to clocking.vhdl. This will require enabling MMCM_ADV dynamic reconfiguration: See https://docs.amd.com/v/u/en-US/ug362.
gardners commented 1 month ago

Commit above should allow (2).

gardners commented 1 month ago

Goals for tonight: (1) Test the above. (2) Drop clock from 162 to 81MHz and see if that fixes it, if so, check whether it is writing or reading that is failing.

gardners commented 1 month ago

With the above we have 8 settings that can be explored. With the SDRAM controller in the "failure" mode, setting $01 seems to allow correct writing, while $02 seems to allow correct reading. Those two settings switch the clock phase on the 162MHz clock connected to the SDRAM controller.

This points to an issue during writing in terms of the underlying problem.

(Note that if the problem is with write timing latching, this would explain why writing to the configuration register is not always working.)

Now trying to get it into the passing state, to see how these affect it in that situation.

Ok, in the passing state, reading works fine with $02, and writing with $05. Reading with $05 in the passing state works, but with an extra cycle of delay on the read data.

Common theme: Switching the clock phase between reading and writing helps. Now, what's weird is that bit 2 of this value should switch whether we latch reads on rising or falling edge of the clock, so we shouldn't need to switch the clock phase... but it doesn't. So there is some other thing going on here.

Anyway, I'm adding the ability to add an extra cycle of latency to the read side, so that in the passing state it should be possible to have a single setting that works for both read and write. If that works, then I can look at having a flag that allows swapping clock phase between read and write. With those two things, it should be possible to have the SDRAM work correctly in both states. This would then just require a bit of boot-strap code that determines the correct timing setting on boot.

I'd still like to figure out why it doesn't reliably configure into one or the other state, however.

gardners commented 1 month ago

The above commits should allow automatic inversion of the clock when writing via bit 4 of $C000000. But first to test the previous commit for adding the extra read latency when required.

gardners commented 1 month ago

Using the above new bitstream, I wrote the following program to detect which configurations in $C000000 work:

0 G=-1                                                                           
 10 C=$C000000                                                                    
 20 S=$8000000                                                                    
 30 FORCC=0TO63:POKE C,CC                                                         
 50 FORI=0TO15:POKES+I,I:NEXTI                                                    
 60 REM  FORI=0TO15:PRINTRIGHT$(HEX$(PEEK(S+I)),2);" ";:NEXTI                     
 70 REM PRINT                                                                     
 80 FORI=0TO15:IFPEEK(S+I)=ITHENNEXTI                                             
 90 IFI<16THEN GOTO 999                                                           
 100 FORI=0TO15:POKES+I,128+I:NEXTI                                               
 110 FORI=0TO15:IFPEEK(S+I)=128+ITHENNEXTI                                        
 120 IFI<16THEN GOTO 999                                                          
 130 PRINTHEX$(CC)                                                                
 140 IFG=-1THENG=CC                                                               
 999 NEXTCC                                                                       
 1000 POKEC,G                                                                     
 1010 PRINT"USING CONFIG ";HEX$(G)

This suggests that several settings work, at least on this boot (not sure if it is a fail or pass mode case).

gardners commented 1 month ago

I've then had to patch attictest.prg, as to bust the cache it writes to ADDR + something, which means for the end part of the test, it writes to the $Cxxxxxx where the SDRAM controller is treating any write as setting the configuration. I should also make the config writes only work at a specific address, e.g., $EFFFFFF, to minimise this kind of error.

gardners commented 1 month ago

It looks that previous run was with it in the "pass" state. With it in the fail state, no combination passes the 2nd half of the test, where we write different values over the 16 byte range to check.

Okay.... I just somehow flipped the "fail" mode to "good" mode. Poked $D7FE to $10, $30, $00 and $10. But not sure if it is that sequence that did it or not. Anyway, this needs further investigation to figure out what is going on.

gardners commented 1 month ago

Reproduced by poking $D7FE,$00 and then $D7FE,$10

@lydon42 's sdram test program then passes.

The only other change here is that I poke $C000000 twice. But that shouldn't make a difference.

Ah, okay, so actually it is that $D7FE must have $10, not $30. Just that single POKE $D7FE,$10 before running the SDRAM settings program is enough to make it work in the "fail" state now, it seems.

gardners commented 1 month ago

Now trying loading the bitstream many times, running @lydon42 's SDRAM test, and if it fails, running my updated SDRAM settings program that does $D7FE,$10 first, to see if we get reliable SDRAM passing.

sdram_settings3.prg:

0 G=-1                                                                           
 5 POKE$D7FE,$10                                                                  
 10 C=$C000000                                                                    
 20 S=$8000000                                                                    
 30 FORCC=0TO63:POKE C,CC                                                         
 50 FORI=0TO15:POKES+I,I:NEXTI                                                    
 60 REM  FORI=0TO15:PRINTRIGHT$(HEX$(PEEK(S+I)),2);" ";:NEXTI                     
 70 REM PRINT                                                                     
 80 FORI=0TO15:IFPEEK(S+I)=ITHENNEXTI                                             
 90 IFI<16THEN GOTO 999                                                           
 100 FORI=0TO15:POKES+I,128+I:NEXTI                                               
 110 FORI=0TO15:IFPEEK(S+I)=128+ITHENNEXTI                                        
 120 IFI<16THEN GOTO 999                                                          
 130 PRINTHEX$(CC)                                                                
 140 IFG=-1THENG=CC                                                               
 999 NEXTCC                                                                       
 1000 POKEC,G                                                                     
 1010 PRINT"USING CONFIG ";HEX$(G)                

Patched attictest.prg:

  1. Added line 2117 BEND
  2. Added line 3015 IF B>=$C000000THENB=B-$4000000

Run 1 - Was in FAIL state, but running sdram_settings3.prg fixed it. Used $C000000 value $09 (but many others (more than the 4 in run 2) were indicated as likely good. I wasn't expecting changes in the list of good configs, so didn't write them down). Run 2 - Was in FAIL state, but running sdram_settings3.prg fixed it. Used $C000000 value $09 (but three others were also indicated as likely to be good). Run 3 - Was in FAIL state, Configs $09, $1A, $29 and $3A indicated as likely good. Config $09 worked. Run 4 - Was in FAIL state, Configs $09, $1A, $29 and $3A indicated as likely good. Config $09 worked. Run 5 - Was in FAIL state, Configs $09, $0D, $1A, $1E, $29, $2D, $3A and $3E indicated as likely good. Config $09 worked. Also tried $3E for fun, and that worked, too.

gardners commented 1 month ago

Above commit makes config $09 default. While that synthesises, continuing to test with previous bitstream:

Run 6 - Was in FAIL state, Configs $09, $0D, $1A, $1E, $29, $2D, $3A and $3E indicated as likely good. Config $09 worked. Run 7 - Was in FAIL state, Configs $09, $0D, $1A, $1E, $29, $2D, $3A and $3E indicated as likely good. Config $09 worked. Run 8 - Was in FAIL state, Configs $09, $0D, $1A, $1E, $29, $2D, $3A and $3E indicated as likely good. Config $09 worked. Run 9 - Was in FAIL state, Configs $09, $0D, $1A, $1E, $29, $2D, $3A and $3E indicated as likely good. Config $09 worked. Run 10 - Was in FAIL state, Configs $09, $1A, $29 and $3A indicated as likely good. Config $09 worked. Run 11 - Was in FAIL state, Configs $09, $0D, $1A, $1E, $29, $2D, $3A and $3E indicated as likely good. Config $09 worked. Run 12 - Was in FAIL state, Configs $09, $1A, $29 and $3A indicated as likely good. Config $09 worked.

12/12 is pretty good :)

gardners commented 1 month ago

Still waiting for that bitstream, but running the old hyperram test program. It is reading $AE instead of $AA for some bytes when doing the size probe, typically around the ~35MB mark. Read stability test looks good. Mis-write test is also good so far. Checkerboard test looking good, too. Oddly after running checkerboard test SDRAM size probe is now consistently fine ???

gardners commented 1 month ago

Okay, new bitstream is cooked. Testing attictest2.prg on repeated bitstream loads...

Run 1 - Instant pass :) Run 2 - Instant pass :) Run 3 - Instant pass :) Run 4 - Instant pass :) Run 5 - Instant pass :) Run 6 - Instant pass :) Run 7 - Instant pass :) Run 8 - Instant pass :) Run 9 - Instant pass :) Run 10 - Instant pass :) Run 11 - Instant pass :) Run 12 - Instant pass :)

So let's look at what config $09 actually means:

  1. Bits 0 and 1 mean the SDRAM clock is inverted compared with what we had before.
  2. Bit 2 is clear, so we don't latch on falling edge.
  3. Bit 3 is set, so we have 1 cycle of extra latency for reads.
  4. Bit 4 is clear, so we don't invert clocks on write.

Basically we have flipped the SDRAM clock and removed the need for all the inverted clock stuff that we had. So why did this not work originally? Still some questions remain.

gardners commented 1 month ago

Now I need others to test on their R5/R6 boards...

gardners commented 1 month ago

Hmm... Here's my theory as to what might be going on: We were latching on the opposite side of the clock, and thus it was marginal timing as to whether the data arrived a cycle early or not, causing the appearance of the SDRAM maybe not having accepted the latency count via the configuration set command. It's still not totally satisfying, however, as we might expect to see some jitter between reads.

gardners commented 1 month ago

$FFD37FE must be $10, not $14 for SDRAM now, else some failures occur. The difference is that $14 enables the cache. So we have some bug with the slow RAM cache with SDRAM still. That said, the SDRAM without cache is not so much slower than the HyperRAM with cache. But it would still be good to get this working.

Hmm... or maybe I have tickled some other problem.

Okay, so there is still some problem: Even addressed bytes seem to be ok, but odd addressed bytes are being smeared between successive odd addresses it seems. e.g.:

.sc000000 9

.m8000000
:08000000:00000000000000000000000000000000

.s8000001 ff

.m8000000
:08000000:00B70000000000000000000000000000

.s8000001 0 0 ff

.m8000000
:08000000:004800B7000000000000000000000000

.s8000001 0 0 0 0 ff

.m8000000
:08000000:0000004800B700000000000000000000

(not $48 and $B7 are complementary in binary)

But is this happening during read or write?

Changing from config $09 to $02 after the write allows us to read back the $FF correctly. So I'm suspecting it is on the read side.

.sc000000 9

.s8000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

.m8000000
:08000000:00000000000000000000000000000000

.s8000001 ff

.m8000000
:08000000:00B70000000000000000000000000000

.sc000000 2

.m8000000
:08000000:00FF0000000000000000000000000000

BUT in config 2, we can't write anything.

Configs 9 and 2 differ effectively in which side of the clock writes and commands happen on, because flipping bits 0,1 and 3 inverts both the clock phase, and the phase on which reads occur.

This means that the bit for switching whether clock phase is flipped during writes should be working around exactly this problem. But it's not.

If we are issuing commands on wrong side of the clock, then the incorrect command can be latched, including config register set commands. However, we'd expect then that reads and writes would go missing. Well, we are seeing writes go missing in some situations of this...

Configs 2 and 9 should have identical read timing, but clearly don't:

.sc000000 2
.M8000000
:08000000:11223344556677889900000000000000
...
.sc000000 9
.M8000000
:08000000:11623344552E77889900000000000000

Config 9 writes fine, but doesn't read fine.
Config $D = config 9, with read phase switched, and indeed it fixes the reading:

.sc000000 d
.M8000000
:08000000:11223344556677880000000000000000

It should only affect reading, and not writing. And this time at least, it does. Config $D is working with $FFD37FE = $10, as expected.

Passes the attictest2.prg

gardners commented 1 month ago

Updating sdram_settings4.prg to check odd address behaviour:

0 G=-1                                                                           
 5 POKE$D7FE,$10                                                                  
 10 C=$C000000                                                                    
 20 S=$8000001                                                                    
 30 FORCC=0TO63:POKE C,CC                                                         
 40 IFCCAND3=2 OR CCAND3=1THEN NEXT CC                                            
 50 FORI=0TO15:POKES+I,I*17:NEXTI                                                 
 55 FORI=0TO15:Z=PEEK($9000000+I):NEXT:REM BUST CACHE                             
 60 REM FORI=0TO15:PRINTRIGHT$(HEX$(PEEK(S+I)),2);" ";:NEXTI                      
 70 REM PRINT                                                                     
 80 FORI=0TO15:IFPEEK(S+I)=I*17THENNEXTI                                          
 90 IFI<16THEN GOTO 999                                                           
 100 FORI=0TO15:POKES+I,64+I*17:NEXTI                                             
 105 REM FORI=0TO15:PRINTRIGHT$(HEX$(PEEK(S+I)),2);" ";:NEXTI                     
 106 REM PRINT                                                                    
 110 FORI=0TO15:IFPEEK(S+I)=((64+I*17)AND255)THENNEXTI                            
 120 IFI<16THEN GOTO 999                                                          
 130 PRINTHEX$(CC)                                                                
 140 IFG=-1THENG=CC                                                               
 999 NEXTCC                                                                       
 1000 POKEC,G                                                                     
 1010 PRINT"USING CONFIG ";HEX$(G)    

It now recommends config $D instead of config $9.

hyperramtest.prg also likes config $D better. Read stability etc now happy with the cache enabled :)

gardners commented 1 month ago

Now to try many runs and make sure it always works...

Run 1 - good Run 2 - good Run 3 - BAD!

The smearing of bits between the successive odd bytes cannot be resolved by any configuration.

Tellingly, the 2 byte offset is happening again here, so I am still very much of the view that the root cause here is that the configuration register set command is not being accepted.

Try issuing config register set twice in a row, so that the values are stable on the SDRAM bus, even if it is latching on wrong clock half?

gardners commented 1 month ago

That change is synthesising, but let's see how often this problem is occurring

4 more good runs in a row. 1 bad 2 good 1 bad ...

Let's see if the new bitstream helps.

gardners commented 1 month ago

With the new bitstream: 15 good run (and showing many more viable options: $02, $0d, $11, $1e, $22, $2e, $31, $3D (sometimes it has $e and $1d instead? or was that my imagination) In each case, config $2 was selected as the one to use. Updating VHDL to default to that.

gardners commented 1 month ago

Now doing repeated runs to confirm function, with attictest2.prg

7 consecutive run(s) passed, just using default config $2. 1 fail. No configs work. Odd addresses are having their bits spread between successive clocks again.

Changing read latching between positive to negative edge does not resolve the problem this time. Will begin investigating why using writes to $C000000 isn't able to resolve it via repeated configuration of the SDRAM.

gardners commented 1 month ago

Okay, for the first time, I have flipped the SDRAM from fail to pass state, without resetting the FPGA:

.sc000000 2d

.M8000000
:08000000:2D2D2D2D000102032D2D2D2D08090A0B

.sffd37fe 14

.M8000000
:08000000:000102030405060708090A0B0C0D00F

.s8000000 99

.M8000000
:08000000:990102030405060708090A0B0C0D00F

How and why?

gardners commented 1 month ago

Get it to boot in fail state again, and then try to reproduce the mode fix.

Ah, but the new bitstream has also built. So let's test performance again... 20 consecutive pass runs

Okay, so that looks good.

Let's disable the $Cxxxxxx debug register, so that test programs can't mash it.

gardners commented 1 month ago

Something is still sensitive for timing, as the first commit above consistently fails SDRAM test. The 2nd commit above is to resolve this by making it configurable initially, but then allowing $Cxxxxxx to ignore writes after that, thus forcing the logic to still be in there.