enjoy-digital / litedram

Small footprint and configurable DRAM core

Changes to tREFI ignored #264

Closed SLongofono closed 2 years ago

SLongofono commented 2 years ago

I'm trying to induce refresh-rate errors to test a DDR3 memory by changing tREFI in modules.py, but some step in the HDL generation seems to be ignoring the value I have set.

As an example, when changing tREFI from the default value of 64e6/8192 to 64e6, I would expect there to be errors with a simple write-then-read test of the main memory.
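For reference, the arithmetic behind those two values can be sketched as follows. This assumes the tREFI value is expressed in nanoseconds and that 8192 refresh commands cover the whole array once (the usual DDR3 arrangement of a 64 ms window split into 8192 refreshes). The function name is illustrative and is not part of litedram.

```python
# Assumption: tREFI is in nanoseconds, as the 64e6/8192 expression suggests.
DEFAULT_TREFI_NS = 64e6 / 8192   # 64 ms window spread over 8192 refresh commands
MODIFIED_TREFI_NS = 64e6         # one refresh command every 64 ms

def full_refresh_period_s(trefi_ns, refresh_commands=8192):
    """Seconds needed to issue one full round of refresh commands (hypothetical helper)."""
    return trefi_ns * refresh_commands * 1e-9

print(DEFAULT_TREFI_NS)                          # interval between refresh commands, in ns
print(full_refresh_period_s(DEFAULT_TREFI_NS))   # roughly 0.064 s
print(full_refresh_period_s(MODIFIED_TREFI_NS))  # roughly 524 s
```

Under these assumptions the modified setting stretches a full refresh round from about 64 ms to several minutes, which is why errors would be expected if the new value were actually applied.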

Is there some safety rail in place that is silently overriding the values I enter in modules.py?

If it is relevant, I'm targeting the Xilinx vc707 evaluation board with default settings.

mithro commented 2 years ago

@kgugala - Anyone at Antmicro know things about this?

gatecat commented 2 years ago

Can you describe the test you are doing in a bit more detail? I've personally seen cases where the memory accesses were only hitting the L2 cache rather than the DRAM itself, which confused the tests.

SLongofono commented 2 years ago

Sure, I'm following the example in memtest.c as a baseline.

This test runs as-is as a part of the boot/initialization process. It completes without reporting errors.

I've simplified it somewhat to implement a "walking ones" test (0x1, 0x2, 0x4, ... 0x800000), but the two calls to flush the cpu/l2 cache are still there. Same results, no errors.
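A rough sketch of that kind of walking-ones check, written in Python against a plain list rather than real DRAM. It is only meant to show the pattern sequence and the write-then-read structure. It is not the code from memtest.c, it does not model the cache flushes, and `walking_ones` and `write_then_check` are made-up names.

```python
def walking_ones(width=24):
    """Yield 0x1, 0x2, 0x4, ... up to 1 << (width - 1); width=24 ends at 0x800000."""
    for bit in range(width):
        yield 1 << bit

def write_then_check(memory, patterns):
    """Write each pattern to every word, read it back, and count mismatches."""
    errors = 0
    for pattern in patterns:
        for addr in range(len(memory)):
            memory[addr] = pattern
        # No delay here: the read-back follows the write immediately.
        for addr in range(len(memory)):
            if memory[addr] != pattern:
                errors += 1
    return errors

memory = [0] * 16  # stand-in for the region under test
errors = write_then_check(memory, walking_ones())
print(errors)  # 0, since a Python list never loses data
```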

jedrzejboczar commented 2 years ago

The memtest you are using has minimal delay between writing all the data and reading it back, which might be too short to actually observe the errors. You could try adding a few seconds of delay between writing and reading.

I did some tests on an Arty board with DDR3, using DMA to fill and then scan the whole memory with our rowhammer tester. I tried the same modification of tREFI as you did and needed ~1 second between writing and reading to see any errors. With a 3-second delay there were ~80 errors (across the whole memory; errors are counted as erroneous 128-bit transfers, not as bitflips). When doing repeated scans every 0.5 seconds there were no errors at all (I scanned for ~5 minutes). This is most likely because a read scan is in effect a refresh of the whole memory (each row has to be opened and closed, which refreshes it).
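The effect can be illustrated with a toy model. It assumes each row has some retention time and loses data only if the gap between two accesses to it exceeds that time, with the regular refresh effectively disabled. The retention values below are invented for illustration and are not measurements from any board.

```python
# Hypothetical per-row retention times in seconds (made-up values).
RETENTION_S = [0.8, 1.5, 2.5, 4.0, 10.0, 60.0]

def failing_rows(retention_s, gap_s):
    """Return indices of rows whose retention time is shorter than the access gap."""
    return [i for i, r in enumerate(retention_s) if r < gap_s]

# Scanning every 0.5 s: each scan opens and closes every row, so the gap stays at 0.5 s.
print(failing_rows(RETENTION_S, 0.5))  # []

# A single write, a 3 s pause, then a read: the weakest rows have already decayed.
print(failing_rows(RETENTION_S, 3.0))  # [0, 1, 2]
```

In this model, what matters is the longest time a row goes without being accessed, not the total test duration, so a memtest that reads back immediately or rescans frequently may never see a failure.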

SLongofono commented 2 years ago

@jedrzejboczar thanks for the additional insight. I had not considered that I could be reading fast enough to effectively refresh the region under test. That makes good sense. After about 5 seconds of pause, I started to see errors.