OVGN / OpenHBMC

Open-source high performance AXI4-based HyperRAM memory controller
Apache License 2.0

Failures on long runs. #15

Open AnttiLukats opened 1 year ago

AnttiLukats commented 1 year ago

This is now a very bad issue, sorry folks. After switching from BUFG mode to BUFR/BUFIO mode we did see it working well, but just in case we let the loop test keep running. The first time it ran for about a week before it failed. We are now running it all the time and checking every day whether it has failed, and we are seeing failures almost every day. So there is a likelihood that OpenHBMC fails within 24 hours of continuous testing.

This is bad: it actually means that OpenHBMC cannot be used in real products, as a once-a-day failure cannot be tolerated. If it works, it should work and not fail every other day.

We are sure that our target hardware is near ideal for HyperRAM testing - all hyperbus signals are LESS than 4 mm long! The layout is amazing: the HyperRAM sits right below the FPGA and the traces are really all in the 2..4 mm range. It can't get better than this, so it is for sure not a signal integrity issue.

Argh! I recall that we have reports from another HyperRAM IP vendor that some HyperRAM devices themselves have failures, i.e. real memory content losses. This would mean that the error is really in the HyperRAM device itself. But how to verify it? Right now we just run the memory tests in a forever loop; the error only tells us the memory width of the failing test, we do not see what data was read or written. And that is really not helpful for debugging.

As of now we also do not know whether the problem is still related to the Xilinx FIFO and is essentially the same failure as with the BUFG version, just happening more slowly.

We would be really happy to assist in debugging this problem; let us know if we can try something to rule out possible causes. I myself have few ideas what we could try.

One interesting option to test the IP would be using a CRUVI loopback adapter, but for this testing we would need a HyperRAM emulation model; I am guessing the model offered by Cypress would not work well in an FPGA :(

Anyway, we are happy to assist with this issue. We really would like to see HyperRAM working, and for well more than 24 hours!

UPDATE: there is a 3-year-old forum entry about errors that happen every 10..20 hours, with different hardware and a different IP core: https://forum.trenz-electronic.de/index.php/topic,1320.0.html

So there is a chance that there is something wrong with the HyperRAM chip itself?

UPDATE2: a different HyperRAM chip, different hardware and a different IP core, and also data corruption: https://community.infineon.com/t5/Hyper-RAM/HyperRAM-Memory-Corruption/td-p/281115

It would be really nice to see WHAT type of errors happen here in our testing...

OVGN commented 1 year ago

Hello!

This is bad: it actually means that OpenHBMC cannot be used in real products, as a once-a-day failure cannot be tolerated. If it works, it should work and not fail every other day.

Absolutely agree, and I want to make every effort to get OpenHBMC as stable as possible.

We are sure that our target hardware is near ideal for HyperRAM testing - all hyperbus signals are LESS than 4 mm long! The layout is amazing: the HyperRAM sits right below the FPGA and the traces are really all in the 2..4 mm range. It can't get better than this, so it is for sure not a signal integrity issue.

OpenHBMC's DRU (data recovery unit) module can handle quite a large uncertainty (up to 1/6 of the hyperbus clock period) between the data bits (relative to each other) and RWDS; this was proven in the DRU testbench. Also, PCB propagation delay is about 165 ps/inch, so 2..4 mm of trace makes an extremely small arrival-time difference.
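As a rough back-of-the-envelope check (assuming the 300 MHz serdes clock and the 1:3 serdes-to-hyperbus clock ratio mentioned later in this thread, i.e. a 100 MHz hyperbus clock):

$$T_{hbus} = \frac{1}{100\,\mathrm{MHz}} = 10\,\mathrm{ns}, \qquad t_{DRU} \approx \frac{T_{hbus}}{6} \approx 1.67\,\mathrm{ns}$$

$$\Delta t_{trace} \approx 165\,\mathrm{ps/inch} \times \frac{2\,\mathrm{mm}}{25.4\,\mathrm{mm/inch}} \approx 13\,\mathrm{ps} \ll t_{DRU}$$

So even a worst-case 2 mm trace-length mismatch contributes only about 1 % of the skew the DRU can tolerate.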

As of now we also do not know whether the problem is still related to the Xilinx FIFO and is essentially the same failure as with the BUFG version, just happening more slowly.

Why do you think there is something wrong with the Xilinx FIFO? I don't think so. There is probably some issue in my design.

Anyway, we are happy to assist with this issue. We really would like to see HyperRAM working, and for well more than 24 hours!

Me too! Let's try to fix it together.

So there is a chance that there is something wrong with the HyperRAM chip itself?

No, I can't and don't want to believe that.

OVGN commented 1 year ago

It would be really nice to see WHAT type of errors happen here in our testing...

Right! Let me modify the memory test software to print additional information about a detected error, i.e. the address, the expected data and the actual data. This is quite easy to do. I will commit the sources today.
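A minimal sketch of what such a check could look like (hypothetical names, not the actual committed source):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch only: on a mismatch, print the failing address, the expected
 * pattern and the value actually read back, then trap so the state can
 * be inspected over the debugger.                                       */
static void check_word(volatile uint32_t *addr, uint32_t expected)
{
    uint32_t actual = *addr;

    if (actual != expected) {
        printf("Error at: 0x%08lx expected: 0x%08lx actual: 0x%08lx\n",
               (unsigned long)(uintptr_t)addr,
               (unsigned long)expected,
               (unsigned long)actual);
        while (1) { }   /* halt here for JTAG/ILA inspection */
    }
}
```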

OVGN commented 1 year ago

I myself have few ideas what we could try.

  • change serdes clock from 300 to 301 MHz?
  • change IO slew rate to fast?
1. I strongly DO NOT recommend changing the ratio of the serdes and hyperbus clocks. The DRU was designed to work with a 1:3 clock ratio; other values are not supported by design.

2. Yes, you are free to change the slew rate to match the signal integrity of your PCB design. The FPGA's and RAM's IO drive strengths can be modified as well.

OVGN commented 1 year ago

Started a long-run memtest.


UPDATE: Well... we have a problem. A pretty fast error catch:

Iter:000001b6
    32-bit test: PASSED!
    16-bit test: PASSED!
     8-bit test: PASSED!
Iter:000001b7
Error at: 0x7608d190 expected: 0x89f72e6f actual: 0xffffffff

The CPU loops in a while(1) cycle after any error. If I stop the CPU and read the memory manually, the correct value is in RAM (see attached screenshot from 2023-03-02 11-32-09). So this is definitely not RAM corruption; something is wrong in my memory controller. Going to research this. I'm going to run the test again to collect some statistics: if I keep getting 0xffffffff, I can exclude this value from the test entirely and create an ILA trigger to catch exactly this incorrect read.
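One way to make that distinction automatic (a sketch, same caveats as the helper above): re-read the failing address right after the mismatch. If the second read returns the expected value, the stored data is intact and the failure is in the read path rather than in the RAM contents.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch only: after a mismatch, immediately read the same address again.
 * A correct value on the second read means the RAM contents are fine and
 * the first read was corrupted somewhere in the read path.               */
static void check_word_with_retry(volatile uint32_t *addr, uint32_t expected)
{
    uint32_t first = *addr;

    if (first != expected) {
        uint32_t retry = *addr;   /* manual re-read of the same location */
        printf("Error at: 0x%08lx expected: 0x%08lx first: 0x%08lx retry: 0x%08lx\n",
               (unsigned long)(uintptr_t)addr,
               (unsigned long)expected,
               (unsigned long)first,
               (unsigned long)retry);
        while (1) { }             /* halt for ILA trigger / JTAG inspection */
    }
}
```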

AnttiLukats commented 1 year ago

1) I am pretty sure this is the BUFG error; change to BUFR! We cannot run long-time testing with BUFG at all, as it fails FAST!

2) We cannot rule out that the HyperRAM chip itself is bad. We have a report from an industrial customer saying that OLDER versions of the HyperRAM die (rev A and B?) have problems with data corruption! This is confirmed; those guys really had issues with a commercial IP core and custom hardware.

3) Yesterday I saw an error in the 32-bit test while running the INV ADDR test. I changed the memtest to only run this test and started long-time testing yesterday (BUFR version); so far it is not failing, running 52+ hours now. I will report back when it fails, which can take time... (a sketch of this test pattern is below)
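For reference, a minimal sketch of an INV ADDR style pattern test, assuming the common convention of writing the bitwise inverse of each word's address as its data (base address, size and names below are placeholders):

```c
#include <stdint.h>

#define MEM_BASE  0x80000000u                 /* hypothetical HyperRAM base  */
#define MEM_WORDS (8u * 1024u * 1024u / 4u)   /* hypothetical 8 MB as words  */

/* Write the inverted address of every word, then read the whole range
 * back and verify. Returns 0 on pass, -1 on the first mismatch.         */
static int inv_addr_test(void)
{
    volatile uint32_t *mem = (volatile uint32_t *)MEM_BASE;

    for (uint32_t i = 0; i < MEM_WORDS; i++)
        mem[i] = ~(MEM_BASE + i * 4u);        /* write inverted address   */

    for (uint32_t i = 0; i < MEM_WORDS; i++)
        if (mem[i] != ~(MEM_BASE + i * 4u))   /* verify on read-back      */
            return -1;

    return 0;
}
```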

AnttiLukats commented 1 year ago

UPDATE: I double-checked the HyperRAM on our board; it is die REV D, the latest and recommended one. So we can assume we have no problems with the HyperRAM device itself? Rev D has an errata concerning short bursts, but that does not apply here.