enjoy-digital / litedram

Small footprint and configurable DRAM core

Add LPDDR4 PHY #224

Closed jedrzejboczar closed 3 years ago

jedrzejboczar commented 3 years ago

This would add support for LPDDR4 PHYs in LiteDRAM. It is still WIP but will be useful to track progress.

In the current implementation there is LPDDR4PHY, which will serve as the base class for a Series 7 PHY that is yet to be added. The base PHY's job is to convert DFI commands to the LPDDR4 command format and to provide signals that are ready to be serialized by the hardware blocks of a concrete PHY.

LPDDR4 sends commands on the CA[5:0] SDR lines. DRAM commands consist of 1 or 2 "subcommands" (e.g. ACT = ACT-1 + ACT-2) and each subcommand is sent over 2 clock cycles. The translation has been implemented and there are tests in the Migen simulator that verify the serialization of CA. Due to the 16n prefetch of LPDDR4 the PHY uses 8 DFI phases, which avoids the complications that would arise if write/read bursts had to be spread over more than 1 cycle of the DRAM controller clock.
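
As a rough sketch of the arithmetic behind the choice of 8 phases (my own back-of-the-envelope numbers, not code from the PR):

    # LPDDR4 16n prefetch -> a minimum burst of 16 data beats per DQ line
    prefetch = 16
    # DQ is DDR, so each DRAM clock cycle carries 2 data beats
    beats_per_dram_clock = 2
    # with 1 DFI phase per DRAM SDR clock, 8 phases cover a full burst
    # within a single memory controller (sys) clock cycle
    nphases = prefetch // beats_per_dram_clock
    assert nphases == 8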

jedrzejboczar commented 3 years ago

While working on this I needed to make some changes in litex/litedram that can be merged independently, so that this PR stays smaller and includes only the changes directly required to add LPDDR4 support.

The following PRs are actually not related to LPDDR4, just minor changes:

These are required to add LPDDR4, but I tried to extract only the functionality that is not strictly LPDDR4-specific:

While working on the simulation I also implemented a generator of GTKWave savefiles that automatically retrieves signal names, groups signals and applies sorting/coloring. Currently it is defined here with a usage example. I thought it could be useful for litex_sim and litescope, so I'll create a separate PR where it can be discussed.

jedrzejboczar commented 3 years ago

I wanted to add the simulations that run in Verilator to CI. Currently I added them as regular unit tests: test_lpddr4.py#L1013-L1020, and wanted to run them as part of the test suite. I created a PR which allows speeding up the simulations: https://github.com/enjoy-digital/litex/pull/813. With the speedup, running these 2 tests on my machine takes ~5 minutes, which seems like a reasonable time. But I wonder if running them in a separate GitHub workflow would be preferred? This could probably be achieved with proper filtering of test cases, or they could be run as an entirely separate script, as the benchmarks were. What do you think about integrating these tests?

As a side note, while looking at the CI I noticed that, after moving to GitHub Actions, the benchmarks with gh-pages deployment are no longer being run. Should we add a separate workflow to run these, as was done with Travis (travis.yml)?

jedrzejboczar commented 3 years ago

As the PR is quite big I wanted to give an overview of the changes and potential points to be discussed. Sorry that the description turned out so long, but I hope it is helpful when reviewing the changes. If you have any remarks regarding the implementation I can work on improving it.

Directory structure

Initially I was writing everything in one file, as the other PHYs do, but as the amount of code increased I split it into several files, all grouped under the litedram/phy/lpddr4 directory. The new files are:

litedram/phy/lpddr4/
├── basephy.py
├── commands.py
├── __init__.py
├── s7phy.py
├── simphy.py
├── sim.py
├── simsoc.py
└── utils.py
test/
└── test_lpddr4.py

There are also some required modifications to existing files:

litedram/
├── init.py        # LPDDR4 initialization sequence and other BIOS-related definitions
├── modules.py     # LPDDR4 module: MT53E256M16D1
└── common.py      # defines the LPDDR4 burst length
.github/workflows/
└── ci.yml         # dependencies for running Verilator

Core

The LPDDR4 PHY is split into a vendor-agnostic core that runs only in the memory controller's clock domain and is then extended by different wrappers.

Commands

commands.py contains modules for translating DFI commands to sequences on the LPDDR4 CS/CA lines. LPDDR4 requires sending some of the commands as pairs of subcommands, e.g. READ (cas_n=0, ras_n=1, we_n=1) becomes READ-1 + CAS-2. Each subcommand is sent over 2 clock cycles, so a command can take 2 or 4 clock cycles. In the first stage a DFI command is translated into the corresponding pair of subcommands using DFIPhaseAdapter (one for each phase). The adapters then use Command to map subcommands to CS/CA sequences based on the command truth table.
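
To illustrate the first stage, here is a simplified sketch of the mapping (only a few commands shown; the structure and names are illustrative, not the actual tables from commands.py):

    # A DFI command, decoded from (cas_n, ras_n, we_n), maps to one or two LPDDR4
    # subcommands; each subcommand is then expanded into 2 clock cycles of CS/CA
    # values according to the LPDDR4 command truth table.
    DFI_TO_LPDDR4_SUBCOMMANDS = {
        # (cas_n, ras_n, we_n): (first subcommand, second subcommand)
        (0, 1, 1): ("READ-1",  "CAS-2"),
        (0, 1, 0): ("WRITE-1", "CAS-2"),
        (1, 0, 1): ("ACT-1",   "ACT-2"),
        (1, 0, 0): ("PRE",     None),    # PRECHARGE is a single 2-cycle command
    }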

As not all LPDDR4 commands map directly to DFI commands (there are more LPDDR4 commands), we handle the DFI ZQC and MRS commands in a special manner. LPDDR4 has a Mode Register space of 256 registers, so 8 bits are needed to address a register. We use DFI.address to encode both the address and the value of the MRS (Mode Register Set) command, as defined here (other PHYs use DFI.bank for the register address and DFI.address for the register value). ZQC is translated to the LPDDR4 MPC (Multi-Purpose Command). The MPC operand (OP[6:0]) is sent on DFI.address and is interpreted as defined here. Both MRS and ZQC are handled in the initialization code here.
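
As a purely hypothetical sketch of what such an encoding could look like (the actual bit layout is the one linked above; the helper below is only for illustration):

    # Hypothetical packing of an MRS command into dfi.address (layout assumed)
    def encode_mrs(ma, op):
        # ma: 8-bit Mode Register address, op: 8-bit value to write
        assert 0 <= ma < 256 and 0 <= op < 256
        return (ma << 8) | op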

ISSUE: Currently ZQC is performed only during initialization. Doing ZQC at runtime will require a specialised Refresher implementation, because ZQC has to be done as two different commands and is performed in the background (so other commands can be issued in between) but takes a lot of time (so we cannot just block other commands for that long). It is described here.

In LPDDR4 there are also separate commands for WRITE and MASKED-WRITE. Masked write has a significantly increased tCCD (32 tCK vs 8 tCK for a non-masked write). Currently the masked_write parameter defines which command is used; it defaults to MASKED-WRITE to avoid issues when masking is needed.

I am not sure how to solve the situation of tCCD changing based on the PHY's write command type, because in general the SDRAMModule and the PHY are created independently, so this would have to be configured in the SoC. Currently I put both options in the module, but we always use the larger one.

PHY

basephy.py defines LPDDR4PHY, which is the core of the PHY; wrappers use it as a base class. It is meant to work in the sysclk domain and it converts self.dfi to self.out of type LPDDR4Output, which groups all the signals that need to be serialized in a given sysclk cycle. Concrete implementations of LPDDR4 PHYs derive from this class and (de-)serialize LPDDR4Output (e.g. the Series 7 PHY uses I/OSERDESE2 primitives).

LPDDR4PHY has 8 DFI phases due to the 16n prefetch used in LPDDR4 memory. This way we can write a whole burst in a single memory controller clock cycle. The core PHY implementation already includes BitSlip modules and provides CSRs to control read/write bitslips. However, any DQ/DQS delays are to be implemented in concrete PHYs.

The PHY instantiates one DFIPhaseAdapter for each phase (8 total). Because a command sent on any phase can span up to 4 cycles (=4 phases; 1 phase maps to 1 DRAM SDR cycle), there is currently some logic to prevent command overlaps. This should be removed (or made optional) in the near future to avoid wasting resources, because the DRAM module timings should guarantee that no overlaps occur. Sending overlapping commands will then be considered undefined behavior (commands from all phases will simply be ORed). Another consequence of commands spanning several phases is that commands may span two subsequent sysclk cycles (a command on phase 6 will effectively span phases 6, 7, 0, 1). For this, constant BitSlips have been used (increasing the latency by 1).

ConstBitSlip is just a minor modification of BitSlip, so maybe it would be good to extend the BitSlip class with an option for a constant slip.
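
For reference, a minimal Migen sketch of what a constant bitslip amounts to (my simplified illustration, not the actual ConstBitSlip implementation):

    from migen import Module, Signal, Cat

    class ConstBitSlipSketch(Module):
        # Outputs the input word shifted by a constant number of bits, borrowing
        # the missing bits from the previous cycle (hence the +1 cycle of latency).
        def __init__(self, dw, slp):
            assert 0 <= slp <= dw
            self.i = Signal(dw)
            self.o = Signal(dw)
            last   = Signal(dw)
            window = Signal(2*dw)
            self.sync += last.eq(self.i)
            self.comb += [
                window.eq(Cat(last, self.i)),
                self.o.eq(window[slp:slp+dw]),
            ]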

The rest of the PHY is fairly similar to other PHYs, performing DQ/DQS/DMI serialization.

Because concrete implementations will further increase PHY latency, LPDDR4PHY provides the latency parameters ser_latency and des_latency that should be passed by the wrapper. These are used in the core to correctly calculate PhySettings. The latency calculations have been written in a verbose manner, so it should be easier to analyze them and find possible bugs.

Double rate PHY

Because we use 8 DFI phases, DDR signals like DQ would require 16:1 serialization. Series 7 FPGAs provide OSERDESE2, which in theory can do up to 14:1, but that is not enough. For this reason DoubleRateLPDDR4PHY is a wrapper over LPDDR4PHY that does partial (de-)serialization, effectively halving the widths of all signals so that 8:1 serializers can be used. This however increases PHY latency, and I believe the current implementations of Serializer and Deserializer could be improved to add lower latencies than they do now.

I'm not sure about the name though, maybe something better could be used.
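
Regardless of the name, the idea of the intermediate stage can be sketched roughly like this (my simplification; it glosses over the sys/sys2x phase alignment and input registering that the real Serializer has to handle):

    from migen import Module, Signal, Mux

    class HalfRateSerializerSketch(Module):
        # Takes a 2*w-wide word each "sys" cycle and outputs w bits per "sys2x"
        # cycle, halving the width that the 8:1 hardware serializers have to handle.
        def __init__(self, w):
            self.i = Signal(2*w)  # updated once per sys clock
            self.o = Signal(w)    # consumed once per sys2x clock
            half = Signal()
            self.sync.sys2x += half.eq(~half)  # must be phase-aligned to sys in a real design
            self.comb += self.o.eq(Mux(half, self.i[w:], self.i[:w]))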

Series 7 PHY

S7LPDDR4PHY wraps DoubleRateLPDDR4PHY, adding I/OSERDESE2 and I/ODELAYE2 primitives. It is fairly similar to the regular S7DDRPHY. Currently Artix 7 is not supported, but in the near future a with_odelay argument should be added, the same as in S7DDRPHY.

Simulation

Along with the implementation of the LPDDR4 PHY there are also ways to test the PHY in simulation.

In lpddr4/simphy.py there are implementations of LPDDR4SimPHY and DoubleRateLPDDR4SimPHY, which wrap the core and perform serialization using Migen serializers. These classes can also serve as a reference when implementing concrete PHYs. The simulation PHYs are used directly for Migen unit tests. Unit tests are defined in test_lpddr4.py (LPDDR4Tests, and TestSimSerializers which tests Serializer/Deserializer). The tests simply specify sequences of commands on the DFI and check the expected sequences on the pads.
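
The general shape of such a test, as a generic sketch (the DUT below is a trivial stand-in; the real tests drive phy.dfi and compare the recorded pad values):

    from migen import Module, Signal, run_simulation

    class _RegisteredEcho(Module):
        # stand-in DUT: in the real tests this would be LPDDR4SimPHY
        def __init__(self):
            self.i = Signal(8)
            self.o = Signal(8)
            self.sync += self.o.eq(self.i)

    def test_registered_echo():
        dut = _RegisteredEcho()
        def stimulus():
            yield dut.i.eq(0xa5)          # drive inputs (real tests: DFI commands per phase)
            yield                         # advance the sys clock
            yield
            assert (yield dut.o) == 0xa5  # check outputs (real tests: expected pad sequences)
        run_simulation(dut, stimulus())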

Aside from the Migen tests, there is also an implementation of an LPDDR4 DRAM simulator in lpddr4/sim.py. It has been written based on the LPDDR4 documentation. It is basically a command decoder with logic responsible for transmitting data. A SimLogger class has been developed to improve the simulator (it could be useful for other simulations); it allows for convenient logging of errors from a comb context (so it is usable inside an FSM, example usage). The simulator reports timing violations and incorrect commands through the logger. The simulator is then used in SimSoC and can be run similarly to litex_sim. VerilatorLPDDR4Tests contains tests that run the Verilator-based simulations and check for errors/warnings and "Memtest OK". The simulator allows testing the implementation of all the modules besides the PHY implementation for a concrete FPGA (e.g. S7LPDDR4PHY).

You should be able to run the simulator using e.g.

python litedram/phy/lpddr4/simsoc.py --log-level info --finish-after-memtest --double-rate-phy --l2-size 0

or with tracing enabled (which will also generate a GTKWave savefile for viewing the signals), e.g.

python litedram/phy/lpddr4/simsoc.py --log-level info --finish-after-memtest --trace --trace-fst --gtkw-savefile

Log level can be controlled in more fine-grained manner by using e.g. --log-level cmd=info,data=debug.

The simulation will even perform read leveling, but in essence it only changes the bitslip; it fakes having delays, and in init.py we set #define SDRAM_PHY_DELAYS 1.

An example (partial) simulation log can look like:

[           50000 ps] [INFO] RESET released
[           50000 ps] [WARN] tINIT1 violated: RESET deasserted too fast
[           50000 ps] [INFO] CKE rising edge
[           50000 ps] [WARN] tINIT3 violated: CKE set HIGH too fast after RESET being released
[          100000 ps] [INFO] FSM reset
--========== Initialization ============--
Initializing SDRAM @0x40000000...
Switching SDRAM to software control.
[      2000052500 ps] [INFO] FSM: RESET -> EXIT-PD
[      2002055000 ps] [INFO] FSM: EXIT-PD -> MRW
[      2199950000 ps] [INFO] RESET asserted
[      2199950000 ps] [INFO] CKE falling edge
[      2205390000 ps] [INFO] RESET released
[     98205990000 ps] [INFO] CKE rising edge
[     98302540000 ps] [INFO] MRW: MR[ 1] = 0x14
[     98351300000 ps] [INFO] MRW: MR[ 2] = 0x09
[     98400160000 ps] [INFO] MRW: MR[11] = 0x00
[     98448960000 ps] [INFO] MPC: ZQC-START
[     98448962500 ps] [INFO] FSM: MRW -> ZQC
[     98497720000 ps] [INFO] MPC: ZQC-LATCH
[     98497722500 ps] [INFO] FSM: ZQC -> NORMAL
Read leveling:
  m0, b0: |[     98731720000 ps] [INFO] ACT: bank=0 row=     0
[     98754895000 ps] [INFO] MASKED-WRITE: bank=0 row=     0 col=   0
[     98763315000 ps] [INFO] READ: bank=0 row=     0 col=   0
[     98767340000 ps] [INFO] PRE: bank = 0
[     98802380000 ps] [INFO] ACT: bank=0 row=     0
[     98825555000 ps] [INFO] MASKED-WRITE: bank=0 row=     0 col=   0
[     98833975000 ps] [INFO] READ: bank=0 row=     0 col=   0
[     98838000000 ps] [INFO] PRE: bank = 0

Further notes

  1. You will see 2 warnings about timing violations in the simulation. This is because LiteDRAM holds reset_n=1 constantly. To perform a proper reset we manually force a second reset. I also make the assumption that the power supply is up for at least 200us before the bitstream is loaded (which effectively releases DRAM reset). This is needed to satisfy the tINIT1 timing.

  2. There is also a "catch-all" file utils.py which contains some small functions/modules. I think it might be good to review these and decide which ones could be put in a more general place.

  3. We could include clk_freq in the TimingSettings class to be able to define delays in init.py more precisely.
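
     For example, with clk_freq available, a delay given in nanoseconds could be turned into a cycle count along these lines (hypothetical helper, not part of the PR):

         import math

         def ns_to_cycles(t_ns, clk_freq):
             # round up so the resulting delay is never shorter than requested
             return math.ceil(t_ns * 1e-9 * clk_freq)

         # e.g. the 200 us (200000 ns) tINIT1 wait at a 100 MHz sys clock:
         assert ns_to_cycles(200_000, 100e6) == 20_000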

mithro commented 3 years ago

@jedrzejboczar - Could you make sure your excellent info at https://github.com/enjoy-digital/litedram/pull/224#issuecomment-778236507 is in the documentation somewhere?

jedrzejboczar commented 3 years ago

@mithro Ok, it is already partially included in the docstrings, but I'll make sure to include the parts that are missing. Here I wanted to summarize it all in a single place to ease review/integration.

jedrzejboczar commented 3 years ago

I think this PR should be ready for review now.

I tried to squash the commits in this PR to make the list a bit smaller, leaving mostly the newest commits. If you think it makes sense it could even be squashed into a single commit.

I also split some small changes into separate PRs with reasons for the changes:

The previous comment with the code description https://github.com/enjoy-digital/litedram/pull/224#issuecomment-778236507 is still mostly relevant. Some of the more important things to consider are listed below.

Series 7 PHY variants: S7LPDDR4PHY now comes in different per-family variants, just as is done with S7DDRPHY. Although we were mostly using K7LPDDR4PHY (we have a Kintex-7 on the test board), we have also verified that the A7LPDDR4PHY variant works.

WRITE vs MASKED-WRITE: LPDDR4 has separate WRITE/MASKED-WRITE commands. MASKED-WRITE has a larger tCCD, but sometimes (e.g. with only a single port to the L2 cache) masking is not needed. Currently the PHY can be configured to swap between these write commands at runtime by passing a Signal as the masked_write argument, which is then used here. There is however no way to dynamically swap tCCD, so using WRITE makes little sense as the larger tCCD must be used anyway.
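
A hedged sketch of how that runtime selection can be wired up (class and signal names are assumed for illustration, not the exact API):

    from migen import Module, Signal

    class _WriteCommandSelectSketch(Module):
        # masked_write may be a plain bool (fixed at build time) or a Signal
        # (e.g. driven from a CSR) selecting WRITE vs MASKED-WRITE at runtime.
        def __init__(self, masked_write=True):
            self.use_masked_write = Signal()
            if isinstance(masked_write, Signal):
                self.comb += self.use_masked_write.eq(masked_write)
            else:
                self.comb += self.use_masked_write.eq(int(bool(masked_write)))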

ZQCS: In LPDDR4 we need two separate commands to perform ZQCS. This is not possible with the current Refresher implementation. We would need some way of swapping/extending the Refresher class when the PHY is for LPDDR4.

PHY latency: The LPDDR4 PHY has 8 phases, but we usually cannot do 16:1 (8:1 DDR) serialization directly with hardware blocks. The S7 PHY derives from DoubleRateLPDDR4PHY, which implements the intermediate serialization stage, but it introduces (de-)serialization latency that is larger than actually needed. Here serialization adds 1 sys clk and deserialization adds 2, but it should be possible to introduce just 1/2 sys clk (1 sys2x clk) of latency during serialization.

Commands encoding: The best place to understand how DFI commands are translated to LPDDR4 commands is to check the ca_addressing test and commands.py (especially this part).

Common utilities: The whole utils.py file contains code that could potentially be reused elsewhere, so maybe it would be better to move it to some common directory, but to simplify the PR I put it in litedram/phy/lpddr4 for now. Of these:

Results on hardware: We tested the PHY on https://github.com/antmicro/lpddr4-test-board with the following target. After this PR gets merged we will create a PR to litex-boards (but the current lpddr4_test_board target would need some cleanup, as it was mostly used for debugging the PHY). From our tests the LPDDR4 PHY now runs without problems at 800 MT/s:

        __   _ __      _  __
       / /  (_) /____ | |/_/
      / /__/ / __/ -_)>  <
     /____/_/\__/\__/_/|_|
   Build your hardware, easily!

 (c) Copyright 2012-2020 Enjoy-Digital
 (c) Copyright 2007-2015 M-Labs

 BIOS built on Mar 25 2021 12:54:34
 BIOS CRC passed (6e959d20)

 Migen git sha1: 40b1092
 LiteX git sha1: bea82efc

--=============== SoC ==================--
CPU:            VexRiscv_Lite @ 50MHz
BUS:            WISHBONE 32-bit @ 4GiB
CSR:            32-bit data
ROM:            64KiB
SRAM:           8KiB
L2:             0KiB
SDRAM:          524288KiB 16-bit @ 800MT/s (CL-10 CWL-6)

--========== Initialization ============--
Initializing SDRAM @0x40000000...
Switching SDRAM to software control.
Write leveling:
  tCK/4 taps: 8
  Cmd/Clk scan (0-16)
  |1111111111100000| best: 0
  Setting Cmd/Clk delay to 0 taps.
  Data scan:
  m0: |000000000011111111111111| delay: 10
  m1: |000000000011111111111111| delay: 10
Write latency calibration:
m0:0 m1:0
Read leveling:
  m0, b0: |00000000000000000000000000000000| delays: -
  m0, b1: |00000000000000000000000000000000| delays: -
  m0, b2: |00000000000000000000000000000000| delays: -
  m0, b3: |00000000000000000000000000000000| delays: -
  m0, b4: |00000000000000000000000000000000| delays: -
  m0, b5: |00000000000000000000000000000000| delays: -
  m0, b6: |00000000000000000000000000000000| delays: -
  m0, b7: |00000000000000000000000000000000| delays: -
  m0, b8: |11111000000000000000000000000000| delays: 02+-02
  m0, b9: |00000000000011111111000000000000| delays: 16+-04
  m0, b10: |00000000000000000000000000000111| delays: 30+-01
  m0, b11: |00000000000000000000000000000000| delays: -
  m0, b12: |00000000000000000000000000000000| delays: -
  m0, b13: |00000000000000000000000000000000| delays: -
  m0, b14: |00000000000000000000000000000000| delays: -
  m0, b15: |00000000000000000000000000000000| delays: -
  best: m0, b09 delays: 16+-03
  m1, b0: |00000000000000000000000000000000| delays: -
  m1, b1: |00000000000000000000000000000000| delays: -
  m1, b2: |00000000000000000000000000000000| delays: -
  m1, b3: |00000000000000000000000000000000| delays: -
  m1, b4: |00000000000000000000000000000000| delays: -
  m1, b5: |00000000000000000000000000000000| delays: -
  m1, b6: |00000000000000000000000000000000| delays: -
  m1, b7: |00000000000000000000000000000000| delays: -
  m1, b8: |11111000000000000000000000000000| delays: 02+-02
  m1, b9: |00000000000000111111100000000000| delays: 17+-03
  m1, b10: |00000000000000000000000000000011| delays: 31+-01
  m1, b11: |00000000000000000000000000000000| delays: -
  m1, b12: |00000000000000000000000000000000| delays: -
  m1, b13: |00000000000000000000000000000000| delays: -
  m1, b14: |00000000000000000000000000000000| delays: -
  m1, b15: |00000000000000000000000000000000| delays: -
  best: m1, b09 delays: 17+-04
Switching SDRAM to hardware control.
Memtest at 0x40000000 (2MiB)...
  Write: 0x40000000-0x40200000 2MiB
   Read: 0x40000000-0x40200000 2MiB
Memtest OK
Memspeed at 0x40000000 (2MiB)...
  Write speed: 15MiB/s
   Read speed: 7MiB/s

jedrzejboczar commented 3 years ago

Thanks, I'll update the PR with a README that will include the code description from the comment.