MSI stalls on Ultrascale+

smunaut commented 1 year ago

So I was getting in a situation where MSI would just stop being delivered to the host. Looking around I found a comment in litex soc integration code adding some WaitTimer thing with a comment :

# FIXME: On Ultrascale/Ultrascale+ limit rate of IRQs to 1MHz (to prevent issue with
# IRQs stalled)"

Well this didn't prevent them for me.

I dug into the problem and found the root cause.

The problem is in the interface to the pcie_usp_support and what it expects.

i_cfg_interrupt_msi_int_valid   = cfg_msi.valid,
i_cfg_interrupt_msi_int         = cfg_msi.dat,
o_cfg_interrupt_msi_sent        = cfg_msi.ready,
o_cfg_interrupt_msi_fail        = Open(),

So the problem is that the pcie_usp_support internally will detect RISING EDGES on i_cfg_interrupt_msi_int_valid and submit interrupt request to the Xilinx core when they occur. Which is a big problem because if the cross clocking FIFO ever has more than 1 entry in it, after a transfer occurs, valid will STAY high ... and the support code will never submit another request because it saw no rising edge which means ready will always be low ... and the CDC will fill up ... and IRQ are stalled.

As a side note if o_cfg_interrupt_msi_fail ever occur, it will also probably stall too.

So what I ended up doing is this :

diff --git a/litepcie/phy/xilinx_usp_gen3_x8/pcie_usp_support.v b/litepcie/phy/xilinx_usp_gen3_x8/pcie_usp_support.v
index be2c151..8b2e1ac 100644
--- a/litepcie/phy/xilinx_usp_gen3_x8/pcie_usp_support.v
+++ b/litepcie/phy/xilinx_usp_gen3_x8/pcie_usp_support.v
@@ -858,7 +858,7 @@ module pcie_support # (
   wire            cfg_interrupt_msi_int_valid_edge = cfg_interrupt_msi_int_valid_sh == 2'b01;
   always @(posedge user_clk_out)
       if (user_reset_out) cfg_interrupt_msi_int_valid_sh <= 2'd0;
-      else cfg_interrupt_msi_int_valid_sh <= {cfg_interrupt_msi_int_valid_sh[0], cfg_interrupt_msi_int_valid};
+      else cfg_interrupt_msi_int_valid_sh <= {cfg_interrupt_msi_int_valid_sh[0], cfg_interrupt_msi_int_valid & ~(cfg_interrupt_msi_sent | cfg_interrupt_msi_fail)};

   //latch int_enc
   reg [31:0]      cfg_interrupt_msi_int_enc_lat = 32'b0;

This basically forces the re-detection of a rising edge if valid is still high after a 'sent' or 'fail' response.

That seems to have fixed the issue for me.

enjoy-digital commented 1 year ago

Thanks a lot @smunaut! You indeed probably found the root cause, I'll have a closer look at it and apply the changes.

enjoy-digital commented 1 year ago

The changes have been applied to LitePCIe:

And LiteX:

https://github.com/enjoy-digital/litepcie/commit/300fc90a8aaf2c1cc4fe4774db5cda5cefb5e9a5

BTW, I've started cleaning up the Ultrascale(+) support in https://github.com/enjoy-digital/litepcie/commits/xilinx_us_cleanup but still have a bit of work to do. I was also planning to revisit this part of the code while doing so.

Thanks a lot @smunaut!

enjoy-digital / litepcie

MSI stalls on Ultrascale+ #119