Closed qianyich closed 7 months ago
@zhguanw-amd The problem in network_systolic_mm has been resolved. I alternated between network_systolic_mm and the read test: first network_systolic_mm, then the read test, then network_systolic_mm again, and finally the read test, all with no issues.
Right after that, I ran the write test and it passed, and the send_recv test also passed. At that point I thought everything was fine, but when I ran the read test again I got a result mismatch error and the DMA engine broke (confirmed broken on both machines by the DMA test). dmesg reported no error messages until I ran the DMA test; all I can see in the log is that the last read test failed with a result mismatch. Therefore, the bug could be in either the write test or the send_recv test.
I rebooted after this failure. The DMA engine worked again (yes, I ran the DMA test to verify it) and the network_systolic_mm test passed again, but the read test failed with the warning below. I guess this warning is caused by either the write or the send_recv test run before the reboot.
Warning: CQHEADi and SQPIi for QP2 are mismatched
***** QP2 FATAL RECOVERY *****
TIMEOUT: CQHEADi:0x0 and SQPIi:0x1 are different
dmesg output before the reboot, after running the write and send_recv tests:
[ 516.463259] IPv6: ADDRCONF(NETDEV_UP): enp59s0: link is not ready
[ 516.463268] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0: link becomes ready
[ 644.544157] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 650.451113] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 704.906296] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 740.446916] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 760.696267] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 846.452736] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 894.941990] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 915.548327] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 1034.975745] onic:qdma_request_wait_for_cmpl: qdma3b000-MM-67: req 0x00000000eececbd4, W,65536000,0/65536000,0x0, done 0, err 0, tm 10000.
[ 1034.988096] onic:qdma_descq_dump: qdma3b000-MM-67: 0x43/0x43, desc sz 1024/1022, pidx 641, cidx 640
[ 1034.988429] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
[ 1037.802323] onic:error_intr_handler: Error IRQ fired on Funtion#0: index=7, vector=272
[ 1037.802330] eqdma_hw_error_process: Global Err Reg(0x248) = 0x4
[ 1037.802332] addr = 0x00000254 val = 0x00100000
[ 1037.802339] GLBL_DSC_ERR_STS 0x254 0x100000 1048576
[ 1037.802341] GLBL_DSC_ERR_STS_RSVD_1 [31,26] 0x0
[ 1037.802343] GLBL_DSC_ERR_STS_PORT_ID [ 25] 0x0
[ 1037.802345] GLBL_DSC_ERR_STS_SBE [ 24] 0x0
[ 1037.802346] GLBL_DSC_ERR_STS_DBE [ 23] 0x0
[ 1037.802347] GLBL_DSC_ERR_STS_RQ_CANCEL [ 22] 0x0
[ 1037.802348] GLBL_DSC_ERR_STS_DSC [ 21] 0x0
[ 1037.802350] GLBL_DSC_ERR_STS_DMA [ 20] 0x1
[ 1037.802351] GLBL_DSC_ERR_STS_FLR_CANCEL [ 19] 0x0
[ 1037.802353] GLBL_DSC_ERR_STS_RSVD_2 [18,17] 0x0
[ 1037.802354] GLBL_DSC_ERR_STS_DAT_POISON [ 16] 0x0
[ 1037.802355] GLBL_DSC_ERR_STS_TIMEOUT [ 9] 0x0
[ 1037.802357] GLBL_DSC_ERR_STS_FLR [ 8] 0x0
[ 1037.802358] GLBL_DSC_ERR_STS_TAG [ 6] 0x0
[ 1037.802359] GLBL_DSC_ERR_STS_ADDR [ 5] 0x0
[ 1037.802361] GLBL_DSC_ERR_STS_PARAM [ 4] 0x0
[ 1037.802362] GLBL_DSC_ERR_STS_BCNT [ 3] 0x0
[ 1037.802363] GLBL_DSC_ERR_STS_UR_CA [ 2] 0x0
[ 1037.802364] GLBL_DSC_ERR_STS_POISON [ 1] 0x0
[ 1037.802370] GLBL_DSC_ERR_LOG0 0x25c 0xc0000043 -1073741757
[ 1037.802371] GLBL_DSC_ERR_LOG0_VALID [ 31] 0x1
[ 1037.802373] GLBL_DSC_ERR_LOG0_SEL [ 30] 0x1
[ 1037.802374] GLBL_DSC_ERR_LOG0_RSVD_1 [29,13] 0x0
[ 1037.802376] GLBL_DSC_ERR_LOG0_QID [12, 0] 0x43
[ 1037.802381] GLBL_DSC_ERR_LOG1 0x260 0x280014 2621460
[ 1037.802382] GLBL_DSC_ERR_LOG1_RSVD_1 [31,28] 0x0
[ 1037.802384] GLBL_DSC_ERR_LOG1_CIDX [27,12] 0x280
[ 1037.802385] GLBL_DSC_ERR_LOG1_RSVD_2 [11, 9] 0x0
[ 1037.802387] GLBL_DSC_ERR_LOG1_SUB_TYPE [ 8, 5] 0x0
[ 1037.802389] GLBL_DSC_ERR_LOG1_ERR_TYPE [ 4, 0] 0x14
[ 1037.802393] GLBL_DSC_DBG_DAT0 0x270 0x0 0
[ 1037.802395] GLBL_DSC_DAT0_RSVD_1 [31,30] 0x0
[ 1037.802396] GLBL_DSC_DAT0_CTXT_ARB_DIR [ 29] 0x0
[ 1037.802398] GLBL_DSC_DAT0_CTXT_ARB_QID [28,17] 0x0
[ 1037.802399] GLBL_DSC_DAT0_CTXT_ARB_REQ [16,12] 0x0
[ 1037.802400] GLBL_DSC_DAT0_IRQ_FIFO_FL [ 11] 0x0
[ 1037.802402] GLBL_DSC_DAT0_TMSTALL [ 10] 0x0
[ 1037.802403] GLBL_DSC_DAT0_RRQ_STALL [ 9, 8] 0x0
[ 1037.802405] GLBL_DSC_DAT0_RCP_FIFO_SPC_STALL [ 7, 6] 0x0
[ 1037.802406] GLBL_DSC_DAT0_RRQ_FIFO_SPC_STALL [ 5, 4] 0x0
[ 1037.802408] GLBL_DSC_DAT0_FAB_MRKR_RSP_STALL [ 3, 2] 0x0
[ 1037.802409] GLBL_DSC_DAT0_DSC_OUT_STALL [ 1, 0] 0x0
[ 1037.802413] GLBL_DSC_DBG_DAT1 0x274 0x0 0
[ 1037.802415] GLBL_DSC_DAT1_RSVD_1 [31,28] 0x0
[ 1037.802416] GLBL_DSC_DAT1_EVT_SPC_C2H [27,22] 0x0
[ 1037.802418] GLBL_DSC_DAT1_EVT_SP_H2C [21,16] 0x0
[ 1037.802420] GLBL_DSC_DAT1_DSC_SPC_C2H [15, 8] 0x0
[ 1037.802421] GLBL_DSC_DAT1_DSC_SPC_H2C [ 7, 0] 0x0
[ 1037.802426] GLBL_DSC_ERR_LOG2 0x27c 0x2800280 41943680
[ 1037.802428] GLBL_DSC_ERR_LOG2_OLD_PIDX [31,16] 0x280
[ 1037.802429] GLBL_DSC_ERR_LOG2_NEW_PIDX [15, 0] 0x280
[ 1037.802431] eqdma_hw_error_process detected DMA engine error
[ 1037.815936] onic:error_intr_handler: Error IRQ fired on Funtion#0: index=7, vector=272
[ 1037.815942] eqdma_hw_error_process: Global Err Reg(0x248) = 0x4
[ 1037.815944] addr = 0x00000254 val = 0x00100000
[ 1037.815951] GLBL_DSC_ERR_STS 0x254 0x100000 1048576
[ 1037.815953] GLBL_DSC_ERR_STS_RSVD_1 [31,26] 0x0
[ 1037.815955] GLBL_DSC_ERR_STS_PORT_ID [ 25] 0x0
[ 1037.815957] GLBL_DSC_ERR_STS_SBE [ 24] 0x0
[ 1037.815958] GLBL_DSC_ERR_STS_DBE [ 23] 0x0
[ 1037.815959] GLBL_DSC_ERR_STS_RQ_CANCEL [ 22] 0x0
[ 1037.815961] GLBL_DSC_ERR_STS_DSC [ 21] 0x0
[ 1037.815962] GLBL_DSC_ERR_STS_DMA [ 20] 0x1
[ 1037.815963] GLBL_DSC_ERR_STS_FLR_CANCEL [ 19] 0x0
[ 1037.815965] GLBL_DSC_ERR_STS_RSVD_2 [18,17] 0x0
[ 1037.815966] GLBL_DSC_ERR_STS_DAT_POISON [ 16] 0x0
[ 1037.815968] GLBL_DSC_ERR_STS_TIMEOUT [ 9] 0x0
[ 1037.815969] GLBL_DSC_ERR_STS_FLR [ 8] 0x0
[ 1037.815970] GLBL_DSC_ERR_STS_TAG [ 6] 0x0
[ 1037.815971] GLBL_DSC_ERR_STS_ADDR [ 5] 0x0
[ 1037.815973] GLBL_DSC_ERR_STS_PARAM [ 4] 0x0
[ 1037.815974] GLBL_DSC_ERR_STS_BCNT [ 3] 0x0
[ 1037.815975] GLBL_DSC_ERR_STS_UR_CA [ 2] 0x0
[ 1037.815976] GLBL_DSC_ERR_STS_POISON [ 1] 0x0
[ 1037.815982] GLBL_DSC_ERR_LOG0 0x25c 0xc0000043 -1073741757
[ 1037.815983] GLBL_DSC_ERR_LOG0_VALID [ 31] 0x1
[ 1037.815985] GLBL_DSC_ERR_LOG0_SEL [ 30] 0x1
[ 1037.815986] GLBL_DSC_ERR_LOG0_RSVD_1 [29,13] 0x0
[ 1037.815988] GLBL_DSC_ERR_LOG0_QID [12, 0] 0x43
[ 1037.815993] GLBL_DSC_ERR_LOG1 0x260 0x280014 2621460
[ 1037.815994] GLBL_DSC_ERR_LOG1_RSVD_1 [31,28] 0x0
[ 1037.815996] GLBL_DSC_ERR_LOG1_CIDX [27,12] 0x280
[ 1037.815997] GLBL_DSC_ERR_LOG1_RSVD_2 [11, 9] 0x0
[ 1037.815999] GLBL_DSC_ERR_LOG1_SUB_TYPE [ 8, 5] 0x0
[ 1037.816000] GLBL_DSC_ERR_LOG1_ERR_TYPE [ 4, 0] 0x14
[ 1037.816005] GLBL_DSC_DBG_DAT0 0x270 0x0 0
[ 1037.816006] GLBL_DSC_DAT0_RSVD_1 [31,30] 0x0
[ 1037.816008] GLBL_DSC_DAT0_CTXT_ARB_DIR [ 29] 0x0
[ 1037.816009] GLBL_DSC_DAT0_CTXT_ARB_QID [28,17] 0x0
[ 1037.816022] GLBL_DSC_DAT0_CTXT_ARB_REQ [16,12] 0x0
[ 1037.816022] GLBL_DSC_DAT0_IRQ_FIFO_FL [ 11] 0x0
[ 1037.816023] GLBL_DSC_DAT0_TMSTALL [ 10] 0x0
[ 1037.816023] GLBL_DSC_DAT0_RRQ_STALL [ 9, 8] 0x0
[ 1037.816024] GLBL_DSC_DAT0_RCP_FIFO_SPC_STALL [ 7, 6] 0x0
[ 1037.816025] GLBL_DSC_DAT0_RRQ_FIFO_SPC_STALL [ 5, 4] 0x0
[ 1037.816025] GLBL_DSC_DAT0_FAB_MRKR_RSP_STALL [ 3, 2] 0x0
[ 1037.816026] GLBL_DSC_DAT0_DSC_OUT_STALL [ 1, 0] 0x0
[ 1037.816029] GLBL_DSC_DBG_DAT1 0x274 0x0 0
[ 1037.816030] GLBL_DSC_DAT1_RSVD_1 [31,28] 0x0
[ 1037.816030] GLBL_DSC_DAT1_EVT_SPC_C2H [27,22] 0x0
[ 1037.816031] GLBL_DSC_DAT1_EVT_SP_H2C [21,16] 0x0
[ 1037.816032] GLBL_DSC_DAT1_DSC_SPC_C2H [15, 8] 0x0
[ 1037.816032] GLBL_DSC_DAT1_DSC_SPC_H2C [ 7, 0] 0x0
[ 1037.816036] GLBL_DSC_ERR_LOG2 0x27c 0x2800280 41943680
[ 1037.816036] GLBL_DSC_ERR_LOG2_OLD_PIDX [31,16] 0x280
[ 1037.816037] GLBL_DSC_ERR_LOG2_NEW_PIDX [15, 0] 0x280
[ 1037.816037] eqdma_hw_error_process detected DMA engine error
[ 1037.829666] onic:error_intr_handler: Error IRQ fired on Funtion#0: index=7, vector=272
[ 1037.829673] eqdma_hw_error_process: Global Err Reg(0x248) = 0x4
[ 1037.829676] addr = 0x00000254 val = 0x00100000
[ 1037.829682] GLBL_DSC_ERR_STS 0x254 0x100000 1048576
[ 1037.829684] GLBL_DSC_ERR_STS_RSVD_1 [31,26] 0x0
[ 1037.829686] GLBL_DSC_ERR_STS_PORT_ID [ 25] 0x0
[ 1037.829688] GLBL_DSC_ERR_STS_SBE [ 24] 0x0
[ 1037.829689] GLBL_DSC_ERR_STS_DBE [ 23] 0x0
[ 1037.829690] GLBL_DSC_ERR_STS_RQ_CANCEL [ 22] 0x0
[ 1037.829692] GLBL_DSC_ERR_STS_DSC [ 21] 0x0
[ 1037.829693] GLBL_DSC_ERR_STS_DMA [ 20] 0x1
[ 1037.829694] GLBL_DSC_ERR_STS_FLR_CANCEL [ 19] 0x0
[ 1037.829696] GLBL_DSC_ERR_STS_RSVD_2 [18,17] 0x0
[ 1037.829697] GLBL_DSC_ERR_STS_DAT_POISON [ 16] 0x0
[ 1037.829699] GLBL_DSC_ERR_STS_TIMEOUT [ 9] 0x0
[ 1037.829700] GLBL_DSC_ERR_STS_FLR [ 8] 0x0
[ 1037.829701] GLBL_DSC_ERR_STS_TAG [ 6] 0x0
[ 1037.829703] GLBL_DSC_ERR_STS_ADDR [ 5] 0x0
[ 1037.829704] GLBL_DSC_ERR_STS_PARAM [ 4] 0x0
[ 1037.829705] GLBL_DSC_ERR_STS_BCNT [ 3] 0x0
[ 1037.829706] GLBL_DSC_ERR_STS_UR_CA [ 2] 0x0
[ 1037.829708] GLBL_DSC_ERR_STS_POISON [ 1] 0x0
[ 1037.829713] GLBL_DSC_ERR_LOG0 0x25c 0xc0000043 -1073741757
[ 1037.829714] GLBL_DSC_ERR_LOG0_VALID [ 31] 0x1
[ 1037.829716] GLBL_DSC_ERR_LOG0_SEL [ 30] 0x1
[ 1037.829717] GLBL_DSC_ERR_LOG0_RSVD_1 [29,13] 0x0
[ 1037.829719] GLBL_DSC_ERR_LOG0_QID [12, 0] 0x43
[ 1037.829724] GLBL_DSC_ERR_LOG1 0x260 0x280014 2621460
[ 1037.829725] GLBL_DSC_ERR_LOG1_RSVD_1 [31,28] 0x0
[ 1037.829727] GLBL_DSC_ERR_LOG1_CIDX [27,12] 0x280
[ 1037.829728] GLBL_DSC_ERR_LOG1_RSVD_2 [11, 9] 0x0
[ 1037.829730] GLBL_DSC_ERR_LOG1_SUB_TYPE [ 8, 5] 0x0
[ 1037.829731] GLBL_DSC_ERR_LOG1_ERR_TYPE [ 4, 0] 0x14
[ 1037.829736] GLBL_DSC_DBG_DAT0 0x270 0x0 0
[ 1037.829737] GLBL_DSC_DAT0_RSVD_1 [31,30] 0x0
[ 1037.829739] GLBL_DSC_DAT0_CTXT_ARB_DIR [ 29] 0x0
[ 1037.829740] GLBL_DSC_DAT0_CTXT_ARB_QID [28,17] 0x0
[ 1037.829742] GLBL_DSC_DAT0_CTXT_ARB_REQ [16,12] 0x0
[ 1037.829743] GLBL_DSC_DAT0_IRQ_FIFO_FL [ 11] 0x0
[ 1037.829744] GLBL_DSC_DAT0_TMSTALL [ 10] 0x0
[ 1037.829746] GLBL_DSC_DAT0_RRQ_STALL [ 9, 8] 0x0
[ 1037.829747] GLBL_DSC_DAT0_RCP_FIFO_SPC_STALL [ 7, 6] 0x0
[ 1037.829749] GLBL_DSC_DAT0_RRQ_FIFO_SPC_STALL [ 5, 4] 0x0
[ 1037.829750] GLBL_DSC_DAT0_FAB_MRKR_RSP_STALL [ 3, 2] 0x0
[ 1037.829752] GLBL_DSC_DAT0_DSC_OUT_STALL [ 1, 0] 0x0
[ 1037.829767] GLBL_DSC_DBG_DAT1 0x274 0x0 0
[ 1037.829768] GLBL_DSC_DAT1_RSVD_1 [31,28] 0x0
[ 1037.829768] GLBL_DSC_DAT1_EVT_SPC_C2H [27,22] 0x0
[ 1037.829769] GLBL_DSC_DAT1_EVT_SP_H2C [21,16] 0x0
[ 1037.829769] GLBL_DSC_DAT1_DSC_SPC_C2H [15, 8] 0x0
[ 1037.829770] GLBL_DSC_DAT1_DSC_SPC_H2C [ 7, 0] 0x0
[ 1037.829773] GLBL_DSC_ERR_LOG2 0x27c 0x2800280 41943680
[ 1037.829774] GLBL_DSC_ERR_LOG2_OLD_PIDX [31,16] 0x280
[ 1037.829774] GLBL_DSC_ERR_LOG2_NEW_PIDX [15, 0] 0x280
[ 1037.829775] eqdma_hw_error_process detected DMA engine error
[ 1037.843411] onic:error_intr_handler: Error IRQ fired on Funtion#0: index=7, vector=272
[ 1037.843418] eqdma_hw_error_process: Global Err Reg(0x248) = 0x4
[ 1037.843420] addr = 0x00000254 val = 0x00100000
[ 1037.843426] GLBL_DSC_ERR_STS 0x254 0x100000 1048576
[ 1037.843429] GLBL_DSC_ERR_STS_RSVD_1 [31,26] 0x0
[ 1037.843431] GLBL_DSC_ERR_STS_PORT_ID [ 25] 0x0
[ 1037.843432] GLBL_DSC_ERR_STS_SBE [ 24] 0x0
[ 1037.843433] GLBL_DSC_ERR_STS_DBE [ 23] 0x0
[ 1037.843435] GLBL_DSC_ERR_STS_RQ_CANCEL [ 22] 0x0
[ 1037.843436] GLBL_DSC_ERR_STS_DSC [ 21] 0x0
[ 1037.843437] GLBL_DSC_ERR_STS_DMA [ 20] 0x1
[ 1037.843439] GLBL_DSC_ERR_STS_FLR_CANCEL [ 19] 0x0
[ 1037.843440] GLBL_DSC_ERR_STS_RSVD_2 [18,17] 0x0
[ 1037.843442] GLBL_DSC_ERR_STS_DAT_POISON [ 16] 0x0
[ 1037.843443] GLBL_DSC_ERR_STS_TIMEOUT [ 9] 0x0
[ 1037.843444] GLBL_DSC_ERR_STS_FLR [ 8] 0x0
[ 1037.843446] GLBL_DSC_ERR_STS_TAG [ 6] 0x0
[ 1037.843447] GLBL_DSC_ERR_STS_ADDR [ 5] 0x0
[ 1037.843449] GLBL_DSC_ERR_STS_PARAM [ 4] 0x0
[ 1037.843450] GLBL_DSC_ERR_STS_BCNT [ 3] 0x0
[ 1037.843451] GLBL_DSC_ERR_STS_UR_CA [ 2] 0x0
[ 1037.843452] GLBL_DSC_ERR_STS_POISON [ 1] 0x0
[ 1037.843458] GLBL_DSC_ERR_LOG0 0x25c 0xc0000043 -1073741757
[ 1037.843459] GLBL_DSC_ERR_LOG0_VALID [ 31] 0x1
[ 1037.843461] GLBL_DSC_ERR_LOG0_SEL [ 30] 0x1
[ 1037.843462] GLBL_DSC_ERR_LOG0_RSVD_1 [29,13] 0x0
[ 1037.843464] GLBL_DSC_ERR_LOG0_QID [12, 0] 0x43
[ 1037.843469] GLBL_DSC_ERR_LOG1 0x260 0x280014 2621460
[ 1037.843470] GLBL_DSC_ERR_LOG1_RSVD_1 [31,28] 0x0
[ 1037.843472] GLBL_DSC_ERR_LOG1_CIDX [27,12] 0x280
[ 1037.843473] GLBL_DSC_ERR_LOG1_RSVD_2 [11, 9] 0x0
[ 1037.843475] GLBL_DSC_ERR_LOG1_SUB_TYPE [ 8, 5] 0x0
[ 1037.843476] GLBL_DSC_ERR_LOG1_ERR_TYPE [ 4, 0] 0x14
[ 1037.843481] GLBL_DSC_DBG_DAT0 0x270 0x0 0
[ 1037.843483] GLBL_DSC_DAT0_RSVD_1 [31,30] 0x0
[ 1037.843484] GLBL_DSC_DAT0_CTXT_ARB_DIR [ 29] 0x0
[ 1037.843485] GLBL_DSC_DAT0_CTXT_ARB_QID [28,17] 0x0
[ 1037.843487] GLBL_DSC_DAT0_CTXT_ARB_REQ [16,12] 0x0
[ 1037.843488] GLBL_DSC_DAT0_IRQ_FIFO_FL [ 11] 0x0
[ 1037.843490] GLBL_DSC_DAT0_TMSTALL [ 10] 0x0
[ 1037.843491] GLBL_DSC_DAT0_RRQ_STALL [ 9, 8] 0x0
[ 1037.843493] GLBL_DSC_DAT0_RCP_FIFO_SPC_STALL [ 7, 6] 0x0
[ 1037.843494] GLBL_DSC_DAT0_RRQ_FIFO_SPC_STALL [ 5, 4] 0x0
[ 1037.843496] GLBL_DSC_DAT0_FAB_MRKR_RSP_STALL [ 3, 2] 0x0
[ 1037.843497] GLBL_DSC_DAT0_DSC_OUT_STALL [ 1, 0] 0x0
[ 1037.843501] GLBL_DSC_DBG_DAT1 0x274 0x0 0
[ 1037.843503] GLBL_DSC_DAT1_RSVD_1 [31,28] 0x0
[ 1037.843504] GLBL_DSC_DAT1_EVT_SPC_C2H [27,22] 0x0
[ 1037.843506] GLBL_DSC_DAT1_EVT_SP_H2C [21,16] 0x0
[ 1037.843507] GLBL_DSC_DAT1_DSC_SPC_C2H [15, 8] 0x0
[ 1037.843509] GLBL_DSC_DAT1_DSC_SPC_H2C [ 7, 0] 0x0
[ 1037.843513] GLBL_DSC_ERR_LOG2 0x27c 0x2800280 41943680
[ 1037.843515] GLBL_DSC_ERR_LOG2_OLD_PIDX [31,16] 0x280
[ 1037.843516] GLBL_DSC_ERR_LOG2_NEW_PIDX [15, 0] 0x280
[ 1037.843518] eqdma_hw_error_process detected DMA engine error
[ 1048.031802] onic:qdma_request_wait_for_cmpl: qdma3b000-MM-67: req 0x00000000b58b1766, R,4190208,0/65536000,0x0, done 0, err 0, tm 10000.
[ 1048.044054] onic:qdma_descq_dump: qdma3b000-MM-67: 0x43/0x43, desc sz 1024/0, pidx 639, cidx 640
[ 1048.044464] onic 0000:3b:00.0: reconic-mm: Close onic_cdev.
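As a reading aid for the dump above, here is a minimal Python sketch that re-extracts the non-zero fields from the raw register values. The bit ranges are copied from the dmesg lines themselves; the `field` helper and the dictionary keys are names I made up for illustration and are not part of the onic driver. It does not interpret what ERR_TYPE 0x14 means, it only shows the arithmetic.

```python
def field(value, hi, lo):
    """Return bits [hi:lo] (inclusive) of a 32-bit register value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

# Raw register values copied from the dmesg dump above.
ERR_STS = 0x00100000   # GLBL_DSC_ERR_STS  (0x254)
ERR_LOG0 = 0xC0000043  # GLBL_DSC_ERR_LOG0 (0x25c)
ERR_LOG1 = 0x00280014  # GLBL_DSC_ERR_LOG1 (0x260)
ERR_LOG2 = 0x02800280  # GLBL_DSC_ERR_LOG2 (0x27c)

# Bit ranges as printed by the driver next to each field name.
decoded = {
    "ERR_STS_DMA": field(ERR_STS, 20, 20),
    "ERR_LOG0_VALID": field(ERR_LOG0, 31, 31),
    "ERR_LOG0_SEL": field(ERR_LOG0, 30, 30),
    "ERR_LOG0_QID": field(ERR_LOG0, 12, 0),
    "ERR_LOG1_CIDX": field(ERR_LOG1, 27, 12),
    "ERR_LOG1_SUB_TYPE": field(ERR_LOG1, 8, 5),
    "ERR_LOG1_ERR_TYPE": field(ERR_LOG1, 4, 0),
    "ERR_LOG2_OLD_PIDX": field(ERR_LOG2, 31, 16),
    "ERR_LOG2_NEW_PIDX": field(ERR_LOG2, 15, 0),
}

for name, val in decoded.items():
    print(f"{name:18s} = {val:#x} ({val})")
```

In decimal, QID 0x43 is 67, which matches the qdma3b000-MM-67 queue named in the qdma_request_wait_for_cmpl lines, and CIDX 0x280 is 640, which matches the cidx reported by qdma_descq_dump. The only status bit set is DMA (bit 20), and all four dumps carry the same values.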
Read test log on 192.100.51.1 before the reboot. This run does not show the QP2 fatal recovery. I tried a few more times, and the log started to show the QP2 CQ/SQ mismatch and fatal recovery, so I then rebooted the machines:
sudo env LD_LIBRARY_PATH=$LD_LIBRARY_PATH ./read -r 192.100.51.1 -i 192.100.52.1 -p /sys/bus/pci/devices/0000\:3b\:00.0/resource2 -z 128 -l host_mem -d /dev/reconic-mm -c -u 22222 -t 11111 --dst_qp 2 -g 2>&1 | tee client_debug.log
src_ip_str = 192.100.51.1
dst_ip_str = 192.100.52.1
Info: mac_addr_t = 00:0a:35:f7:81:e1
Info: PCIe resource file: /sys/bus/pci/devices/0000:3b:00.0/resource2
Info: QP allocated at: host_mem
Info: Device - /dev/reconic-mm
Info: src_ip = 192.100.51.1
Info: Found network interface: enp59s0
Info: mac_addr_t = 00:0a:35:54:60:02
Info: Creating rn_dev
/users/qianyich/RecoNIC/lib/reconic.c:301:create_rn_dev(): Info: scr(=4)) file open successfully
create_rn_dev - testing2
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6b400
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6b400000
Info: pre-allocated hugepage buffer vir addr = 0x7f93bf000000, physical addr = 0x2f6b400000
Info: Configuring 8 windows in QDMA AXI bridge BDF, each has 128GB mapping
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x16420, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x16424, bdf_addr_high=0x0
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x16430, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x16440, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x16444, bdf_addr_high=0x20
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x16450, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x16460, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x16464, bdf_addr_high=0x40
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x16470, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x16480, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x16484, bdf_addr_high=0x60
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x16490, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x164a0, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x164a4, bdf_addr_high=0x80
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x164b0, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x164c0, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x164c4, bdf_addr_high=0xa0
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x164d0, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x164e0, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x164e4, bdf_addr_high=0xc0
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x164f0, bdf_win_config=0xc2000000
/users/qianyich/RecoNIC/lib/reconic.c:198:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_LSB=0x16500, bdf_addr_low=0x0
/users/qianyich/RecoNIC/lib/reconic.c:199:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_ADDR_TRANSLATE_ADDR_MSB=0x16504, bdf_addr_high=0xe0
/users/qianyich/RecoNIC/lib/reconic.c:200:config_rn_dev_axib_bdf(): [BDF] AXIB_BDF_MAP_CONTROL_ADDR=0x16510, bdf_win_config=0xc2000000
Info: CREATE RDMA DEVICE
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6b400
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6b400000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93bf000000, physical addr = 2f6b400000, rn_dev->buffer_offset = 0x200000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6b600
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6b600000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93bf200000, physical addr = 2f6b600000, rn_dev->buffer_offset = 0x1200000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6a600
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6a600000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93c0200000, physical addr = 2f6a600000, rn_dev->buffer_offset = 0x1202000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6a602
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6a602000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93c0202000, physical addr = 2f6a602000, rn_dev->buffer_offset = 0x1212000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6a612
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6a612000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93c0212000, physical addr = 2f6a612000, rn_dev->buffer_offset = 0x1222000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
Info: OPEN RDMA DEVICE
/users/qianyich/RecoNIC/lib/rdma_api.c:186:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_DATBUFBA=0x600a0, value=0x6b600000
/users/qianyich/RecoNIC/lib/rdma_api.c:188:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_DATBUFBAMSB=0x600a4, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:190:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_DATBUFSZ=0x600a8, value=0x10001000
/users/qianyich/RecoNIC/lib/rdma_api.c:193:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_IPKTERRQBA=0x60088, value=0x6a600000
/users/qianyich/RecoNIC/lib/rdma_api.c:195:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_IPKTERRQBAMSB=0x6008c, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:197:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_ERRBUFSZ=0x60090, value=0x2000
/users/qianyich/RecoNIC/lib/rdma_api.c:200:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_ERRBUFBA=0x60060, value=0x6a602000
/users/qianyich/RecoNIC/lib/rdma_api.c:202:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_ERRBUFBAMSB=0x60064, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:204:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_ERRBUFSZ=0x60068, value=0x1000100
/users/qianyich/RecoNIC/lib/rdma_api.c:207:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_RESPERRPKTBA=0x600b0, value=0x6a612000
/users/qianyich/RecoNIC/lib/rdma_api.c:209:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_RESPERRPKTBAMSB=0x600b4, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:211:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_RESPERRSZ=0x600b8, value=0x10000
/users/qianyich/RecoNIC/lib/rdma_api.c:213:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_RESPERRSZMSB=0x600bc, value=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:217:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_INTEN=0x60180, value=0xff
/users/qianyich/RecoNIC/lib/rdma_api.c:221:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_MACXADDLSB=0x60010, value=0x35546002
/users/qianyich/RecoNIC/lib/rdma_api.c:223:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_MACXADDMSB=0x60014, value=0xa
/users/qianyich/RecoNIC/lib/rdma_api.c:227:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_IPV4XADD=0x60070, value=0xc0643301
/users/qianyich/RecoNIC/lib/rdma_api.c:230:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_XRNICCONF=0x60000, value=0x56ce0821
/users/qianyich/RecoNIC/lib/rdma_api.c:233:config_rdma_global_csr(): [Register] RN_RDMA_GCSR_XRNICADCONF=0x60004, value=0xa0004
Info: RDMA global control status registers are configured.
Info: rdma_dev opened
Info: ALLOCATE PD
/users/qianyich/RecoNIC/lib/rdma_api.c:254:allocate_rdma_pd(): [Register] RN_RDMA_PDT_PDPDNUM=0x40000, pd_num=0, value=0x0
Info: OPEN DEVICE FILE
Info: ALLOCATE RDMA QP
Allocating qp->sq
/users/qianyich/RecoNIC/lib/rdma_api.c:437:allocate_rdma_qp(): sq_size = 32768, cq_size = 2048, rq_size 262144, buf_location = host_mem
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6a622
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6a622000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93c0222000, physical addr = 2f6a622000, rn_dev->buffer_offset = 0x122a000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
Allocating qp->cq
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6a62a
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6a62a000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93c022a000, physical addr = 2f6a62a000, rn_dev->buffer_offset = 0x122a800
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
Allocating qp->rq
/users/qianyich/RecoNIC/lib/reconic.c:146:get_buffer_paddr(): Info: get_buffer_paddr - Page frame: 0x2f6a62b
/users/qianyich/RecoNIC/lib/reconic.c:151:get_buffer_paddr(): Info: get_buffer_paddr - distance from page boundary: 0x0
/users/qianyich/RecoNIC/lib/reconic.c:155:get_buffer_paddr(): Info: get_buffer_paddr - Physical address of buffer: 0x2f6a62b000
/users/qianyich/RecoNIC/lib/reconic.c:232:allocate_rdma_buffer(): Info: allocated host buffer vir addr = 0x7f93c022b000, physical addr = 2f6a62b000, rn_dev->buffer_offset = 0x126b000
/users/qianyich/RecoNIC/lib/reconic.c:233:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma host buffer
Info: queue pair setting is done! Configuring RDMA per-queu CSR registers
/users/qianyich/RecoNIC/lib/rdma_api.c:487:allocate_rdma_qp(): DEBUG: rdma_dev->rn_dev->axil_ctl = 0x7f93df1fa000, rdma_dev->axil_ctl = 0x7f93df1fa000
/users/qianyich/RecoNIC/lib/rdma_api.c:502:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_IPDESADDR1i=0x60360, qpid=2, value=0xc0643401
/users/qianyich/RecoNIC/lib/rdma_api.c:509:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_MACDESADDLSBi=0x60350, qpid=2, value=0x35f781e1
/users/qianyich/RecoNIC/lib/rdma_api.c:516:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_MACDESADDMSBi=0x60354, qpid=2, value=0xa
/users/qianyich/RecoNIC/lib/rdma_api.c:521:allocate_rdma_qp(): DEBUG: win_size_high = 0xff, win_size_low = 0xffffffff
/users/qianyich/RecoNIC/lib/rdma_api.c:539:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_SQBAi=0x60310, qpid=2, value=0x6a622000
/users/qianyich/RecoNIC/lib/rdma_api.c:546:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_SQBAMSBi=0x603c8, qpid=2, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:550:allocate_rdma_qp(): DEBUG: qp->sq->dma_addr = 0x2f6a622000, sq_addr_msb = 0x2f, sq_addr_lsb = 0x6a622000
/users/qianyich/RecoNIC/lib/rdma_api.c:568:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_CQBAi=0x60318, qpid=2, value=0x6a62a000
/users/qianyich/RecoNIC/lib/rdma_api.c:575:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_CQBAMSBi=0x603d0, qpid=2, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:579:allocate_rdma_qp(): DEBUG: qp->cq->dma_addr = 0x2f6a62a000, cq_addr_msb = 0x2f, cq_addr_lsb = 0x6a62a000
/users/qianyich/RecoNIC/lib/rdma_api.c:597:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_RQBAi=0x60308, qpid=2, value=0x6a62b000
/users/qianyich/RecoNIC/lib/rdma_api.c:604:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_RQBAMSBi=0x603c0, qpid=2, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:608:allocate_rdma_qp(): DEBUG: qp->rq->dma_addr = 0x2f6a62b000, rq_addr_msb = 0x2f, rq_addr_lsb = 0x6a62b000
/users/qianyich/RecoNIC/lib/rdma_api.c:617:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_CQDBADDi=0x60328, qpid=2, value=0x6b400000
/users/qianyich/RecoNIC/lib/rdma_api.c:624:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_CQDBADDMSBi=0x6032c, qpid=2, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:625:allocate_rdma_qp(): DEBUG: cq_cidb_addr = 0x2f6b400000
/users/qianyich/RecoNIC/lib/rdma_api.c:634:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_RQWPTRDBADDi=0x60320, qpid=2, value=0x6b400020
/users/qianyich/RecoNIC/lib/rdma_api.c:641:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_RQWPTRDBADDMSBi=0x60324, qpid=2, value=0x2f
/users/qianyich/RecoNIC/lib/rdma_api.c:642:allocate_rdma_qp(): DEBUG: rq_cidb_addr = 0x2f6b400020
/users/qianyich/RecoNIC/lib/rdma_api.c:651:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_DESTQPCONFi=0x60348, qpid=2, value=0x2
/users/qianyich/RecoNIC/lib/rdma_api.c:660:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_QDEPTHi=0x6033c, qpid=2, value=0x400040
/users/qianyich/RecoNIC/lib/rdma_api.c:707:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_QPCONFi=0x60300, qpid=2, value=0x200043d
/users/qianyich/RecoNIC/lib/rdma_api.c:721:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_QPADVCONFi=0x60304, qpid=2, value=0x12344000
/users/qianyich/RecoNIC/lib/rdma_api.c:730:allocate_rdma_qp(): [Register] RN_RDMA_QCSR_PDi=0x603b0, qpid=2, value=0x0
Info: allocate_rdma_qp - Successfully allocated a rdma qp
Info: CONFIGURE PSN
[Register] RN_RDMA_QCSR_LSTRQREQi=0x60344, qpid=2, value=0xa000abc
[Register] RN_RDMA_QCSR_SQPSNi=0x60340, qpid=2, value=0xabd
payload_size = 128, payload_size>>2 = 32
Info: Client is connecting to a remote server
Info: Client is connected to a remote server
Info: client received remote offset of A = 0xa350000000000000
/users/qianyich/RecoNIC/lib/reconic.c:256:allocate_rdma_buffer(): Info: allocated device buffer physical addr = a350000000000000, rn_dev->dev_buffer_offset = 0x80
/users/qianyich/RecoNIC/lib/reconic.c:258:allocate_rdma_buffer(): Info: allocate_rdma_buffer - successfully allocated rdma device buffer
Info: creating an RDMA read WQE for getting data
/users/qianyich/RecoNIC/lib/rdma_api.c:769:create_a_wqe(): Info: WQE mem_buffer = 0xa350000000000000, masked_mem_buffer = 0xa350000000000000
/users/qianyich/RecoNIC/lib/rdma_api.c:796:create_a_wqe(): [WQE] wrid=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:797:create_a_wqe(): [WQE] laddr_low=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:798:create_a_wqe(): [WQE] laddr_high=0xa3500000
/users/qianyich/RecoNIC/lib/rdma_api.c:799:create_a_wqe(): [WQE] length=0x80
/users/qianyich/RecoNIC/lib/rdma_api.c:800:create_a_wqe(): [WQE] opcode=0x4
/users/qianyich/RecoNIC/lib/rdma_api.c:801:create_a_wqe(): [WQE] remote_offset_low=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:802:create_a_wqe(): [WQE] remote_offset_high=0xa3500000
/users/qianyich/RecoNIC/lib/rdma_api.c:803:create_a_wqe(): [WQE] r_key=0x8
/users/qianyich/RecoNIC/lib/rdma_api.c:804:create_a_wqe(): [WQE] send_small_payload0=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:805:create_a_wqe(): [WQE] send_small_payload1=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:806:create_a_wqe(): [WQE] send_small_payload2=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:807:create_a_wqe(): [WQE] send_small_payload3=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:808:create_a_wqe(): [WQE] immdt_data=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:875:rdma_post_send(): DEBUG: Reading hardware SQPIi (0x60338) = 0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:876:rdma_post_send(): DEBUG: original qp->sq_pidb = 0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:882:rdma_post_send(): [Register] RN_RDMA_QCSR_SQPIi=0x60338, qpid=2, value=0x1
/users/qianyich/RecoNIC/lib/rdma_api.c:883:rdma_post_send(): DEBUG: Update hardware sq db idx from software = 1
/users/qianyich/RecoNIC/lib/rdma_api.c:884:rdma_post_send(): DEBUG: Reading hardware SQPIi (0x60338) = 0x1
/users/qianyich/RecoNIC/lib/rdma_api.c:844:poll_cq_cidb(): [Register] RN_RDMA_QCSR_CQHEADi=0x60330, qpid=2, value=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:846:poll_cq_cidb(): DEBUG: before polling: sq_cidb = 0; Polling CQ CIDB = 0
/users/qianyich/RecoNIC/lib/rdma_api.c:857:poll_cq_cidb(): DEBUG: after polling: sq_cidb = 0; Polling CQ CIDB = 1
Successfully sent an RDMA read operation
Info: Dump register values for debug purpose
Info: [RN_RDMA_GCSR_ERRBUFWPTR = 0x6006c] = 0x0
Info: [RN_RDMA_GCSR_IPKTERRQWPTR = 0x60094] = 0x0
Info: [RN_RDMA_GCSR_INSRRPKTCNT = 0x60100] = 0x20004
Info: [RN_RDMA_GCSR_INAMPKTCNT = 0x60104] = 0x1
Info: [RN_RDMA_GCSR_OUTIOPKTCNT = 0x60108] = 0x40000
Info: [RN_RDMA_GCSR_OUTAMPKTCNT = 0x6010c] = 0x1
Info: [RN_RDMA_GCSR_LSTINPKT = 0x60110] = 0xabd0210
Info: [RN_RDMA_GCSR_LSTOUTPKT = 0x60114] = 0x157a204
Info: [RN_RDMA_GCSR_ININVDUPCNT = 0x60118] = 0x0
Info: [RN_RDMA_GCSR_INNCKPKTSTS = 0x6011c] = 0x0
Info: [RN_RDMA_GCSR_OUTRNRPKTSTS = 0x60120] = 0x0
Info: [RN_RDMA_GCSR_WQEPROCSTS = 0x60124] = 0x12122000
Info: [RN_RDMA_GCSR_QPMSTS = 0x6012c] = 0x40002
Info: [RN_RDMA_GCSR_INALLDRPPKTCNT = 0x60130] = 0xe0000
Info: [RN_RDMA_GCSR_INNAKPKTCNT = 0x60134] = 0x0
Info: [RN_RDMA_GCSR_OUTNAKPKTCNT = 0x60138] = 0x0
Info: [RN_RDMA_GCSR_RESPHNDSTS = 0x6013c] = 0x10f02
Info: [RN_RDMA_GCSR_RETRYCNTSTS = 0x60140] = 0x0
Info: [RN_RDMA_GCSR_INCNPPKTCNT = 0x60174] = 0x0
Info: [RN_RDMA_GCSR_OUTCNPPKTCNT = 0x60178] = 0x0
Info: [RN_RDMA_GCSR_OUTRDRSPPKTCNT = 0x6017c] = 0x6
Info: [RN_RDMA_GCSR_INTSTS = 0x60184] = 0x10
Info: [RN_RDMA_GCSR_RQINTSTS1 = 0x60190] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS2 = 0x60194] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS3 = 0x60198] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS4 = 0x6019c] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS5 = 0x601a0] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS6 = 0x601a4] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS7 = 0x601a8] = 0x0
Info: [RN_RDMA_GCSR_RQINTSTS8 = 0x601ac] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS1 = 0x601b0] = 0x4
Info: [RN_RDMA_GCSR_CQINTSTS2 = 0x601b4] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS3 = 0x601b8] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS4 = 0x601bc] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS5 = 0x601c0] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS6 = 0x601c4] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS7 = 0x601c8] = 0x0
Info: [RN_RDMA_GCSR_CQINTSTS8 = 0x601cc] = 0x0
Info: [RN_RDMA_QCSR_CQHEADi = 0x60330] = 0x1
Info: [RN_RDMA_QCSR_STATSSNi = 0x60380] = 0x4
Info: [RN_RDMA_QCSR_STATMSNi = 0x60384] = 0x0
Info: [RN_RDMA_QCSR_STATQPi = 0x60388] = 0x1f0600
Info: [RN_RDMA_QCSR_STATCURSQPTRi = 0x6038c] = 0x1
Info: [RN_RDMA_QCSR_STATRESPSNi = 0x60390] = 0xabd
Info: [RN_RDMA_QCSR_STATRQBUFCAi = 0x60394] = 0x6a62b000
Info: [RN_RDMA_QCSR_STATWQEi = 0x60398] = 0x0
Info: [RN_RDMA_QCSR_STATRQPIDBi = 0x6039c] = 0x0
Info: [RN_RDMA_QCSR_STATRQBUFCAMSBi = 0x603d8] = 0x2f
Info: [RN_RDMA_QCSR_SQPIi = 0x60338] = 0x1
Info: All data has been received!
Info: buffer physical address is 0xa350000000000000
Info: Time spent 8.531000 usec, size = 128 bytes, Bandwidth = 0.120033 gigabits/sec
Info: The value of rc is 128
Info: CHECK RECEIVED DATA
Error: received data mismatched: recv[0]=541065216, sw_golden[0]=0
/users/qianyich/RecoNIC/lib/rdma_api.c:1088:destroy_rdma_qp(): [DEBUG] Destroying dev: 0x7f93df1fa000, RN_RDMA_QCSR_CQHEADi=0x60330, qpid=2, value=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:1093:destroy_rdma_qp(): [DEBUG] Destroying dev: 0x7f93df1fa000, RN_RDMA_QCSR_CQHEADi=0x60330, qpid=2, value=0x0
/users/qianyich/RecoNIC/lib/rdma_api.c:1101:destroy_rdma_qp(): [DEBUG] Destroying dev: 0x7f93df1fa000, RN_RDMA_QCSR_CQHEADi=0x60330, qpid=2, value=0x0
@zhguanw-amd I think that, at this point, the remaining failures are just bugs in the test programs. RecoNIC is a bit fragile: any small mistake in the code can trap the device in an erroneous state, and the painful part is that it cannot be recovered unless we reprogram the board. BTW, I think the hardware design is OK on the U280. We can merge it into the repo soon if you want.
Commenting out the if block in onic_main.c did work. Just letting you know in case you want to change it in the main branch.
@qianyich The QP fatal recovery is caused by a bug in the RDMA IP, which is fixed in version 4.0. I'll push the newer version when I have time.
The QDMA issue above is related to the mapping of the QDMA MM and ST channels. You can revert to the onic-driver at commit "9e4f0b74bc69744d6d807115e9a23705ba967dbb". With that commit, however, there will be around 2-3% ping packet loss, because the pid assigned to the netdev occasionally exceeds 64. The later commits were meant to fix this issue but seem to have introduced other QDMA problems.
@zhguanw-amd Should I just wait for the new release?
I found that the QPN, PSN, and rkey are hard-coded in the tests. Does RecoNIC provide APIs that generate the QPN, rkey, PSN, etc.? I looked through the APIs, and the answer is probably no. RNICs from other vendors usually have their own algorithms for generating these values. Hard-coding them could be insecure, although some generation algorithms are insecure as well.
@qianyich The public release with the new RDMA IP will probably come around June or July this year, as we need to upgrade the Vivado version to support the new IP. We also need to change the current QDMA because of the Vivado upgrade.
Regarding the QPN, PSN, and rkey: what we provide in the tests is just a showcase that demonstrates how to use RecoNIC and its RDMA functionality through the user-space APIs. Designers or users can change these values according to their requirements, for example to address the security concerns you raised. I'll leave this to developers who use libreconic.
The more standard approach would be to go through the RDMA-core library, which hides those variables from users. We have a version of this at the moment, but a public release will take longer.
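As a rough illustration of what a caller could do instead of hard-coding the PSN, here is a minimal sketch that draws a random starting PSN from the kernel. `generate_initial_psn` is a hypothetical helper and is not part of libreconic. The 24-bit mask follows the PSN width in the InfiniBand/RoCE base transport header; whether the RecoNIC registers expect additional fields in the same word is not covered here.

```c
#include <stdint.h>
#include <sys/random.h>   /* getrandom(), Linux with glibc >= 2.25 */
#include <sys/types.h>

/* PSN is a 24-bit field in the InfiniBand/RoCE base transport header. */
#define PSN_MASK 0xFFFFFFu

/*
 * Hypothetical helper (not a libreconic API): fill *psn with a random
 * 24-bit starting packet sequence number.
 * Returns 0 on success, -1 if the kernel could not supply random bytes.
 */
static int generate_initial_psn(uint32_t *psn)
{
    uint32_t raw = 0;
    ssize_t n = getrandom(&raw, sizeof(raw), 0);

    if (n != (ssize_t)sizeof(raw))
        return -1;

    *psn = raw & PSN_MASK;
    return 0;
}
```

Each side would still have to send its chosen PSN to the peer before posting any work requests, since both ends must agree on the starting values.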
It seems most of the issues have been addressed, so I'm going to close this thread. Feel free to open another thread if you have more questions.
Fixed the QDMA queue mapping issue with the commit-a389dd
Fixed the large host memory mapping issue (>128GB, up to 1TB of host memory) with the commit-28f467
Please use the latest commit, as it contains enhancements and other fixes as well.
The system is up and running. I can ping the server from the client side and the client from the server side.
When I tried to run the rdma_test read and write tests, I got the following error.
I found this in lib/reconic.c:323. Is this due to insufficient huge pages? I guess I need to enable and configure the number of huge pages in Linux. How many huge pages do I need?
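For reference, Linux reports its huge page counters in /proc/meminfo (`HugePages_Total`, `HugePages_Free`, `Hugepagesize`). A minimal sketch for reading them follows; the struct and function names are hypothetical and not part of RecoNIC. It only shows what is currently reserved. How many pages the tests actually need depends on the buffer sizes they allocate, which this sketch does not determine.

```c
#include <stdio.h>

/* Hypothetical container for the huge page counters of interest. */
struct hugepage_info {
    long total;    /* HugePages_Total, -1 if not found */
    long free;     /* HugePages_Free,  -1 if not found */
    long size_kb;  /* Hugepagesize in kB, -1 if not found */
};

/*
 * Parse huge page counters from a stream formatted like /proc/meminfo.
 * Returns the number of fields found (0 to 3).
 */
static int parse_hugepage_info(FILE *fp, struct hugepage_info *info)
{
    char line[256];
    int found = 0;

    info->total = info->free = info->size_kb = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "HugePages_Total: %ld", &info->total) == 1)
            found++;
        else if (sscanf(line, "HugePages_Free: %ld", &info->free) == 1)
            found++;
        else if (sscanf(line, "Hugepagesize: %ld kB", &info->size_kb) == 1)
            found++;
    }
    return found;
}
```

In practice the stream would come from `fopen("/proc/meminfo", "r")`; a low `HugePages_Free` value before a test starts is a hint that the allocation may fail.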
Currently:
After configuring the number of huge pages to 1024, I get the following error:
This time it looks like the error comes from the read application at line 319, where
rc = read(sockfd, &read_A_offset, sizeof(read_A_offset));
returns a value that is not greater than 0. I am also confused about why a socket is involved here. My understanding is that RDMA has nothing to do with sockets.
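On the socket question: judging from the client log ("Client is connected to a remote server", "client received remote offset of A"), the socket appears to be used only to pass the remote buffer offset between the two hosts before the RDMA operation is posted. Exchanging such setup information over an ordinary TCP connection is a common pattern in RDMA programs. A minimal sketch of a more defensive receive loop follows; `read_full` is a hypothetical helper and not part of the RecoNIC examples. It separates "peer closed the connection" and real errors from short reads and interrupted calls.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read exactly len bytes from fd into buf.
 * Returns 0 on success, -1 if the peer closed the connection early
 * or read() failed with an error other than EINTR.
 */
static int read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);

        if (n > 0) {
            done += (size_t)n;      /* partial read: keep going */
        } else if (n == 0) {
            return -1;              /* peer closed before sending everything */
        } else if (errno != EINTR) {
            return -1;              /* real error; errno holds the cause */
        }
        /* n < 0 with EINTR: interrupted by a signal, retry */
    }
    return 0;
}
```

With a helper like this, a return of 0 from `read` shows up as an early close by the server, which usually means the server side exited or failed before sending the offset.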