Xilinx / Vitis_Accel_Examples

Deadlock within rtl_vadd_2clks example, when setting DATA_SIZE to 1025 in host.cpp #23

Status: Closed. NicolasBondouxA closed this issue 3 years ago.

NicolasBondouxA commented 4 years ago

Hello,

How to reproduce: I took the code from the 2019.2 branch and compiled it with Vitis 2019.2.

The problem was observed on the xilinx_u200_xdma_201830_2 platform with XRT 2020.1. After changing DATA_SIZE to 1025 in the host code and running the example several times, the process eventually hangs (most often two runs are enough; the first run always succeeds).

Here is the xbutil query trace, taken before and after an xbutil reset:

INFO: Found total 1 card(s), 1 are usable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
System Configuration
OS name:    Linux
Release:    3.10.0-1127.13.1.el7.x86_64
Version:    #1 SMP Fri Jun 12 14:34:17 EDT 2020
Machine:    x86_64
Model:      PowerEdge T630
CPU cores:  24
Memory:     257693 MB
Glibc:      2.27
Distribution:   Ubuntu 18.04.5 LTS
Now:        Wed Aug 26 15:36:22 2020
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XRT Information
Version:    2.6.655
Git Hash:   2d6bfe4ce91051d4e5b499d38fc493586dd4859a
Git Branch: 2020.1
Build Date: 2020-05-22 12:05:03
XOCL:       2.6.655,2d6bfe4ce91051d4e5b499d38fc493586dd4859a
XCLMGMT:    2.6.655,2d6bfe4ce91051d4e5b499d38fc493586dd4859a

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shell                           FPGA                            IDCode
xilinx_u200_xdma_201830_2       xcu200-fsgd2104-2-e             0x14b37093
Vendor          Device          SubDevice       SubVendor       SerNum          
0x10ee          0x5001          0x000e          0x10ee          2129048AJ017    
DDR size        DDR count       Clock0          Clock1          Clock2          
64 GB           4               150             250             0               
PCIe            DMA chan(bidir) MIG Calibrated  P2P Enabled     OEM ID          
GEN 3x16        2               true            false           0xc6f50640(N/A) 
DNA

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Temperature(C)
PCB TOP FRONT   PCB TOP REAR    PCB BTM FRONT   VCCINT TEMP     
41              35              40              N/A             
FPGA TEMP       TCRIT Temp      FAN Presence    FAN Speed(RPM)  
42              40              A               1110            
QSFP 0          QSFP 1          QSFP 2          QSFP 3          
N/A             N/A             N/A             N/A             
HBM TEMP        
N/A             
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Electrical(mV|mA)
12V PEX         12V AUX         12V PEX Current 12V AUX Current 
12265           12257           1124            969             
3V3 PEX         3V3 AUX         DDR VPP BOTTOM  DDR VPP TOP     
3363            3300            2500            2500            
SYS 5V5         1V2 TOP         1V8 TOP         0V85            
5471            1206            1835            857             
MGT 0V9         12V SW          MGT VTT         1V2 BTM         
909             12190           1202            1202            
VCCINT VOL      VCCINT CURR     VCCINT IO VOL   VCC3V3 VOL      
851             12240           N/A             N/A             
3V3 PEX CURR    VCCINT IO CURR  HBM1V2 VOL      VPP2V5 VOL      
N/A             N/A             N/A             N/A             
VCC1V2 CURR     V12 I CURR      V12 AUX0 CURR   V12 AUX1 CURR   
N/A             N/A             N/A             N/A             
12V AUX1 VOL    VCCAUX VOL      VCCAUX PMC VOL  VCCRAM VOL      
N/A             N/A             N/A             N/A             
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Card Power(W)
25
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Firewall Last Error Status
Level 0 : 0x0(GOOD)

ECC Error Status
Tag     Errors      CE Count  UE Count  CE FFA              UE FFA              
bank1   (None)      0         0         0x0                 0x0                 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Memory Status
     Tag         Type        Temp(C)  Size    Mem Usage       BO count
[ 0] bank0       **UNUSED**  33       16 GB   0 Byte          0       
[ 1] bank1       MEM_DDR4    36       16 GB   24576 Byte      3       
[ 2] bank2       **UNUSED**  39       16 GB   0 Byte          0       
[ 3] bank3       **UNUSED**  36       16 GB   0 Byte          0       
[ 4] PLRAM[0]    **UNUSED**  N/A      128 KB  0 Byte          0       
[ 5] PLRAM[1]    **UNUSED**  N/A      128 KB  0 Byte          0       
[ 6] PLRAM[2]    **UNUSED**  N/A      128 KB  0 Byte          0       
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DMA Transfer Metrics
Chan[0].h2c:  17175 KB
Chan[0].c2h:  272 KB
Chan[1].h2c:  8200 Byte
Chan[1].c2h:  0 Byte
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Streams
     Tag         Flow ID  Route ID Status   Total (B/#)     Pending (B/#)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Xclbin UUID
7f4e0c0e-8c43-4b75-9c57-a4a495b6d6fb
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compute Unit Status
CU[ 0]: krnl_vadd_rtl:krnl_vadd_rtl_1   @0x1800000         (START)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: xbutil query succeeded.
nbondoux@46a6ad77bae8:/projects/nbondoux/erbium_master/xilinx_work$ xbutil reset
All existing processes will be killed.
Are you sure you wish to proceed? [y/n]: y
nbondoux@46a6ad77bae8:/projects/nbondoux/erbium_master/xilinx_work$ xbutil query
INFO: Found total 1 card(s), 1 are usable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
System Configuration
OS name:    Linux
Release:    3.10.0-1127.13.1.el7.x86_64
Version:    #1 SMP Fri Jun 12 14:34:17 EDT 2020
Machine:    x86_64
Model:      PowerEdge T630
CPU cores:  24
Memory:     257693 MB
Glibc:      2.27
Distribution:   Ubuntu 18.04.5 LTS
Now:        Wed Aug 26 15:37:27 2020
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XRT Information
Version:    2.6.655
Git Hash:   2d6bfe4ce91051d4e5b499d38fc493586dd4859a
Git Branch: 2020.1
Build Date: 2020-05-22 12:05:03
XOCL:       2.6.655,2d6bfe4ce91051d4e5b499d38fc493586dd4859a
XCLMGMT:    2.6.655,2d6bfe4ce91051d4e5b499d38fc493586dd4859a

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shell                           FPGA                            IDCode
xilinx_u200_xdma_201830_2       xcu200-fsgd2104-2-e             0x14b37093
Vendor          Device          SubDevice       SubVendor       SerNum          
0x10ee          0x5001          0x000e          0x10ee          2129048AJ017    
DDR size        DDR count       Clock0          Clock1          Clock2          
64 GB           4               150             250             0               
PCIe            DMA chan(bidir) MIG Calibrated  P2P Enabled     OEM ID          
GEN 3x16        2               true            false           0x200f0640(N/A) 
DNA

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Temperature(C)
PCB TOP FRONT   PCB TOP REAR    PCB BTM FRONT   VCCINT TEMP     
41              35              41              N/A             
FPGA TEMP       TCRIT Temp      FAN Presence    FAN Speed(RPM)  
42              40              A               1110            
QSFP 0          QSFP 1          QSFP 2          QSFP 3          
N/A             N/A             N/A             N/A             
HBM TEMP        
N/A             
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Electrical(mV|mA)
12V PEX         12V AUX         12V PEX Current 12V AUX Current 
12242           12286           1152            961             
3V3 PEX         3V3 AUX         DDR VPP BOTTOM  DDR VPP TOP     
3366            3303            2500            2500            
SYS 5V5         1V2 TOP         1V8 TOP         0V85            
5496            1203            1842            855             
MGT 0V9         12V SW          MGT VTT         1V2 BTM         
908             12201           1203            1203            
VCCINT VOL      VCCINT CURR     VCCINT IO VOL   VCC3V3 VOL      
851             12306           N/A             N/A             
3V3 PEX CURR    VCCINT IO CURR  HBM1V2 VOL      VPP2V5 VOL      
N/A             N/A             N/A             N/A             
VCC1V2 CURR     V12 I CURR      V12 AUX0 CURR   V12 AUX1 CURR   
N/A             N/A             N/A             N/A             
12V AUX1 VOL    VCCAUX VOL      VCCAUX PMC VOL  VCCRAM VOL      
N/A             N/A             N/A             N/A             
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Card Power(W)
25
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Firewall Last Error Status
Level 0 : 0x0(GOOD)

ECC Error Status
Tag     Errors      CE Count  UE Count  CE FFA              UE FFA              
bank1   (None)      0         0         0x0                 0x0                 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Memory Status
     Tag         Type        Temp(C)  Size    Mem Usage       BO count
[ 0] bank0       **UNUSED**  33       16 GB   0 Byte          0       
[ 1] bank1       MEM_DDR4    36       16 GB   24576 Byte      3       
[ 2] bank2       **UNUSED**  39       16 GB   0 Byte          0       
[ 3] bank3       **UNUSED**  36       16 GB   0 Byte          0       
[ 4] PLRAM[0]    **UNUSED**  N/A      128 KB  0 Byte          0       
[ 5] PLRAM[1]    **UNUSED**  N/A      128 KB  0 Byte          0       
[ 6] PLRAM[2]    **UNUSED**  N/A      128 KB  0 Byte          0       
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DMA Transfer Metrics
Chan[0].h2c:  8200 Byte
Chan[0].c2h:  4100 Byte
Chan[1].h2c:  8200 Byte
Chan[1].c2h:  0 Byte
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Streams
     Tag         Flow ID  Route ID Status   Total (B/#)     Pending (B/#)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Xclbin UUID
7f4e0c0e-8c43-4b75-9c57-a4a495b6d6fb
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compute Unit Status
CU[ 0]: krnl_vadd_rtl:krnl_vadd_rtl_1   @0x1800000         (START)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: xbutil query succeeded.

Thanks, Nicolas

virata-xilinx commented 4 years ago

Hi,

Since you are using the 2019.2 branch, could you please try sourcing the 2019.2 versions of Vitis and XRT? Also, does the original example work for you (with DATA_SIZE = 256)?

NicolasBondouxA commented 4 years ago

Hi,

The original example works for any DATA_SIZE value other than 1025!

I could also reproduce the problem on AWS F1, which uses Vitis and XRT 2019.2. On F1, the [xocl-scheduler-] process uses 100% CPU during the second run, which is blocked; any subsequent run then fails.

I have this concern because my project makes use of this code example, but with random batch sizes, and I observed deadlocks for certain batch sizes. So I tested rtl_vadd_2clks with different batch sizes and ran into the same issue. I think the problem may happen when the buffer written by the FPGA is just slightly larger than an XDMA burst.

virata-xilinx commented 4 years ago

Hi,

I tried to reproduce the issue. I ran the example in hw_emu for u200_xdma_201830_2 after sourcing the 2019.2 versions of Vitis and XRT. I ran the example three times but was not able to reproduce the issue. Could you please list the steps you took?

NicolasBondouxA commented 4 years ago

Hi,

I forgot to mention that the problem does not happen in hardware emulation, only with TARGET=hw. After changing DATA_SIZE to 1025 in host.cpp, I run:

make all TARGET=hw DEVICE=$myDevice

make check TARGET=hw DEVICE=$myDevice

And that's it. Thanks,

Nicolas

heeran-xilinx commented 3 years ago

Hi @NicolasBondouxA, could you please post your query to the Xilinx Forum in case you are still facing the same issue? https://forums.xilinx.com/t5/Vitis-Acceleration-SDAccel-SDSoC/bd-p/tools_v

-Heera