Xilinx / XRT

Run Time for AIE and FPGA based platforms
https://xilinx.github.io/XRT

Potential infinite loop in unix_socket::sk_read #6180

Open lforg37 opened 2 years ago

lforg37 commented 2 years ago

It seems that unix_socket::sk_read in runtime_src/core/pcie/emulation/common_em/unix_socket.cxx does not take into account the possibility of having less data on the socket than required.

The (r = read(fd, buf + rlen, count - rlen)) < 0 condition is never true when the peer has closed the socket: read then returns 0 (end of file), so 0 is assigned to r, rlen never advances, and the loop never terminates.
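For illustration, here is a minimal sketch of a read loop that treats a return value of 0 as end of stream instead of retrying. It is not the XRT implementation: the function name read_fully and its return convention are assumptions made for this example, and only the POSIX read call is taken as given.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper, not the XRT code: try to read exactly `count` bytes
// from `fd`. Returns the number of bytes actually read, which is smaller
// than `count` only if the peer closed the connection, or -1 on error.
ssize_t read_fully(int fd, void* rbuf, size_t count)
{
  char* buf = static_cast<char*>(rbuf);
  size_t rlen = 0;
  while (rlen < count) {
    ssize_t r = read(fd, buf + rlen, count - rlen);
    if (r < 0) {
      if (errno == EINTR)
        continue;  // interrupted by a signal: retry
      return -1;   // real error, errno is set by read
    }
    if (r == 0)
      break;       // end of file: the peer closed, stop instead of spinning
    rlen += static_cast<size_t>(r);
  }
  return static_cast<ssize_t>(rlen);
}
```

With this shape the caller can compare the returned length with count and report a short read, rather than blocking or spinning inside the loop.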

This behaviour has been observed with standard code. I have not found out why the socket sometimes contains less data than expected. The same program may or may not freeze from one run to the next, so there appears to be a race condition here.

XRT version : 4c83637fd4d4041a5cd4872a1391f812e54e143e
Alveo platform : xilinx_u200_gen3x16_xdma_1_202110_1

stack trace when blocked :

#1  __GI___libc_read (fd=8, buf=0x55555559e200, nbytes=9) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007ffff7085701 in unix_socket::sk_read (this=0x555555593d60, rbuf=0x55555559e200, count=9)
    at XRT/src/runtime_src/core/pcie/emulation/common_em/unix_socket.cxx:131
#3  0x00007ffff7020f2f in xclhwemhal2::HwEmShim::xclFreeDeviceBuffer (this=0x55555558e0b0, offset=34359742464, sendtoxsim=true)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/shim.cxx:1665
#4  0x00007ffff702d322 in xclhwemhal2::HwEmShim::xclFreeBO (this=0x55555558e0b0, boHandle=2)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/shim.cxx:3128
#5  0x00007ffff6ff623d in operator() (__closure=0x7fffffffd800)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/halapi.cxx:155
#6  0x00007ffff6ff62b4 in xdp::hw_emu::trace::profiling_wrapper<xclFreeBO(xclDeviceHandle, unsigned int)::<lambda()> >(const char *, struct {...} &&) (function=0x7ffff711e9fd "xclFreeBO", f=...)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/plugin/xdp/hal_trace.h:79
#7  0x00007ffff6ff6334 in xclFreeBO (handle=0x55555558e0b0, boHandle=2)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/halapi.cxx:151
#8  0x00007ffff6ff2057 in xrt_core::shim<xrt_core::device_pcie>::free_bo (this=0x555555593730, bo=2)
    at XRT/src/runtime_src/core/common/ishim.h:282
#9  0x00007ffff7d80a4e in xrt::bo_impl::~bo_impl (this=0x5555555b9b60, __in_chrg=<optimized out>)
    at XRT/src/runtime_src/core/common/api/xrt_bo.cpp:227
#10 0x00007ffff7d986ec in xrt::buffer_hbuf::~buffer_hbuf (this=0x5555555b9b60, __in_chrg=<optimized out>)
    at XRT/src/runtime_src/core/common/api/xrt_bo.cpp:448

stsoe commented 2 years ago

Hi @akasat, please help assign this issue properly. I am not sure why you removed the assignment without assigning someone else.

keryell commented 2 years ago

I created a work-around for this in https://github.com/Xilinx/XRT/pull/6269

keryell commented 2 years ago

This is tracked internally with https://jira.xilinx.com/browse/CR-1120194 and there is a non-SYCL pure XRT & HLS reproducer example in https://jira.xilinx.com/browse/XRT-937

venkatp-xilinx commented 2 years ago

@sgundime-xilinx identified the issue, and a fix is in progress: the order in which messageThread and unix_socket are created has been changed. With this fix, we no longer see any crash or segfault. We will create the PR shortly.

keryell commented 11 months ago

What was the PR fixing this?

sgundime-xilinx commented 11 months ago

The issue was resolved by introducing a monitoring flag, backed by a check that runs periodically. The read/write calls are guarded by this flag before they are actually issued. If a client or server gets disconnected, the thread is notified through the flag. This issue was addressed and resolved under CR-1120194.
PR: https://github.com/Xilinx/XRT/pull/6623
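
To make the idea of a flag-guarded read concrete, here is a rough sketch. It is not the code from the PR: the guarded_socket type, the connected flag, and the poll timeout are all hypothetical names and choices made for this example. It assumes some monitor (not shown) clears the flag when it detects a disconnect, and it uses POSIX poll so that the reader wakes up regularly to recheck the flag.

```cpp
#include <atomic>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical sketch, not the XRT implementation.
struct guarded_socket {
  int fd;
  // Assumed to be cleared by a monitoring thread when the peer disconnects.
  std::atomic<bool> connected{true};

  explicit guarded_socket(int f) : fd(f) {}

  // Read up to `count` bytes. Returns the bytes read (short if the peer
  // closed), or -1 on error or if the flag says we are disconnected.
  ssize_t sk_read(void* rbuf, size_t count, int poll_ms = 100)
  {
    char* buf = static_cast<char*>(rbuf);
    size_t rlen = 0;
    while (rlen < count) {
      if (!connected.load())
        return -1;                       // flag check before the real call
      pollfd p{fd, POLLIN, 0};
      int ready = poll(&p, 1, poll_ms);  // bounded wait, then recheck flag
      if (ready < 0) {
        if (errno == EINTR)
          continue;
        return -1;
      }
      if (ready == 0)
        continue;                        // timeout: nothing to read yet
      ssize_t r = read(fd, buf + rlen, count - rlen);
      if (r < 0) {
        if (errno == EINTR)
          continue;
        return -1;
      }
      if (r == 0) {                      // peer closed: record it and stop
        connected.store(false);
        break;
      }
      rlen += static_cast<size_t>(r);
    }
    return static_cast<ssize_t>(rlen);
  }
};
```

The bounded poll is one possible design choice: it keeps the reader from blocking indefinitely in read, so a flag cleared by another thread is noticed within one timeout period.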