Xilinx / xup_vitis_network_example

VNx: Vitis Network Examples

vnx-benchmark-throughput-switch on U280 failed #2

Closed csunclechen closed 4 years ago

csunclechen commented 4 years ago

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cjj/.local/lib/python3.6/site-packages/pynq/pl_server/server.py", line 542, in server_proc
    server = Listener(self.socket_name, family='AF_UNIX', authkey=self.key)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 438, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 576, in __init__
    self._socket.bind(address)
FileNotFoundError: [Errno 2] No such file or directory

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cjj/.local/lib/python3.6/site-packages/pynq/pl_server/server.py", line 542, in server_proc
    server = Listener(self.socket_name, family='AF_UNIX', authkey=self.key)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 438, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 576, in __init__
    self._socket.bind(address)
FileNotFoundError: [Errno 2] No such file or directory


ConnectionError                           Traceback (most recent call last)
in
      4 xclbin = '../benchmark.intf0.xilinx_u280_xdma_201920_3/vnx_benchmark_if0.xclbin'
      5 #overlay_1 = pynq.Overlay(xclbin, device=pynq.Device.devices[0])
----> 6 ol_w0 = pynq.Overlay(xclbin, device=daskdev_w0)
      7 ol_w1 = pynq.Overlay(xclbin, device=daskdev_w1)

~/.local/lib/python3.6/site-packages/pynq/overlay.py in __init__(self, bitfile_name, dtbo, download, ignore_version, device)
    342
    343         if download:
--> 344             self.download()
    345
    346         self.__doc__ = _build_docstring(self._ip_map._description,

~/.local/lib/python3.6/site-packages/pynq/overlay.py in download(self, dtbo)
    401                     Clocks.set_pl_clk(i)
    402
--> 403         super().download(self.parser)
    404         if dtbo:
    405             super().insert_dtbo(dtbo)

~/.local/lib/python3.6/site-packages/pynq/bitstream.py in download(self, parser)
    152
    153         """
--> 154         self.device.download(self, parser)
    155
    156     def remove_dtbo(self):

in download(self, bitstream, parser)
    102             bitstream_data = f.read()
    103         self._call_dask(_download, bitstream_data)
--> 104         super().post_download(bitstream, parser)
    105
    106     def get_memory_by_idx(self, idx):

~/.local/lib/python3.6/site-packages/pynq/pl_server/device.py in post_download(self, bitstream, parser)
    441             t.year, t.month, t.day,
    442             t.hour, t.minute, t.second, t.microsecond)
--> 443         self.reset(parser, bitstream.timestamp, bitstream.bitfile_name)
    444
    445     def has_capability(self, cap):

~/.local/lib/python3.6/site-packages/pynq/pl_server/device.py in reset(self, parser, timestamp, bitfile_name)
    316
    317         """
--> 318         self._client.reset(parser, timestamp, bitfile_name)
    319
    320     def clear_dict(self):

~/.local/lib/python3.6/site-packages/pynq/pl_server/server.py in reset(self, parser, timestamp, bitfile_name)
    250
    251         """
--> 252         self.client_request()
    253         if parser is not None:
    254             self._ip_dict = parser.ip_dict

~/.local/lib/python3.6/site-packages/pynq/pl_server/server.py in client_request(self)
    489         except FileNotFoundError:
    490             raise ConnectionError(
--> 491                 "Could not connect to PL server") from None
    492         self._bitfile_name, self._timestamp, \
    493             self._ip_dict, self._gpio_dict, \

ConnectionError: Could not connect to PL server
csunclechen commented 4 years ago

I was trying to test the benchmark. I have two host servers and two Alveo U280 cards. Each host connects to one card through PCIe, and both cards connect to the switch through QSFP28. Have you seen this problem before?

mariodruiz commented 4 years ago

Hi @csunclechen

Can you run this piece of code in the jupyter notebook?

for i in range(len(pynq.Device.devices)):
    print("{}) {}".format(i, pynq.Device.devices[i].name))

If you get a list of Alveo platforms, everything is OK with the pynq and XRT environment.

The next thing to consider is the DaskDevice class, in particular the __init__ method. I always give the workers a name when creating the cluster. However, if no name is given to a worker, the default name is tcp://... and this default name produces an error.

Can you update the __init__ method of the DaskDevice class as shown below? You will need to import re as well.

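import re  # needed for re.sub below

def __init__(self, client, worker):
    """The worker ID should be unique

    """
    # Replace every non-word character (such as those in a default
    # "tcp://..." worker name) with an underscore
    worker_id = re.sub(r'[^\w]', '_', worker)
    super().__init__("dask-" + worker_id)
    self._dask_client = client
    self._worker = worker
    self.capabilities = {
        'REGISTER_RW': True,
        'CALLABLE': True
    }
    self._streams = {}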
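To see why the substitution helps, here is a minimal sketch that uses only the Python standard library. It assumes (this is not confirmed in the thread) that the device name ends up inside the path of the AF_UNIX socket that the PL server binds. Under that assumption, the slashes in a default tcp://... name point at a directory that does not exist, which would match the FileNotFoundError in the traceback above. The worker address and the helper names sanitize_worker_name and can_bind are hypothetical and chosen only for illustration.

```python
import os
import re
import socket
import tempfile


def sanitize_worker_name(worker):
    """Replace every non-word character with an underscore."""
    return re.sub(r'[^\w]', '_', worker)


def can_bind(socket_path):
    """Return True if a Unix domain socket can be bound at socket_path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(socket_path)
        return True
    except FileNotFoundError:
        # A parent directory in the path does not exist
        return False
    finally:
        sock.close()


# Hypothetical default worker name, as assigned when no name is given
default_name = "tcp://10.0.0.1:40001"

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "dask-" + default_name)
    clean_path = os.path.join(tmp, "dask-" + sanitize_worker_name(default_name))
    print("raw name binds:      ", can_bind(raw_path))
    print("sanitized name binds:", can_bind(clean_path))
```

On Linux, the raw name should fail to bind because the path contains a "dask-tcp:" directory component that does not exist, while the sanitized name contains no slashes and binds normally.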
mariodruiz commented 4 years ago

@csunclechen,

I was able to reproduce the issue and pushed a bug fix. Can you try the latest notebooks?