Hi @pouya-haghi,
I assume that when you say basic benchmark you mean the benchmark design. If so, you won't be able to use that design with a NIC with the current examples unless you change them significantly, and you may even need to implement some UDP functionality on the host to replicate what the design does. This design was created to talk to another FPGA.
If you just want to verify the image transfer notebook with the basic design, you may need to use Dask to control both systems from the same machine. You can check the Dask notebooks for inspiration.
Mario
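As a rough sketch of what controlling both systems from a single machine can look like, assuming a dask-scheduler is already running and one dask-worker has been started on each host (the scheduler address below is a placeholder):

```python
from dask.distributed import Client

# Scheduler address is a placeholder; adjust to your network.
client = Client("tcp://192.168.1.100:8786")

# Both hosts should show up as workers.
print(client.scheduler_info()["workers"].keys())

# Quick sanity check: run a trivial function on every worker.
print(client.run(lambda: "alive"))
```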
Hi Mario,
Sorry for the confusing description; I meant the basic design. So, should I use Dask with vnx-basic.ipynb in my scenario, where a 100G NIC (in one host) communicates with an FPGA (in another host)?
Thank you!
Yes, I suggest that you set up the system with the FPGA as the remote worker.
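In that arrangement, a hedged sketch of how work can be pinned to the FPGA host from the controlling machine, assuming the dask-worker on the FPGA node was started with --name alveo (the name and scheduler address are placeholders):

```python
from dask.distributed import Client

client = Client("tcp://192.168.1.100:8786")  # placeholder scheduler address

def hostname():
    # Trivial task used only to confirm where the work actually ran.
    import socket
    return socket.gethostname()

# workers= pins the task to the named worker on the FPGA host;
# pure=False forces re-execution instead of reusing a cached result.
fut = client.submit(hostname, workers=["alveo"], pure=False)
print(fut.result())  # expected: the FPGA host's name
```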
Hi Mario,
I followed the instructions from this link to set up Dask. There are two physical nodes (node 0: 100 Gb NIC, node 1: FPGA) connected through a switch, and I'm running the basic design on the FPGA. I opened four terminals: dask-scheduler on node 0, dask-worker on node 0, dask-worker on node 1, and a Python session on node 0 to run the notebook code. I installed the latest version of conda (2022.10) on both nodes, which already ships Dask in the base environment, and made sure I'm using the same version on both nodes. I verified that the verify_workers function worked properly (for a simple addition function, not the one used in the image transfer notebook). I had the same issue as #35, so I sourced XRT in all four terminals and that resolved the error. Since %run dask_pynq.py gave me an error (I'm not sure what % does), I instead ran from dask_pynq import *. However, after running pynq.Overlay I got the following error. I would like to mention that programming the FPGA with the xclbin file works without Dask. I couldn't find a good solution online. I would appreciate it if you could help me with this. Thank you very much!
>>> ol_w1 = pynq.Overlay(xclbin, device=daskdev_w1)
/users/haghi/anaconda3/lib/python3.9/site-packages/distributed/worker.py:2845: UserWarning: Large object of size 49.10 MiB detected in task graph:
(b'xclbin2\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff ... ROR_DATA_END',)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
warnings.warn(
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/overlay.py", line 354, in __init__
self.download()
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/overlay.py", line 420, in download
super().download(self.parser)
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/bitstream.py", line 187, in download
self.device.download(self, parser)
File "/users/haghi/xup_vitis_network_example/Notebooks/dask_pynq.py", line 159, in download
self._call_dask(_download, bitstream_data)
File "/users/haghi/xup_vitis_network_example/Notebooks/dask_pynq.py", line 123, in _call_dask
return future.result()
File "/users/haghi/anaconda3/lib/python3.9/site-packages/distributed/client.py", line 280, in result
raise exc.with_traceback(tb)
File "/users/haghi/xup_vitis_network_example/Notebooks/dask_pynq.py", line 73, in _download
ol = pynq.Overlay(f.name)
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/overlay.py", line 336, in __init__
super().__init__(bitfile_name, dtbo, partial=False, device=device)
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/bitstream.py", line 111, in __init__
device = Device.active_device
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/device.py", line 93, in active_device
if len(cls.devices) == 0:
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/device.py", line 77, in devices
cls._devices.extend(DeviceMeta._subclasses[key]._probe_())
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/xrt_device.py", line 329, in _probe_
devices = [XrtDevice(i) for i in range(num)]
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/xrt_device.py", line 329, in <listcomp>
devices = [XrtDevice(i) for i in range(num)]
File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/xrt_device.py", line 349, in __init__
self._loop = asyncio.get_event_loop()
File "/users/haghi/anaconda3/lib/python3.9/asyncio/events.py", line 642, in get_event_loop
raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Dask-Default-Threads-25449-0'.
======================================================
OS: Ubuntu 18.04.1 LTS, bionic
XRT version: 2.11.634 (Build Version Branch: 2021.1)
Vitis version: 2021.1
PYNQ version: I typed “pip3 install pynq==2.8.0.dev0” on both machines.
DASK version:
dask 2022.7.0 py39h06a4308_0
dask-core 2022.7.0 py39h06a4308_0
distributed 2022.7.0 py39h06a4308_0
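For anyone hitting the same RuntimeError: the traceback shows pynq's XrtDevice calling asyncio.get_event_loop() from a Dask task thread, and on Python 3.9 a thread other than the main one has no event loop by default. A minimal, untested workaround sketch is to create a loop in that thread before pynq probes the device, for example at the top of the _download helper in dask_pynq.py:

```python
import asyncio

def _ensure_event_loop():
    """Create an event loop for the current thread if it has none.

    Dask runs tasks in worker threads, which have no event loop on
    Python 3.9+, while pynq's XrtDevice expects asyncio.get_event_loop()
    to succeed.
    """
    try:
        asyncio.get_event_loop()
    except RuntimeError:
        asyncio.set_event_loop(asyncio.new_event_loop())

# Call _ensure_event_loop() first inside the function that Dask ships to
# the worker (e.g. before pynq.Overlay(...) in dask_pynq.py's _download).
```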
Hi @pouya-haghi,
I haven't tested with this particular environment, so I am not 100% sure what the problem could be. Unfortunately, I do not have the time to try to reproduce it.
My suggestions would be:
%run imports the code of the file into the notebook, so you can then execute it with the distributed workers: https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-run
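For illustration, the two variants side by side (assuming dask_pynq.py is in the notebook's working directory):

```
# In a Jupyter/IPython cell, %run executes dask_pynq.py and puts its
# top-level definitions into the notebook namespace.
%run dask_pynq.py

# In a plain `python` session %run is unavailable (it is an IPython magic),
# so the closest equivalent is a regular import:
# from dask_pynq import *
```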
Hi Mario,
Thank you so much for your response! I found the first suggestion to be a good solution for my case, so I followed it. However, I found that the program waits at s2mm_wh.wait(). I thought the issue might not be caused by Dask, so I tried a simpler experiment (the basic design with only one node, where both the FPGA and the 100G NIC are in the same host) without Dask. I followed vnx-basic.ipynb exactly and encountered the same problem: !ping -c 5 $alveo_ipaddr didn't work, and then the program waits at s2mm_wh.wait() indefinitely until I press Ctrl+C, so again it doesn't work. I'm not sure whether this problem belongs in a separate issue, but I'm posting it here anyway. Since, in my setup, each node has three network devices connected to a switch (one 100G NIC, one 40G NIC, and an FPGA with two 100Gb ports), I even tried !ping -I <name_of_100gNIC_interface> -c 5 $alveo_ipaddr (to avoid using the 40G NIC), but again with no success. In my experiments, I set the IP address of the 100G NIC with ifconfig <name_of_100gNIC_interface> <ip_address> netmask 255.255.255.0 up and made sure that the FPGA and the 100G NIC are in the same subnet. I would appreciate any comments. I fully understand that your time is limited, and it's fine if you can't give me any hints.
Many thanks!
PS: While it might not be very relevant, I have tested the benchmark design with my setup and it works.
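For reference, the interface setup and ping described above, written as notebook cells; the interface name is a hypothetical placeholder for the 100G NIC:

```
nic_if = "ens1f0"                # hypothetical 100G NIC interface name
nic_ipaddr = "198.22.255.174"    # NIC address used in this thread
alveo_ipaddr = "198.22.255.12"   # Alveo U280 address used in this thread

# Put the NIC in the same /24 subnet as the Alveo port (may need sudo),
# then force ping to leave through that specific interface.
!sudo ifconfig $nic_if $nic_ipaddr netmask 255.255.255.0 up
!ping -I $nic_if -c 5 $alveo_ipaddr
```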
@pouya-haghi,
Please open a separate issue and answer these questions:
Does arping work?
Mario
Thank you Mario, sure, let me create a new issue.
Hi @mariodruiz
After resolving issue #94 (installing the 100G NIC driver), I was able to get a two-node system (FPGA and 100G NIC in separate servers) working with the basic design, without having to use Dask. I'm going to close this issue. Thank you for your help!
Hi,
First, I would like to thank you for such a great repository. My question is: is it possible to run the basic benchmark (vnx-basic.ipynb) in a setup where the FPGA and a 100Gb NIC are not in the same host? If not, what extensions/changes would be needed to make this happen? In my scenario, an FPGA is attached to one node and a 100Gb NIC is in another node; they are connected through a switch. I made sure to assign the IP addresses in the same subnet (the NIC was 198.22.255.174 and the Alveo U280 was 198.22.255.12). I tested it, but it didn't work. I would appreciate any hints on modifying the basic benchmark so that I can run a test (sending/receiving data) in this scenario.
Thank you very much!