Xilinx / xup_vitis_network_example

VNx: Vitis Network Examples

Running vnx-basic notebook with an FPGA and a 100Gb NIC in different hosts #93

Closed pouya-haghi closed 1 year ago

pouya-haghi commented 2 years ago

Hi,

First, I would like to thank you for such a great repository. My question is: is it possible to run the basic benchmark (vnx-basic.ipynb) with the FPGA and a 100Gb NIC in different hosts? If not, what extensions/changes would be needed to make this happen? In my scenario, an FPGA is attached to one node and a 100Gb NIC is in another node; they are connected through a switch. I made sure to assign IP addresses in the same subnet (the NIC was '198.22.255.174' and the Alveo U280 was '198.22.255.12'). I tested it, but it didn't work. I would appreciate any hints on modifying the basic benchmark to run a send/receive test in this scenario.

Thank you very much!

mariodruiz commented 2 years ago

Hi @pouya-haghi,

I assume that by basic benchmark you mean the benchmark design. If so, you won't be able to use this design with a NIC with the current examples unless you change them significantly; you may even need to implement some UDP functionality on the host to replicate what the design does. This design was created to talk to another FPGA.

If you just want to verify the image transfer notebook with the basic design, you may need to use Dask to control both systems from the same machine. You can check the Dask notebooks for inspiration.
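
As a rough illustration of the idea (a minimal sketch; the scheduler address below is a placeholder, and the Dask notebooks show the full flow):

```python
# Minimal sketch: drive both machines from one Python session through Dask.
# The scheduler address is a hypothetical placeholder for your own setup.
from dask.distributed import Client

client = Client("tcp://192.168.1.100:8786")   # dask-scheduler on one host
print(client.scheduler_info()["workers"])     # both workers should appear here

# Work submitted through the client executes on a worker process:
future = client.submit(lambda a, b: a + b, 1, 2)
print(future.result())
```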

Mario

pouya-haghi commented 2 years ago

Hi Mario,

Sorry for the unclear description. I meant the basic design. So, should I use Dask with vnx-basic.ipynb in my scenario, where a 100G NIC (in one host) communicates with an FPGA (in another host)?

Thank you!

mariodruiz commented 2 years ago

Yes, I suggest that you set up the system with the FPGA as the remote worker.
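
A minimal sketch of what that could look like, assuming the DaskDevice helper from this repo's Notebooks/dask_pynq.py (the scheduler address, worker name, and xclbin path are hypothetical placeholders; check the Dask notebooks for the exact API):

```python
# Sketch: program the Alveo card on the remote Dask worker from the local host.
# All addresses and names below are hypothetical placeholders.
import pynq
from dask.distributed import Client
from dask_pynq import DaskDevice  # helper from this repo's Notebooks folder

client = Client("tcp://192.168.1.100:8786")       # Dask scheduler
daskdev_w1 = DaskDevice(client, "fpga-worker")    # worker running on the FPGA host
xclbin = "basic.xclbin"                           # path to the basic design bitstream
ol_w1 = pynq.Overlay(xclbin, device=daskdev_w1)   # downloads on the remote worker
```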

pouya-haghi commented 2 years ago

Hi Mario,

I followed the instructions from this link to set up Dask. There are two physical nodes (node 0: 100Gb NIC, node 1: FPGA) connected through a switch, and I'm running the basic design on the FPGA. I opened four terminals: dask-scheduler on node 0, dask-worker on node 0, dask-worker on node 1, and a Python session on node 0 to run the notebook code.

I installed the latest version of Anaconda (2022.10) on both nodes, which already includes Dask in the base environment, and made sure I'm using the same version on both nodes. I verified that the verify_workers function worked properly (for a simple addition function, not the one used in the image transfer notebook). I hit the same issue as #35, so I sourced XRT in all four terminals, which resolved that error. Since %run dask_pynq.py gave me an error (I'm not sure what % does), I instead ran from dask_pynq import *. However, after running pynq.Overlay I got the error below. Note that programming the FPGA with the xclbin file works fine without Dask. I couldn't find a good solution online; I would appreciate it if you could help me with that. Thank you very much!

>>> ol_w1 = pynq.Overlay(xclbin, device=daskdev_w1)
/users/haghi/anaconda3/lib/python3.9/site-packages/distributed/worker.py:2845: UserWarning: Large object of size 49.10 MiB detected in task graph: 
  (b'xclbin2\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff ... ROR_DATA_END',)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers
    future = client.submit(func, big_data)    # bad
    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/overlay.py", line 354, in __init__
    self.download()
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/overlay.py", line 420, in download
    super().download(self.parser)
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/bitstream.py", line 187, in download
    self.device.download(self, parser)
  File "/users/haghi/xup_vitis_network_example/Notebooks/dask_pynq.py", line 159, in download
    self._call_dask(_download, bitstream_data)
  File "/users/haghi/xup_vitis_network_example/Notebooks/dask_pynq.py", line 123, in _call_dask
    return future.result()
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/distributed/client.py", line 280, in result
    raise exc.with_traceback(tb)
  File "/users/haghi/xup_vitis_network_example/Notebooks/dask_pynq.py", line 73, in _download
    ol = pynq.Overlay(f.name)
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/overlay.py", line 336, in __init__
    super().__init__(bitfile_name, dtbo, partial=False, device=device)
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/bitstream.py", line 111, in __init__
    device = Device.active_device
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/device.py", line 93, in active_device
    if len(cls.devices) == 0:
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/device.py", line 77, in devices
    cls._devices.extend(DeviceMeta._subclasses[key]._probe_())
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/xrt_device.py", line 329, in _probe_
    devices = [XrtDevice(i) for i in range(num)]
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/xrt_device.py", line 329, in <listcomp>
    devices = [XrtDevice(i) for i in range(num)]
  File "/users/haghi/anaconda3/lib/python3.9/site-packages/pynq/pl_server/xrt_device.py", line 349, in __init__
    self._loop = asyncio.get_event_loop()
  File "/users/haghi/anaconda3/lib/python3.9/asyncio/events.py", line 642, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Dask-Default-Threads-25449-0'.

======================================================
OS: Ubuntu 18.04.1 LTS (bionic)
XRT version: 2.11.634 (Build Version Branch: 2021.1)
Vitis version: 2021.1
PYNQ version: I typed `pip3 install pynq==2.8.0.dev0` on both machines.
Dask versions: dask 2022.7.0, dask-core 2022.7.0, distributed 2022.7.0 (py39h06a4308_0)

mariodruiz commented 2 years ago

Hi @pouya-haghi,

I haven't tested this particular environment, so I am not 100% sure what the problem could be. Unfortunately, I do not have the time to try to reproduce it.

My suggestions would be:

  1. Run the notebook on the machine with the Alveo card and use Dask to work with the machine with the NIC.
  2. Try the workaround for the asyncio event loop: https://stackoverflow.com/a/46750562 (see the sketch at the end of this comment).
  3. Use an older version of Anaconda and PYNQ 2.7.0: https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh

%run imports the code of the file into the notebook, so you can execute it in a distributed way: https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-run
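
For suggestion 2, a minimal sketch of the workaround from that Stack Overflow answer: create and register an event loop in the current thread before PYNQ probes the XRT devices. This would go inside the function that the Dask worker executes, before calling pynq.Overlay.

```python
# Workaround sketch: Dask worker threads have no asyncio event loop by default,
# which is what makes pynq's XrtDevice __init__ raise the RuntimeError above.
import asyncio

try:
    asyncio.get_event_loop()
except RuntimeError:
    # No event loop in this (non-main) thread: create and register one.
    asyncio.set_event_loop(asyncio.new_event_loop())
```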

pouya-haghi commented 2 years ago

Hi Mario,

Thank you so much for your response! Your first suggestion looked like a good solution for my case, so I followed it. However, the program waits at s2mm_wh.wait(). I thought the issue might not be caused by Dask, so I tried a simpler experiment without Dask: the basic design with only one node, with both the FPGA and the 100G NIC in the same host, following vnx-basic.ipynb exactly. I ran into the same problem: !ping -c 5 $alveo_ipaddr didn't work, and the program then waited at s2mm_wh.wait() indefinitely, so I had to press Ctrl+C. I'm not sure whether this belongs in a separate issue, but I'll post it here for now.

Since, in my setup, each node has multiple interfaces connected to a switch (one 100G NIC, one 40G NIC, and two 100Gb FPGA ports), I also tried !ping -I <name_of_100gNIC_interface> -c 5 $alveo_ipaddr (to avoid using the 40G NIC), but again with no success. In my experiments, I set the IP address of the 100G NIC with ifconfig <name_of_100gNIC_interface> <ip address> netmask 255.255.255.0 up and made sure the FPGA and the 100G NIC are in the same subnet. I would appreciate any comments. I fully understand your time is limited, and it's fine if you can't get to this.

Many thanks! P.S. While it might not be very relevant, I have tested the benchmark design with my setup and it works.

mariodruiz commented 2 years ago

@pouya-haghi,

Please open a separate issue and answer these questions:

  1. Is the link detected on the FPGA?
  2. Did you try running ARP discovery on the FPGA and checking the ARP table? (A sketch of checks 1 and 2 follows below.)
  3. Did you try using arping?

Mario
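
For reference, checks 1 and 2 might look roughly like this from the notebook, assuming the helper methods in this repo's vnx_utils (the kernel instance names cmac_0/networklayer_0 and the exact method names may differ; see vnx-basic.ipynb). Check 3 (arping) runs from the NIC host's shell, not from Python.

```python
# Rough sketch of checks 1 and 2, assuming the vnx_utils helpers used in
# vnx-basic.ipynb; kernel instance names may differ in your xclbin.
print(ol.cmac_0.link_status())     # 1. is the CMAC link up?
ol.networklayer_0.arp_discovery()  # 2. have the FPGA broadcast ARP requests...
ol.networklayer_0.get_arp_table()  #    ...then inspect its ARP table
```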

pouya-haghi commented 2 years ago

Thank you Mario, sure, let me create a new issue.

pouya-haghi commented 1 year ago

Hi @mariodruiz

After resolving issue #94 (installing the 100G NIC driver), I was able to make a two-node system (with the FPGA and the 100G NIC in separate servers) work with the basic design, without having to use Dask. I'm going to close this issue. Thank you for your help!