aws / aws-fpga

Official repository of the AWS EC2 FPGA Hardware and Software Development Kit

udev rule not running after loading fpga image #560

Open cmoore1776 opened 2 years ago

cmoore1776 commented 2 years ago

Summary

The udev rule created by add_udev_rules.sh does not match the device ID used after loading an fpga image.

The rule, which is deployed to /etc/udev/rules.d/9999-presistent-fpga.rules, only matches on:

ATTR{device}=="0x1041"
ATTR{device}=="0x1042"

but it needs to also match on:

ATTR{device}=="0xf001"
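One way to confirm the mismatch is to search the rules file for the device ID that the FPGA actually reports. The sketch below is only an illustration: `rules_cover_device` is a hypothetical helper name, and the match is a plain case-sensitive string search rather than a full parse of udev rule syntax.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the aws-fpga SDK): succeed if the given
# udev rules file contains an ATTR{device} match for the given device ID.
rules_cover_device() {
    local rules_file="$1" device_id="$2"
    # -F: fixed-string search, so the braces and quotes are taken literally.
    # The search is case-sensitive, matching the lowercase form sysfs reports.
    grep -qF "ATTR{device}==\"${device_id}\"" "$rules_file"
}

# Example (path taken from the issue text):
#   rules_cover_device /etc/udev/rules.d/9999-presistent-fpga.rules 0xf001 \
#       && echo "covered" || echo "not covered"
```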

Reproduction steps

  1. Launch an F1 instance on the latest AL2 FPGA Developer AMI
  2. Deploy the aws-fpga SDK
  3. Load an fpga image, e.g.
    fpga-load-local-image -S 0 -I agfi-xxxxxSOMExIDxxxxx
  4. Note the permissions at /sys/devices/pci0000:00/0000:00:1d.0/resource* are 444
$ ls -lah /sys/devices/pci0000\:00/0000\:00\:1d.0/resource*
-r--r--r-- 1 root root 4.0K Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource
-r--r--r-- 1 root root  32M Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource0
-r--r--r-- 1 root root 2.0M Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource1
-r--r--r-- 1 root root  64K Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource2
-r--r--r-- 1 root root  64K Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource2_wc
-r--r--r-- 1 root root 128G Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource4
-r--r--r-- 1 root root 128G Apr 27 16:14 /sys/devices/pci0000:00/0000:00:1d.0/resource4_wc

Also note the device ID after loading the image:

$ sudo udevadm info -a -p /devices/pci0000:00/0000:00:1d.0 | grep "ATTR{device}"
ATTR{device}=="0xf001"
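The same IDs can also be read directly from sysfs, which may be handy when comparing several slots at once. This is a rough sketch: `list_vendor_devices` is a hypothetical name, and the second parameter exists only so the function can be pointed at a scratch directory instead of the real `/sys/bus/pci/devices`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<pci address> <device id>" for every PCI
# function whose vendor ID matches (default 0x1d0f, the vendor ID used
# in the rules from this issue).
list_vendor_devices() {
    local vendor="${1:-0x1d0f}" sysfs_root="${2:-/sys/bus/pci/devices}"
    local dev
    for dev in "$sysfs_root"/*; do
        # Skip entries that do not expose both ID files.
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        if [ "$(cat "$dev/vendor")" = "$vendor" ]; then
            printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/device")"
        fi
    done
}

# Example: list_vendor_devices 0x1d0f
```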

Fix

Add the following two lines to /etc/udev/rules.d/9999-presistent-fpga.rules:

ATTR{vendor}=="0x1d0f", ATTR{device}=="0xf001", RUN+="/opt/aws/bin/change-fpga-perm.sh %k"
ATTR{vendor}=="0x1d0f", ATTR{device}=="0xf001", ACTION=="add", RUN+="/opt/aws/bin/change-fpga-perm.sh %k"
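For anyone scripting this workaround, the two lines above could be appended idempotently along the following lines. This is a sketch under assumptions: `append_fpga_rules` is a hypothetical helper, it takes the rules file path as an argument so it can be tried on a scratch file first, and writing to the real file under /etc/udev/rules.d would need root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the two rules from this issue to a rules file,
# skipping any line that is already present.
append_fpga_rules() {
    local rules_file="$1" rule
    local rules=(
        'ATTR{vendor}=="0x1d0f", ATTR{device}=="0xf001", RUN+="/opt/aws/bin/change-fpga-perm.sh %k"'
        'ATTR{vendor}=="0x1d0f", ATTR{device}=="0xf001", ACTION=="add", RUN+="/opt/aws/bin/change-fpga-perm.sh %k"'
    )
    touch "$rules_file"
    for rule in "${rules[@]}"; do
        # -x: whole-line match, -F: fixed string, so reruns add nothing.
        grep -qxF -- "$rule" "$rules_file" || printf '%s\n' "$rule" >> "$rules_file"
    done
}

# Example: append_fpga_rules /tmp/fpga-test.rules
```

After changing a rules file, `sudo udevadm control --reload-rules` asks udev to re-read its rules before the next image load.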

With these rules in place, the resource files have 666 permissions after loading an image:

$ ls -lah /sys/devices/pci0000\:00/0000\:00\:1d.0/resourc*
-r--r--r-- 1 root root 4.0K May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource
-rw-rw-rw- 1 root root  32M May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource0
-rw-rw-rw- 1 root root 2.0M May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource1
-rw-rw-rw- 1 root root  64K May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource2
-rw-rw-rw- 1 root root  64K May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource2_wc
-rw-rw-rw- 1 root root 128G May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource4
-rw-rw-rw- 1 root root 128G May 18 14:34 /sys/devices/pci0000:00/0000:00:1d.0/resource4_wc
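To compare results across machines, the octal modes are easier to diff than `ls` output. A minimal sketch, assuming GNU `stat` (as found on Amazon Linux and CentOS); `print_resource_modes` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<octal mode> <file name>" for each resource*
# file in a PCI device directory.
print_resource_modes() {
    local dev_dir="$1" f
    for f in "$dev_dir"/resource*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        printf '%s %s\n' "$(stat -c '%a' "$f")" "$(basename "$f")"
    done
}

# Example: print_resource_modes /sys/devices/pci0000:00/0000:00:1d.0
```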

jacobmgn commented 2 years ago

Thanks for reporting this. Regarding reproduction step 3:

fpga-load-local-image -S 0 -I agfi-xxxxxSOMExIDxxxxx

Does the loaded image specify a device ID, as described in https://github.com/aws/aws-fpga/blob/4750aacb4dac9d464b099b27e4337220cf0b0713/hdk/cl/examples/cl_dram_dma_hlx/README.md#create-example-design-gui?

set ::env(device_id) "0xF001"
set ::env(vendor_id) "0x1D0F"
set ::env(subsystem_id) "0x1D51"
set ::env(subsystem_vendor_id) "0xFEDC"

For example, the cl_dram_dma example is configured to use 0xf001.

If so, which device_id is specified?

cmoore1776 commented 2 years ago

> Does the loaded image specify a device ID, as described in https://github.com/aws/aws-fpga/blob/4750aacb4dac9d464b099b27e4337220cf0b0713/hdk/cl/examples/cl_dram_dma_hlx/README.md#create-example-design-gui?

Yes. The 0xf001 value comes from the device_id provided in that example.

jacobmgn commented 2 years ago

I think I understand the issue, so let me rephrase.

When following the steps in the HOW TO, setting a device ID of "0xF001", and then running the udev permission script, the PCIe device does not get the expected permissions.

Therefore

jacobmgn commented 2 years ago

Notes:

jacobmgn commented 2 years ago

Hello @shamelesscookie,

I have been trying to reproduce the issue you described, along with the fix in PR #561. I haven't been able to reproduce the device permissions you list under step 4.

[centos@ip-172-31-83-184 ~]$ ls -lah /sys/devices/pci0000\:00/0000\:00\:1d.0/resource*
-r--r--r-- 1 root root 4.0K Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource
-rw------- 1 root root  32M Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource0
-rw------- 1 root root 2.0M Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource1
-rw------- 1 root root  64K Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource2
-rw------- 1 root root  64K Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource2_wc
-rw------- 1 root root 128G Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource4
-rw------- 1 root root 128G Jun 15 00:47 /sys/devices/pci0000:00/0000:00:1d.0/resource4_wc
[centos@ip-172-31-83-184 ~]$ sudo udevadm info -a -p /devices/pci0000:00/0000:00:1d.0 | grep "ATTR{device}"
    ATTR{device}=="0xf001"

Are you using any environment variables that are not listed in your reproduction steps?

As a note, I have been using the public cl_dram_dma AGFI (agfi-0b5c35827af676702) with a PCI Device ID of 0xF001.

https://github.com/aws/aws-fpga/blob/4750aacb4dac9d464b099b27e4337220cf0b0713/hdk/cl/examples/cl_examples_list.md

AWSjoeluc commented 6 months ago

Hello!

Is there anything AWS can do to help resolve this issue? If it has already been resolved, we'd like to know what the resolution was.

Thank you!