geerlingguy / raspberry-pi-pcie-devices

Raspberry Pi PCI Express device compatibility database
http://pipci.jeffgeerling.com
GNU General Public License v3.0

Test Google Coral TPU M.2 Accelerator A+E key #44

Open geerlingguy opened 3 years ago

geerlingguy commented 3 years ago

I just bought a Coral M.2 Accelerator A+E key after seeing a lot of buzz about this little 'IoT' TensorFlow-compatible AI accelerator.


I also just received an M.2 A key to PCIe 1x slot adapter card (see https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/38), and I will pop the Coral board into there.

I haven't done much with AI/ML, but apparently one of the big holdups for using this board with the Pi may be overcome soon—see: https://github.com/google-coral/edgetpu/issues/280

(And I don't want to step on @timonsku's toes here either—he was the original inspiration for me getting this particular card, after seeing his Piunora; just figured now is as good a time as ever to take a quick stab at TensorFlow.)

See related, in the Pi Forums: https://www.raspberrypi.org/forums/viewtopic.php?p=1772610&sid=4833ac3f714618282207affca2bcd846#p1772610

And the patch advertising MSI-X support in the Pi Kernel (currently only on 5.10.y branch): https://github.com/raspberrypi/linux/commit/6bf63f7711b550de8c803a4c4ad792ecfbe721df

(Note that it may be incorporated into Ubuntu for Pi too: https://twitter.com/m_wimpress/status/1345077692568367105)

r3po-1s-Tr3e commented 7 months ago

Blog post with full summary: A PCIe Coral TPU FINALLY works on Raspberry Pi 5.

@geerlingguy, amazing guide. But we have been facing a problem: we are not able to install the PCIe gasket drivers after changing the device tree settings. The Pi is not able to update its kernel headers after the device tree settings are changed. I have tried downgrading Raspberry Pi OS from kernel 6.6 to 6.1, but it still doesn't work. If I don't change the device tree settings, the gasket drivers install correctly, but then the image classification sample throws an error saying it can't access /dev/apex_0.

Hardware: Raspberry Pi 5 4GB, NVMe Base for Raspberry Pi, Coral M.2 TPU B+M key

Any insight will be appreciated, thanks!

mikegapinski commented 7 months ago

@r3po-1s-Tr3e follow this thread, it has all the info for Coral. I have not tested the B+M recently, but A+E works OK, so this one should too:

https://gist.github.com/dataslayermedia/714ec5a9601249d9ee754919dea49c7e?permalink_comment_id=4989560#gistcomment-4989560

r3po-1s-Tr3e commented 7 months ago

@mikegapinski It worked! I changed a few things; if anyone is interested, here are the details:

Hardware:

Raspberry Pi 5 4GB
Coral TPU M.2 Accelerator B+M key
NVMe Base PCIe extension HAT for Raspberry Pi 5

OS: 6.1.0-rpi4-rpi

Procedure I followed (I omitted a few steps from Jeff Geerling's guide and changed the sequence of actions):

echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update

sudo apt-get install gasket-dkms libedgetpu1-std

sudo sh -c "echo 'SUBSYSTEM==\"apex\", MODE=\"0660\", GROUP=\"apex\"' >> /etc/udev/rules.d/65-apex.rules"
sudo groupadd apex
sudo adduser $USER apex
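The udev rule above sets the device node to mode 0660 with group apex, so only root and members of that group can open it. A minimal Python sketch of that permission logic follows, for checking whether the current login session actually has access. The function names `has_rw_access` and `check_device` are my own, and the check only looks at the classic owner/group/other mode bits (it ignores ACLs and anything the driver itself enforces).

```python
import os
import stat


def has_rw_access(st_mode, st_uid, st_gid, uid, gids):
    """Return True if the Unix mode bits grant read+write to this user."""
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == st_uid:
        wanted = stat.S_IRUSR | stat.S_IWUSR
    elif st_gid in gids:
        wanted = stat.S_IRGRP | stat.S_IWGRP
    else:
        wanted = stat.S_IROTH | stat.S_IWOTH
    return (st_mode & wanted) == wanted


def check_device(path="/dev/apex_0"):
    """Apply has_rw_access to a real device node for the current process."""
    info = os.stat(path)
    # Groups of the running process; a freshly added group only shows up
    # here after logging in again (or rebooting).
    gids = set(os.getgroups()) | {os.getegid()}
    return has_rw_access(info.st_mode, info.st_uid, info.st_gid, os.geteuid(), gids)


if __name__ == "__main__":
    try:
        print("read/write access:", check_device())
    except OSError as exc:
        print("could not inspect /dev/apex_0:", exc)
```

Because group membership is read when a session starts, running this before and after the reboot is one way to see whether the `adduser` step has taken effect.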

Now REBOOT. After the reboot, check if the TPU is detected:

lspci -nn | grep 089a
#Output: 03:00.0 System peripheral: Device 1ac1:089a

Check if the PCIe driver is loaded:

ls /dev/apex_0
#Output: /dev/apex_0
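The two checks above can also be scripted. This is a minimal Python sketch, assuming `lspci` is installed and that the device reports the `1ac1:089a` ID quoted in the output above; the helper names `find_coral_slots` and `apex_nodes` are hypothetical.

```python
import glob
import subprocess

CORAL_PCI_ID = "1ac1:089a"  # vendor:device ID taken from the lspci output above


def find_coral_slots(lspci_output):
    """Return the PCI slot address of every line that mentions the Coral ID."""
    slots = []
    for line in lspci_output.splitlines():
        if CORAL_PCI_ID in line.lower():
            slots.append(line.split()[0])
    return slots


def apex_nodes():
    """Return the /dev/apex_* nodes created by the apex driver, if any."""
    return sorted(glob.glob("/dev/apex_*"))


if __name__ == "__main__":
    try:
        output = subprocess.run(
            ["lspci", "-nn"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        output = ""  # lspci missing or failed; treat as "nothing found"
    print("PCI slots:", find_coral_slots(output))
    print("device nodes:", apex_nodes())
```

A slot with no matching device node would suggest the card is visible on the bus but the driver has not bound to it, which is the distinction the two manual checks are making.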

If I had tried to follow these steps after changing the device tree, I would not have been able to install the gasket and PCIe drivers, because the headers could not be installed. Further steps:

echo "kernel=kernel8.img" | sudo tee -a /boot/config.txt

I omitted the other config changes mentioned in the guide, then changed the device tree settings: https://www.jeffgeerling.com/blog/2023/how-customize-dtb-device-tree-binary-on-raspberry-pi

(NOTE: If you are on kernel version 6.6, which is currently the latest, change the msi-parent setting to 0x67 instead of 0x66. Source: https://gist.github.com/dataslayermedia/714ec5a9601249d9ee754919dea49c7e?permalink_comment_id=4989560#gistcomment-4989560)

Installed Docker:

curl -sSL https://get.docker.com | sh

Then followed the steps from https://www.jeffgeerling.com/blog/2023/testing-coral-tpu-accelerator-m2-or-pcie-docker to create and run the Docker image.

Tested by running the TPU's sample image classifier script.

zoldaten commented 3 months ago

In my case I changed the msi-parent setting to 0x67 instead of 0x6e. Linux raspberrypi 6.6.31+rpt-rpi-v8

https://gist.github.com/dataslayermedia/714ec5a9601249d9ee754919dea49c7e?permalink_comment_id=4989560
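Since the msi-parent values quoted in this thread differ (0x66, 0x67, 0x6e), it may help to look at what a given device tree actually contains before editing it. Here is a minimal Python sketch, assuming the .dtb has already been decompiled to text (for example with `dtc -I dtb -O dts`); the function name `list_msi_parents` and the sample node are illustrative only and are not taken from a real Pi 5 device tree.

```python
import re

# Matches properties such as: msi-parent = <0x66>;
MSI_PARENT_RE = re.compile(r"msi-parent\s*=\s*<\s*(0x[0-9a-fA-F]+)\s*>\s*;")


def list_msi_parents(dts_text):
    """Return (line_number, value) for every msi-parent property found."""
    found = []
    for number, line in enumerate(dts_text.splitlines(), start=1):
        match = MSI_PARENT_RE.search(line)
        if match:
            found.append((number, match.group(1).lower()))
    return found


if __name__ == "__main__":
    # Hypothetical fragment, only to show the shape of the input.
    sample = """
    pcie@110000 {
        msi-parent = <0x2f>;
    };
    """
    for number, value in list_msi_parents(sample):
        print(f"line {number}: msi-parent = <{value}>")
```

The sketch only lists the values; which one to write back still has to come from the linked guide and gist comment for the kernel in use.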

JeffreyPeacock commented 2 months ago

Please forgive my terseness. Just the facts ma'am:

Raspberry Pi 5 8GB
Coral Dual-Edge TPU
52PI EP-0223 HAT
Linux pi5 6.6.47+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux

root@pi5:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:    12
Codename:   bookworm

root@pi5:~# tail /boot/firmware/config.txt 
otg_mode=1

[cm5]
dtoverlay=dwc2,dr_mode=host

[all]
dtparam=pciex1
dtparam=pciex1_gen=2

# kernel=kernel8.img
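In the config.txt tail above, the kernel=kernel8.img line is commented out, which matches the -2712 kernel reported by uname in this comment. A minimal Python sketch for reading that state from a config.txt follows. It is a simplification: it treats each `[filter]` header as a plain section name and assumes lines before any header apply to all boards; the function names are my own.

```python
def active_lines(config_text):
    """Return (section, setting) pairs, skipping blank lines and # comments."""
    section = "all"  # assumption: settings before any [filter] apply to all boards
    result = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        result.append((section, line))
    return result


def kernel_override(config_text):
    """Return the last uncommented kernel= value under [all], or None."""
    value = None
    for section, line in active_lines(config_text):
        if section == "all" and line.startswith("kernel="):
            value = line.split("=", 1)[1]
    return value


if __name__ == "__main__":
    # Path used in the comment above; older images used /boot/config.txt.
    try:
        with open("/boot/firmware/config.txt") as handle:
            text = handle.read()
    except OSError:
        text = ""
    print("kernel override:", kernel_override(text))
    print("PCIe settings:", [s for _, s in active_lines(text) if "pciex1" in s])
```

A result of `None` means no kernel override is active, so the firmware picks its default kernel image.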

I can see the device:

root@pi5:~# uname -a
Linux pi5 6.6.47+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux

root@pi5:~# lsmod | egrep "dkms|apex"
apex                   49152  0
gasket                114688  1 apex

root@pi5:~# lspci -v

0000:01:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
    Subsystem: Global Unichip Corp. Coral Edge TPU
    Flags: fast devsel
    Memory at 1800100000 (64-bit, prefetchable) [disabled] [size=16K]
    Memory at 1800000000 (64-bit, prefetchable) [disabled] [size=1M]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
    Capabilities: [108] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [200] Advanced Error Reporting

root@pi5:~# ll /dev/apex_0 
crw-rw---- 1 root apex 120, 0 Sep  8 19:40 /dev/apex_0

root@pi5:~# ll /usr/lib/aarch64-linux-gnu/libedgetpu*
lrwxrwxrwx 1 root root      17 Jul  9  2021 /usr/lib/aarch64-linux-gnu/libedgetpu.so.1 -> libedgetpu.so.1.0
-rwxr-xr-x 1 root root 1135952 Jul  9  2021 /usr/lib/aarch64-linux-gnu/libedgetpu.so.1.0

However, I normally get this:

jeffp@pi5:~/Workspace/AI/Coral/pycoral $ python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg
I tflite/edgetpu_manager_direct.cc:453] No matching device is already opened for shared ownership.
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[2]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x18d1, product:0x9302
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[2]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I tflite/edgetpu_context_direct.cc:106] USB always DFU: False (default)
I tflite/edgetpu_context_direct.cc:128] USB bulk-in queue capacity: default
I tflite/edgetpu_context_direct.cc:67] Performance expectation: Max (default)
I ./driver/mmio/host_queue.h:266] Starting in normal mode
I driver/kernel/kernel_registers.cc:83] Opening /dev/apex_0. read_only=0
I tflite/edgetpu_context_direct.cc:401] Failed to open device [Apex (PCIe)] at [/dev/apex_0]: Failed precondition: Device open failed : -1 (Connection timed out)
Traceback (most recent call last):
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jeffp/Workspace/AI/Coral/pycoral/examples/classify_image.py", line 124, in <module>
    main()
  File "/home/jeffp/Workspace/AI/Coral/pycoral/examples/classify_image.py", line 73, in main
    interpreter = make_interpreter(*args.model.split('@'))
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pycoral/utils/edgetpu.py", line 87, in make_interpreter
    delegates = [load_edgetpu_delegate({'device': device} if device else {})]
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pycoral/utils/edgetpu.py", line 52, in load_edgetpu_delegate
    return tflite.load_delegate(_EDGETPU_SHARED_LIB, options or {})
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1

And in syslog:

2024-09-08T19:32:42.970894-07:00 pi5 kernel: [   88.357226] apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)
2024-09-08T19:32:42.970922-07:00 pi5 kernel: [   88.357233] apex 0000:01:00.0: Error in device open cb: -110

But I tried @geerlingguy's instructions:

  1. kernel=kernel8.img

  2. Rebooting. I can see the device:

    # after kernel=kernel8.img
    Linux pi5 6.6.47+rpt-rpi-v8   #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux

  3. Modifying bcm2712-rpi-5-b.dtb (as per the link and the updated comment for 0x67, not 0x66)

  4. dpkg-reconfigure dkms # does nothing

  5. dpkg-reconfigure gasket-dkms. (@geerlingguy, note: the following step from the guide does not work on later versions, because nothing is in the /var/lib/initramfs-tools dir: "Then I rebuilt the DKMS using the one-liner: ls /var/lib/initramfs-tools | sudo xargs -n1 /usr/lib/dkms/dkms_autoinstaller start (without this, I don't have the kernel module for apex available).")

  6. reboot

And the device is no longer seen.
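DKMS builds modules separately for each kernel release, so after switching from the -2712 kernel to the -v8 kernel the gasket and apex modules have to exist under the new release's module directory. A minimal Python sketch for checking that follows. It searches the whole tree for the running release rather than assuming a particular DKMS subdirectory, and the function name `find_module_files` is hypothetical.

```python
import os
import platform


def find_module_files(modules_root, release, name):
    """Return paths under <modules_root>/<release> named <name>.ko or <name>.ko.<suffix>."""
    base = os.path.join(modules_root, release)
    hits = []
    # os.walk yields nothing if the directory does not exist.
    for dirpath, _dirnames, filenames in os.walk(base):
        for filename in filenames:
            if filename == f"{name}.ko" or filename.startswith(f"{name}.ko."):
                hits.append(os.path.join(dirpath, filename))
    return sorted(hits)


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    for module in ("gasket", "apex"):
        paths = find_module_files("/lib/modules", release, module)
        print(f"{module} for {release}: {paths or 'not found'}")
```

If the modules show up only under the other release's directory, that would point at a rebuild for the current kernel being needed rather than a hardware problem.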

Back that all out and do:

  1. reboot # no device
  2. dpkg-reconfigure gasket-dkms
  3. reboot

And now I get:

jeffp@pi5:~/Workspace/AI/Coral/pycoral $ python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg
I tflite/edgetpu_manager_direct.cc:453] No matching device is already opened for shared ownership.
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[2]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x18d1, product:0x9302
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[2]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I tflite/edgetpu_context_direct.cc:106] USB always DFU: False (default)
I tflite/edgetpu_context_direct.cc:128] USB bulk-in queue capacity: default
I tflite/edgetpu_context_direct.cc:67] Performance expectation: Max (default)
I ./driver/mmio/host_queue.h:266] Starting in normal mode
I driver/kernel/kernel_registers.cc:83] Opening /dev/apex_0. read_only=0
I driver/kernel/kernel_registers.cc:97] mmap_offset=0x0000000000040000, mmap_size=4096
I tflite/edgetpu_context_direct.cc:401] Failed to open device [Apex (PCIe)] at [/dev/apex_0]: Internal: Could not mmap: Operation not permitted
Traceback (most recent call last):
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jeffp/Workspace/AI/Coral/pycoral/examples/classify_image.py", line 124, in <module>
    main()
  File "/home/jeffp/Workspace/AI/Coral/pycoral/examples/classify_image.py", line 73, in main
    interpreter = make_interpreter(*args.model.split('@'))
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pycoral/utils/edgetpu.py", line 87, in make_interpreter
    delegates = [load_edgetpu_delegate({'device': device} if device else {})]
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pycoral/utils/edgetpu.py", line 52, in load_edgetpu_delegate
    return tflite.load_delegate(_EDGETPU_SHARED_LIB, options or {})
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1

And in syslog:

2024-09-08T19:42:45.548846-07:00 pi5 kernel: [  112.693475] apex 0000:01:00.0: Couldn't reinit interrupts: -28
2024-09-08T19:42:45.548869-07:00 pi5 kernel: [  112.693497] apex 0000:01:00.0: Permission checking failed.

But all the permissions are the same and the user id is in the apex group.
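The negative numbers in the two syslog excerpts are kernel errno values. On Linux, 110 is ETIMEDOUT (which matches the "Connection timed out" text in the first Python log) and 28 is ENOSPC. Below is a minimal Python sketch for pulling those codes out of apex log lines and naming them; the regular expressions and function names are my own and are based only on the lines quoted in this comment.

```python
import errno
import os
import re

# Matches kernel log text such as:
#   apex 0000:01:00.0: Error in device open cb: -110
APEX_RE = re.compile(r"apex (?P<slot>[0-9a-f:.]+): (?P<message>.*)$")
CODE_RE = re.compile(r"-(\d+)\s*$")  # trailing negative errno, if present


def parse_apex_line(line):
    """Return (slot, message, errno_or_None) for an apex log line, else None."""
    match = APEX_RE.search(line)
    if not match:
        return None
    code = CODE_RE.search(match.group("message"))
    return (
        match.group("slot"),
        match.group("message"),
        int(code.group(1)) if code else None,
    )


def describe(code):
    """Name a positive errno value using this platform's tables."""
    return f"{errno.errorcode.get(code, 'unknown')} ({os.strerror(code)})"


if __name__ == "__main__":
    lines = [
        "[   88.357233] apex 0000:01:00.0: Error in device open cb: -110",
        "[  112.693475] apex 0000:01:00.0: Couldn't reinit interrupts: -28",
    ]
    for line in lines:
        slot, message, code = parse_apex_line(line)
        print(slot, "|", message, "|", describe(code) if code else "no code")
```

One possible reading, not a confirmed diagnosis: ENOSPC on "reinit interrupts" is not a file permission error, so the "Permission checking failed" message may be a consequence of the interrupt setup failing rather than of the group membership on /dev/apex_0.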

So, did I go forward or backwards?

Any help is appreciated. Thanks in advance.

JeffreyPeacock commented 2 months ago

No, I fixed the kernel8.img rebuild errors you're going to find above: I used dpkg-reconfigure linux-image-6.6.47+rpt-rpi-v8. I also updated the device tree as per the posts. Then I could see the device, but I am still stuck at:

jeffp@pi5:~/Workspace/AI/Coral/pycoral $ python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg
I tflite/edgetpu_manager_direct.cc:453] No matching device is already opened for shared ownership.
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[2]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I driver/usb/local_usb_device.cc:944] EnumerateDevices: vendor:0x18d1, product:0x9302
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[4] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[2]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[3] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[2] port[0]
I driver/usb/local_usb_device.cc:979] EnumerateDevices: checking bus[1] port[0]
I tflite/edgetpu_context_direct.cc:106] USB always DFU: False (default)
I tflite/edgetpu_context_direct.cc:128] USB bulk-in queue capacity: default
I tflite/edgetpu_context_direct.cc:67] Performance expectation: Max (default)
I ./driver/mmio/host_queue.h:266] Starting in normal mode
I driver/kernel/kernel_registers.cc:83] Opening /dev/apex_0. read_only=0
I tflite/edgetpu_context_direct.cc:401] Failed to open device [Apex (PCIe)] at [/dev/apex_0]: Failed precondition: Device open failed : -1 (Connection timed out)
Traceback (most recent call last):
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jeffp/Workspace/AI/Coral/pycoral/examples/classify_image.py", line 124, in <module>
    main()
  File "/home/jeffp/Workspace/AI/Coral/pycoral/examples/classify_image.py", line 73, in main
    interpreter = make_interpreter(*args.model.split('@'))
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pycoral/utils/edgetpu.py", line 87, in make_interpreter
    delegates = [load_edgetpu_delegate({'device': device} if device else {})]
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pycoral/utils/edgetpu.py", line 52, in load_edgetpu_delegate
    return tflite.load_delegate(_EDGETPU_SHARED_LIB, options or {})
  File "/home/jeffp/.pyenv/versions/3.9.18/lib/python3.9/site-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1

And syslog says:


2024-09-08T20:38:43.965306-07:00 pi5 kernel: [   81.724011] apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)
2024-09-08T20:38:43.965331-07:00 pi5 kernel: [   81.724020] apex 0000:01:00.0: Error in device open cb: -110