Xilinx / ROCm-air-platforms

A POC platform and example for an experimental ROCm runtime release for the AMD AI Engine
MIT License

Possible obsolete driver version when installing VCK5000 platform with recent linux kernel #12

Open jbelot opened 3 months ago

jbelot commented 3 months ago

Hi,

As mentioned in this issue, I am having trouble building the drivers for the VCK5000 platform. I ran into several problems, so to be as thorough as possible, I will detail all the error messages I get and the workarounds I used to "fix" them.

My guess is that the driver is no longer compatible with more recent Linux kernels (mine is 6.5.0-41-generic).

First issue

When I run make in the driver directory, I get the following error:

make

make -C /usr/src/linux-headers-`uname -r` M=$PWD
make[1]: Entering directory '/usr/src/linux-headers-6.5.0-41-generic'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  CC [M]  ~/ROCm-air-platforms/driver/amdair_chardev.o
In file included from ./include/linux/linkage.h:7,
                 from ./arch/x86/include/asm/cache.h:5,
                 from ./include/linux/cache.h:6,
                 from ./include/linux/time.h:5,
                 from ./include/linux/compat.h:10,
                 from ~/ROCm-air-platforms/driver/amdair_chardev.c:4:
~/ROCm-air-platforms/driver/amdair_chardev.c: In function ‘amdair_chardev_init’:
./include/linux/export.h:29:22: error: passing argument 1 of ‘class_create’ from incompatible pointer type [-Werror=incompatible-pointer-types]
   29 | #define THIS_MODULE (&__this_module)
      |                     ~^~~~~~~~~~~~~~~
      |                      |
      |                      struct module *
~/ROCm-air-platforms/driver/amdair_chardev.c:208:37: note: in expansion of macro ‘THIS_MODULE’
  208 |         amdair_class = class_create(THIS_MODULE, amdair_dev_name());
      |                                     ^~~~~~~~~~~
In file included from ./include/linux/device.h:31,
                 from ~/ROCm-air-platforms/driver/amdair_chardev.c:5:
./include/linux/device/class.h:230:54: note: expected ‘const char *’ but argument is of type ‘struct module *’
  230 | struct class * __must_check class_create(const char *name);
      |                                          ~~~~~~~~~~~~^~~~
~/ROCm-air-platforms/driver/amdair_chardev.c:208:24: error: too many arguments to function ‘class_create’
  208 |         amdair_class = class_create(THIS_MODULE, amdair_dev_name());
      |                        ^~~~~~~~~~~~
./include/linux/device/class.h:230:29: note: declared here
  230 | struct class * __must_check class_create(const char *name);
      |                             ^~~~~~~~~~~~
....
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:251: ~/ROCm-air-platforms/driver/amdair_chardev.o] Error 1
make[2]: *** [/usr/src/linux-headers-6.5.0-41-generic/Makefile:2039: ~/ROCm-air-platforms/driver] Error 2
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-6.5.0-41-generic'
make: *** [Makefile:14: default] Error 2

First workaround

The error seems to come from the THIS_MODULE argument on line 208 of amdair_chardev.c, so I removed it, changing the line from:

    amdair_class = class_create(THIS_MODULE, amdair_dev_name());

to:

    amdair_class = class_create(amdair_dev_name());
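
If the driver also has to keep building on older kernels, the same change can be guarded by a kernel version check instead of editing the call outright. The sketch below assumes the one-argument class_create() arrived in mainline 6.4, which agrees with the 6.5 header shown in the log above but should be checked against your own headers (distribution kernels sometimes backport API changes). amdair_class_create() is a hypothetical helper name, not something that exists in the driver, and the fragment only builds inside a kernel module tree.

```c
#include <linux/device.h>
#include <linux/module.h>
#include <linux/version.h>

/*
 * Hypothetical compatibility helper (not part of the driver).
 * Assumption: kernels >= 6.4 declare class_create(const char *name);
 * older kernels also take the owning module as the first argument.
 */
static inline struct class *amdair_class_create(const char *name)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 4, 0)
	return class_create(name);
#else
	return class_create(THIS_MODULE, name);
#endif
}
```

With such a helper, line 208 would read amdair_class = amdair_class_create(amdair_dev_name()); on either kernel.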

Now the compilation seems to succeed, although with these warnings (which were also present before, but hidden in the ....):

make -C /usr/src/linux-headers-`uname -r` M=$PWD
make[1]: Entering directory '/usr/src/linux-headers-6.5.0-41-generic'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  CC [M]  ~/ROCm-air-platforms/driver/amdair_chardev.o
~/ROCm-air-platforms/driver/amdair_chardev.c: In function ‘address_store’:
~/ROCm-air-platforms/driver/amdair_chardev.c:651:9: warning: ignoring return value of ‘kstrtoul’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
  651 |         kstrtoul(buf, 0, &address);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
~/ROCm-air-platforms/driver/amdair_chardev.c: In function ‘value_store’:
~/ROCm-air-platforms/driver/amdair_chardev.c:686:17: warning: ignoring return value of ‘kstrtouint’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
  686 |                 kstrtouint(buf, 0, (uint32_t*)(&arg[1]));
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~/ROCm-air-platforms/driver/amdair_chardev.c: In function ‘create_aie_mem_sysfs’:
~/ROCm-air-platforms/driver/amdair_chardev.c:734:9: warning: ignoring return value of ‘sysfs_create_groups’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
  734 |         sysfs_create_groups(&priv->kobj_aie, aie_sysfs_groups);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  LD [M]  ~/ROCm-air-platforms/driver/amdair.o
  MODPOST ~/ROCm-air-platforms/driver/Module.symvers
  LD [M]  ~/ROCm-air-platforms/driver/amdair.ko
  BTF [M] ~/ROCm-air-platforms/driver/amdair.ko
Skipping BTF generation for ~/ROCm-air-platforms/driver/amdair.ko due to unavailability of vmlinux
make[1]: Leaving directory '/usr/src/linux-headers-6.5.0-41-generic'
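
The three -Wunused-result warnings are separate from the build failure: each one flags a call whose error code is discarded. The real bodies of address_store() and value_store() are not shown in this thread, so the following is only a sketch of the usual pattern on a made-up handler (example_address_store() and example_address are hypothetical names), assuming a kobj_attribute-style store callback:

```c
#include <linux/kernel.h>	/* kstrtoul() */
#include <linux/kobject.h>	/* struct kobject, struct kobj_attribute */
#include <linux/sysfs.h>

/* Hypothetical storage for the parsed value. */
static unsigned long example_address;

/*
 * Hypothetical sysfs store handler. kstrtoul() returns 0 on success or a
 * negative errno (for example -EINVAL or -ERANGE), so the error is passed
 * back to the writer instead of being ignored.
 */
static ssize_t example_address_store(struct kobject *kobj,
				     struct kobj_attribute *attr,
				     const char *buf, size_t count)
{
	unsigned long address;
	int ret;

	ret = kstrtoul(buf, 0, &address);	/* base 0: accept 0x.. and decimal */
	if (ret)
		return ret;

	example_address = address;
	return count;	/* report the whole buffer as consumed */
}
```

Checking these return values silences the warnings and rejects malformed input, but it is not expected to change whether the module loads.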

Second issue

But now, loading the driver with sudo insmod amdair.ko does not seem to do anything: /dev/amdair is not created. Note that sudo dmesg | grep amdair does not print anything either.
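
One way to narrow down this kind of silent failure is to check separately whether the module is loaded and whether the device node exists. The sketch below does that in userspace C. The function names are made up for illustration; only the paths /proc/modules and /dev/amdair come from the system and from this thread.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Returns 1 if `name` is the module name at the start of any line of a
 * /proc/modules-style stream, 0 otherwise.
 */
int module_listed(FILE *modules, const char *name)
{
	char line[512];
	size_t len = strlen(name);
	int at_line_start = 1;

	while (fgets(line, sizeof line, modules)) {
		/* Each entry starts with the module name followed by a space. */
		if (at_line_start && strncmp(line, name, len) == 0 &&
		    line[len] == ' ')
			return 1;
		/* A chunk without '\n' means the same line continues. */
		at_line_start = strchr(line, '\n') != NULL;
	}
	return 0;
}

/* Returns 1 if `path` exists and is a character device node. */
int is_char_device(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISCHR(st.st_mode);
}
```

Calling module_listed() on an opened /proc/modules with "amdair", and is_char_device("/dev/amdair"), after insmod separates the two cases. A listed module with no device node would suggest that the module loaded but never created the node, for instance because it found no card to bind to, though that is a guess without the driver source at hand.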

Second workaround

I noticed the message Skipping BTF generation for ~/ROCm-air-platforms/driver/amdair.ko due to unavailability of vmlinux when building the driver, so I tried to fix it following this Ubuntu forum.

Now the build succeeds without this warning, but I still have the same problem as in the second issue.

Are my workarounds relevant? Do you have any idea how to fix my problem?

Thank you :)

muwyse-amd commented 3 months ago

Hi, thanks for the questions and for giving the platform a try!

> My guess is that the driver is no longer compatible with more recent Linux kernels (mine is 6.5.0-41-generic).

Correct. We developed the driver and platform using kernel 5.4; at least that is the version on my machine. The kernel header APIs changed between 5.4 and 6.5, which is the root cause of the first issue. If I recall correctly, your proposed "First workaround" was sufficient in my testing to compile, load, and run the weather stencil test on my machine with kernel 6.4 (with the same warn_unused_result warnings present).

I have not encountered your second issue; loading the kernel module worked for me on kernel 6.4 after the first fix. To confirm, did you program the VCK5000 (and warm reboot) and then check that the card was visible (lspci -vd 10ee:) before loading the driver?
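
The lspci -vd 10ee: check filters on PCI vendor ID 10ee (Xilinx). For scripting the same visibility check, a rough equivalent is to read the vendor files under /sys/bus/pci/devices. The sketch below assumes the standard Linux sysfs layout, and both function names are hypothetical.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parses the contents of a sysfs "vendor" file such as "0x10ee\n".
 * Returns 1 when the parsed ID equals `vendor`, 0 otherwise.
 */
int vendor_matches(const char *text, unsigned long vendor)
{
	char *end;
	unsigned long id = strtoul(text, &end, 16);	/* accepts a 0x prefix */

	return end != text && id == vendor;
}

/*
 * Counts the PCI functions under `root` (normally /sys/bus/pci/devices)
 * whose vendor ID equals `vendor`. Returns -1 if `root` cannot be opened.
 */
int count_pci_vendor(const char *root, unsigned long vendor)
{
	DIR *dir = opendir(root);
	struct dirent *entry;
	int count = 0;

	if (!dir)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		char path[512], text[32];
		FILE *f;

		if (entry->d_name[0] == '.')	/* skip "." and ".." */
			continue;
		snprintf(path, sizeof path, "%s/%s/vendor", root, entry->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(text, sizeof text, f) && vendor_matches(text, vendor))
			count++;
		fclose(f);
	}
	closedir(dir);
	return count;
}
```

If count_pci_vendor("/sys/bus/pci/devices", 0x10ee) returns 0, no Xilinx function is visible on the bus, and loading the driver is unlikely to help until the card has been programmed and the machine warm rebooted.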

eddierichter-amd commented 3 months ago

Hi @jbelot, were you able to resolve this issue?

Also, because you mentioned it in your other issue on mlir-aie, we just added the ability to connect PL components directly to the AIEs via the PLIO in this commit. We provided some documentation, but this feature is very fresh. We would love to get some initial feedback on the documentation and the functionality if that is something you are interested in.

jbelot commented 3 months ago

Hi @muwyse-amd, @eddierichter-amd, and thank you for your answers.

I managed to get the weather stencil test to work (but only once!). The problem came from the fact that the card was not correctly programmed before loading the drivers (I hadn't run a warm reboot). I then tried the vector vector add example, but without success, and since then I've been unable to get it working properly again.

I have to admit that I quickly abandoned this flow since my team works mainly with XRT drivers and it's not viable to have to reboot the server every time you switch from a “standard” project to MLIR AIE.

So I was more interested in the flow proposed by https://github.com/nqdtan/vck5000_vivado_ulp and modified it a little to insert .elf files generated by MLIR AIE. I have something that works on a very simple example, but I was wondering how to use PLIOs in this case.

Your recent changes seem particularly interesting, but I guess I'll have to reprogram the board to allow the use of PLIOs.