fengggli / comanche

vfio related #8

Open fengggli opened 5 years ago

fengggli commented 5 years ago

Call stack (why does ioctl return -1?)

  1. static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) (https://elixir.bootlin.com/linux/v4.15.18/source/drivers/vfio/vfio_iommu_type1.c#L1552)
  2. vfio_dma_do_map (https://elixir.bootlin.com/linux/v4.15.18/source/drivers/vfio/vfio_iommu_type1.c#L978)
  3. vfio_pin_map_dma (https://elixir.bootlin.com/linux/v4.15.18/source/drivers/vfio/vfio_iommu_type1.c#L935)
fengggli commented 5 years ago

Debugging/tracing vfio

apt-get source linux-source-4.15.0
cd linux-hwe-4.15.0/

Change the kernel vermagic by editing the version fields in the top-level Makefile:

VERSION = 4
PATCHLEVEL = 15
#SUBLEVEL = 18
#EXTRAVERSION =
SUBLEVEL = 0
EXTRAVERSION =-54-generic
NAME = Fearless Coyote

Then prepare the tree and build only the vfio modules:

make oldconfig
make prepare
make modules_prepare
make modules SUBDIRS=drivers/vfio/

Load them with insmod rather than modprobe; I didn't bother copying them into the kernel's modules directory. (I wrapped these commands in linux-hwe-4.15.0/insmod.sh and rmmod.sh.)

  sudo insmod drivers/vfio/vfio.ko
  sudo insmod drivers/vfio/vfio_virqfd.ko 
  sudo insmod drivers/vfio/vfio_iommu_type1.ko
  sudo insmod drivers/vfio/pci/vfio-pci.ko 
fengggli commented 5 years ago
Jun 28 18:57:19 sievert kernel: [18949.461145] feng: VFIO_IOMMU_MAP_DMA is called
Jun 28 18:57:19 sievert kernel: [18949.461156] feng vfio_pinpage_remote failed with npage=-14
Jun 28 18:57:19 sievert kernel: [18949.461159] feng: try vfio_pin_map_dma, but failed with -14
Jun 28 18:57:19 sievert kernel: [18949.461183] feng: VFIO_IOMMU_MAP_DMA is called
Jun 28 18:57:19 sievert kernel: [18949.461186] feng vfio_pinpage_remote failed with npage=-14
Jun 28 18:57:19 sievert kernel: [18949.461188] feng: try vfio_pin_map_dma, but failed with -14
Jun 28 18:57:19 sievert kernel: [18949.461277] mcas: vm_close
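
The -14 in these messages is a negated errno value. As a quick sanity check, here is a minimal Python sketch that decodes it with the standard errno table. The helper name describe_kernel_ret is hypothetical (it is not part of vfio), and the sketch assumes the userspace errno numbering matches the kernel's, which holds on Linux.

```python
import errno
import os


def describe_kernel_ret(ret):
    """Turn a kernel-style return value (negative errno) into a readable string."""
    if ret >= 0:
        return "OK ({})".format(ret)
    # errno.errorcode maps the positive number to its symbolic name.
    name = errno.errorcode.get(-ret, "UNKNOWN")
    return "-{} ({})".format(name, os.strerror(-ret))


# The value reported by vfio_pin_map_dma in the log above.
print(describe_kernel_ret(-14))
```

On Linux the printed string starts with -EFAULT, so the pin attempt is failing with a bad-address error rather than, say, a locked-memory limit (-ENOMEM).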
fengggli commented 5 years ago

Walker replied to me in https://lists.01.org/pipermail/spdk/2019-July/003541.html; I should refer to https://lwn.net/Articles/375096/

fengggli commented 5 years ago

https://lwn.net/Articles/774411/

Williams said that he wanted to talk about APIs for revoke(), which would help with these problems where an mmap() region is shared and being used for DMA. If another process wants to truncate or punch a hole in the file in the region where DMA is being done, "you are screwed", at least for DAX.

dickeycl commented 5 years ago

@fengggli Using systemtap, it looks like the vma->vm_flags is 0xd0444fb, which has both VM_IO and VM_PFNMAP bits on. check_vma_flags fails if either of those bits is on, and get_user_pages returns -EFAULT.
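
To double-check the flag reading, here is a small Python sketch that decodes 0xd0444fb. The bit values are my reading of include/linux/mm.h in v4.15 and only a subset is listed, so treat them as an assumption and verify against your own tree. The helper names decode_vm_flags and gup_would_reject are hypothetical.

```python
# Subset of VM_* bits, assumed to match include/linux/mm.h in v4.15.
VM_FLAGS = {
    "VM_READ": 0x00000001,
    "VM_WRITE": 0x00000002,
    "VM_SHARED": 0x00000008,
    "VM_PFNMAP": 0x00000400,
    "VM_IO": 0x00004000,
    "VM_DONTEXPAND": 0x00040000,
    "VM_DONTDUMP": 0x04000000,
}


def decode_vm_flags(value):
    """Return the names of the known flags that are set in value."""
    return [name for name, bit in VM_FLAGS.items() if value & bit]


def gup_would_reject(value):
    """Mirror the check described above: either VM_IO or VM_PFNMAP is enough to fail."""
    return bool(value & (VM_FLAGS["VM_IO"] | VM_FLAGS["VM_PFNMAP"]))


flags = decode_vm_flags(0x0d0444fb)
print(flags)
print("rejected by check_vma_flags:", gup_would_reject(0x0d0444fb))
```

Bits that are set in the value but missing from the table are silently ignored, so the list is not a complete decode.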

In vfio, code near the end of vaddr_get_pfn should detect that the vma has VM_PFNMAP set, and should override the -EFAULT returned by get_user_pages. Somehow, the override does not seem to happen. Adding a few prints near the override logic shows that the PFN seems to fail the "is_invalid_reserved_pfn" test. As it is a valid PFN by the address computation, the only hope of convincing vfio to amend the return value seems to be the "reserved" aspect.

[ 8555.578001] vaddr_get_pfn: VMA 0000000045962a46
[ 8555.578003] vaddr_get_pfn: PFN feeb
[ 8555.578005] vaddr_get_pfn: RET -14

dickeycl commented 5 years ago

Systemtap script I ran to trace the functions and get the data values:

probe kernel.function("get_user_pages") {
  printf("get_user_pages mm %p start %lu nr_pages %lu gup_flags 0x%lx\n", $mm, $start, $nr_pages, $gup_flags);
}

probe kernel.statement("get_user_pages@mm/gup.c:649") {
  printf("get_user_pages 649 $vma %p 0x%lx\n", $vma, $gup_flags);
  if ( $vma ) {
    printf("vm_flags 0x%lx\n", $vma->vm_flags);
  }
}

probe kernel.function("__get_user_pages").return {
  printf("%s\n", $$return);
}

probe kernel.function("find_extend_vma") {
  printf("find_extend_vma addr %lx\n", $addr);
}

probe kernel.function("find_extend_vma").return {
  printf("find_extend_vma return %s\n", $$return);
}

probe kernel.function("get_gate_page") {
  printf("get_gate_page\n");
}

probe kernel.function("check_vma_flags") {
  printf("check_vma_flags\n");
}

fengggli commented 5 years ago

> @fengggli Using systemtap, it looks like the vma->vm_flags is 0xd0444fb, which has both VM_IO and VM_PFNMAP bits on. check_vma_flags fails if either of those bits is on, and get_user_pages returns -EFAULT.

Systemtap and your script are very helpful. I set it up on my side as well and also see 0xd0444fb.

> In vfio, code near the end of vaddr_get_pfn should detect that the vma has VM_IO set, and should override the -EFAULT returned by get_user_pages. Somehow, the override does not seem to happen.

In https://elixir.bootlin.com/linux/v4.15/source/drivers/vfio/vfio_iommu_type1.c#L367 are you referring to the VM_PFNMAP flag (instead of VM_IO)? (I am using the 4.15 kernel and didn't see vfio check VM_IO at the end of vaddr_get_pfn.)

dickeycl commented 5 years ago

Yes, will correct.

dickeycl commented 5 years ago

@fengggli PG_reserved should be set for pages which represent DAX memory, according to https://lkml.org/lkml/2018/12/14/290. Such pages should pass the is_invalid_reserved_pfn test. But the memory in the test case is not DAX memory, and so fails the test.

I tried to use simulated DAX as described in https://pmem.io/2016/02/22/pm-emulation.html. That did not work. I still did not see the PG_reserved flag.

fengggli commented 5 years ago

I also didn't see where the PG_reserved flag is set.

Are we currently holding the two assumptions below? (I get easily distracted in the kernel source, so this helps a lot to narrow down the problem.)

  1. in the vaddr_get_pfn function, all get_user_pages* functions will fail (since we have already pinned the memory and we have VM_PFNMAP set),
  2. we expect is_invalid_reserved_pfn will return true, but currently it doesn't.
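
To make the two assumptions concrete, here is a simplified Python model of the tail of vaddr_get_pfn as I read it in v4.15. It is only a sketch: the real is_invalid_reserved_pfn also looks at compound page heads, which is omitted here. The function fallback_after_gup_failure, the pfn_valid/page_reserved callbacks and the vm_start value are hypothetical stand-ins; only the PFN 0xfeeb, the flags 0xd0444fb and the -14 come from the traces in this thread.

```python
EFAULT = 14          # errno behind the "-14" in the logs
VM_PFNMAP = 0x00000400
PAGE_SHIFT = 12      # 4 KiB pages assumed


def is_invalid_reserved_pfn(pfn, pfn_valid, page_reserved):
    """Model: a PFN without a struct page counts as invalid; otherwise it must be PG_reserved."""
    if pfn_valid(pfn):
        return page_reserved(pfn)
    return True


def fallback_after_gup_failure(ret, vaddr, vm_start, vm_pgoff, vm_flags,
                               pfn_valid, page_reserved):
    """Model of what happens after get_user_pages* has failed (assumption 1).

    Returns (ret, pfn); ret stays negative unless the override applies.
    """
    pfn = None
    if vm_flags & VM_PFNMAP:
        # Same address computation as the PFNMAP branch in vaddr_get_pfn.
        pfn = ((vaddr - vm_start) >> PAGE_SHIFT) + vm_pgoff
        if is_invalid_reserved_pfn(pfn, pfn_valid, page_reserved):
            ret = 0  # override the -EFAULT (assumption 2)
    return ret, pfn


# Hypothetical VMA whose first page maps to the PFN seen in the trace.
vm_start = 0x7f0000000000
args = dict(ret=-EFAULT, vaddr=vm_start, vm_start=vm_start,
            vm_pgoff=0xfeeb, vm_flags=0x0d0444fb,
            pfn_valid=lambda pfn: True)

not_reserved = fallback_after_gup_failure(page_reserved=lambda pfn: False, **args)
reserved = fallback_after_gup_failure(page_reserved=lambda pfn: True, **args)
print(not_reserved, reserved)
```

In this model a valid PFN without PG_reserved keeps the -14, which matches the RET -14 trace; flipping only the reserved answer turns the result into 0.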

I am still searching for places where the flag might be set; meanwhile I did find some related information below.

dickeycl commented 5 years ago

@fengggli Yes to both assumptions. Note that the second bullet item is outdated; see the last comment on the entry. I was using memory specified in the linux cmdline by memmap=8G!240G. There is an alternate specification (see https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html) using $, memmap=8G$240G, which claims to "reserve" memory. If this is the same "reserve" as PG_reserved, that may be how to get the pages marked as reserved. According to that reference, there are four characters which can be used to specify memory:

  @ E820_TYPE_RAM - force use (presumably as normal memory)
  # (hash mark) E820_TYPE_ACPI - mark as ACPI
  $ E820_TYPE_RESERVED - mark as reserved
  ! E820_TYPE_PRAM - mark as protected (I think "protected" is a typo, and "persistent" was intended)
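
For reference, the memmap=size<char>start notation can be turned into a physical address range with a few lines of Python. This is only a sketch: parse_memmap and parse_size are hypothetical helpers, they accept decimal numbers with a K/M/G suffix only (the kernel's own parser accepts more forms), and the type names are the ones listed above.

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
E820_TYPES = {
    "@": "E820_TYPE_RAM",
    "#": "E820_TYPE_ACPI",
    "$": "E820_TYPE_RESERVED",
    "!": "E820_TYPE_PRAM",
}


def parse_size(text):
    """Parse a decimal number with an optional K/M/G suffix into bytes."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text)
    if m is None:
        raise ValueError("unsupported size: " + text)
    return int(m.group(1)) * UNITS[m.group(2)]


def parse_memmap(spec):
    """Return (type name, first byte, last byte) for a memmap=size<char>start option."""
    m = re.fullmatch(r"memmap=(\w+)([@#$!])(\w+)", spec)
    if m is None:
        raise ValueError("unsupported memmap spec: " + spec)
    size = parse_size(m.group(1))
    start = parse_size(m.group(3))
    return E820_TYPES[m.group(2)], start, start + size - 1


kind, first, last = parse_memmap("memmap=8G!240G")
print(kind, hex(first), hex(last), "first PFN", hex(first >> 12))
```

For memmap=8G!240G this gives the range 0x3c00000000 to 0x3dffffffff, so with 4 KiB pages the PFNs of that region should start at 0x3c00000.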

The pmem emulation documentation at https://pmem.io/2016/02/22/pm-emulation.html mentions only !, and uses the word "Reserve" to describe its effect. I suspect that "reserve" means different things to different people.

It is a bit tricky to get the $ properly escaped for grub2. In /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT='hugepagesz=2M hugepages=2048 intel_iommu=on memmap=8G\$240G'

Having done that, I do not see /dev/pmem0. So perhaps whatever creates /dev/pmem does not recognize regions specified with the $.

fengggli commented 5 years ago

> @fengggli Yes to both assumptions. Note that the second bullet item is outdated; see the last comment on the entry. I was using memory specified in the linux cmdline by memmap=8G!240G. There is an alternate specification, using $, memmap=8G$240G, which claims to "reserve" memory. If this is the same "reserve" as PG_reserved, that may be how to get the pages marked as reserved. It is a bit tricky to get the $ properly escaped for grub2. In /etc/default/grub:

Oh, interesting. Did the memory registration call succeed when using simulated pmem from reserved memory? (I will try it on my side too.)

We might also take a look at how real persistent memory (aep1) and NIC device memory get mapped into the physical address space.

My kernel log (without using $ in memmap) also marks some other ranges as "reserved". I guess those might also be reserved by the system for DMA:

[  +0.000000] e820: user-defined physical RAM map:
...
[  +0.000000] user: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[  +0.000000] user: [mem 0x00000000ff800000-0x00000000ffffffff] reserved
[  +0.000000] user: [mem 0x0000000100000000-0x00000007ffffffff] usable
[  +0.000000] user: [mem 0x0000000800000000-0x000000087fffffff] persistent (type 12)
[  +0.000000] user: [mem 0x0000000880000000-0x00000008ffffffff] persistent (type 12)
[  +0.000000] user: [mem 0x0000000900000000-0x000000107fffffff] usable
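
To read the ranges above more easily, here is a small Python sketch that turns one e820 line into a start address, a size and a type. The regular expression and the helper parse_e820_line are hypothetical and only match the line format shown in this log.

```python
import re

# Matches "[mem 0x<start>-0x<end>] <type>" as printed in the log above.
E820_LINE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\]\s+(.+)$")
GIB = 1 << 30


def parse_e820_line(line):
    """Return (start, size in bytes, type string), or None if the line does not match."""
    m = E820_LINE.search(line)
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16)  # the end address is inclusive
    return start, end - start + 1, m.group(3).strip()


line = "[  +0.000000] user: [mem 0x0000000800000000-0x000000087fffffff] persistent (type 12)"
start, size, kind = parse_e820_line(line)
print("{} at {} GiB, size {} GiB".format(kind, start // GIB, size // GIB))
# prints "persistent (type 12) at 32 GiB, size 2 GiB"
```

The same helper applied to the second persistent line gives a 2 GiB region starting at 34 GiB.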

GRUB_CMDLINE_LINUX_DEFAULT='hugepagesz=2M hugepages=2048 intel_iommu=on memmap=8G$240G'