Wallacoloo / Raspberry-Pi-DMA-Example

Simplest example of copying memory from one region to another using DMA in userland
The Unlicense

L2 cache bypassing allocation hangs the system #3

Open juj opened 6 years ago

juj commented 6 years ago

In an attempt to get DMA access working from user space on a Raspberry Pi 3, I ported the safe L2-cache-bypassing allocation code from dma-gpio.c over to dma-example.c. The code is available at

https://github.com/juj/Raspberry-Pi-DMA-Example/commit/fcc8abef6705b54f3eaf592dc38498da2ea3f7c2#diff-7f20fb2096b20c653020741d0a12f09bR326

No matter what I try, the above code hangs the Pi (requiring a hard power cycle of the device), usually after printing

pi@raspi:~/code/Raspberry-Pi-DMA-Example $ sudo ./dma-example
memfd: 3, pagemapfd: 4
mapped: 0x76f62000
destination was initially: ''
sleep

without ever getting past the sleep(1); call to print the next "sleep done" message. I wonder if there is something in the above ported code that you'd be able to spot?

(As a side note, I had to change the DMA channel from the default 5 to 8, since sniffing the DMA channels' CS registers showed that channel 5 is in occasional use on the system, about one transfer every few minutes. I'm not sure by what, though.)
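For readers who want to repeat the channel check, the sketch below shows the arithmetic behind sniffing a channel's CS register. The offsets and bit positions are taken from the BCM2835 ARM Peripherals datasheet and should be treated as assumptions here, as should the helper names DmaCsOffset and DmaChannelLooksBusy. Mapping the registers through /dev/mem is left out.

```cpp
#include <cassert>
#include <cstdint>

// Offsets and bit positions as given in the BCM2835 ARM Peripherals datasheet.
// Treat them as assumptions and verify them for your board.
constexpr uint32_t DMA_BASE_OFFSET    = 0x7000;  // DMA block, relative to the peripheral base
constexpr uint32_t DMA_CHANNEL_STRIDE = 0x100;   // channels 0..14 are 0x100 bytes apart
constexpr uint32_t DMA_CS_ACTIVE      = 1u << 0; // a transfer is currently running
constexpr uint32_t DMA_CS_END         = 1u << 1; // the last transfer has completed
constexpr uint32_t DMA_CS_ERROR       = 1u << 8; // the channel has flagged an error

// Byte offset of a channel's CS register from the peripheral base.
uint32_t DmaCsOffset(unsigned channel)
{
  assert(channel <= 14); // channel 15 lives in a separate register block
  return DMA_BASE_OFFSET + channel * DMA_CHANNEL_STRIDE;
}

// Hypothetical helper: true if a sampled CS value shows the channel in use.
bool DmaChannelLooksBusy(uint32_t cs)
{
  return (cs & DMA_CS_ACTIVE) != 0;
}
```

With these values, DmaCsOffset(5) is 0x7500 and DmaCsOffset(8) is 0x7800. A single sample can miss a short transfer, so a check like this is usually run in a polling loop.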

juj commented 6 years ago

Reducing the test case further,

https://github.com/juj/Raspberry-Pi-DMA-Example/commit/4dd506b09f133a5828d2e2d41de55f631c4e0769

has an example that allocates a single page, then tries to create an uncached view of that page and memset it to zero. The result is

$ g++ -o uncached-memory-allocation uncached-memory-allocation.cpp
$ sudo ./uncached-memory-allocation
Allocating uncached memory
Created uncached memory view, memsetting it to zero
Sleep before memset done
Memset done

after which the Pi hangs, never printing out "Memset done and slept some".

It is as if it's crashing on the line that sleeps for one second, whereas the sleep before the memset worked just fine, which suggests some kind of asynchronous behavior in the memset that takes the system down. Commenting out the memset makes the test pass, so there's something bad about accessing that uncached page.

I wonder if this method of allocating uncached pages in user space could be fixed somehow? It'd have great potential to optimize the power consumption of my fbcp-ili9341 driver :)


juj commented 6 years ago

Playing around, changing this line from

 return virtToPhys(virt, pagemapfd) | 0x40000000;

to

 return virtToPhys(virt, pagemapfd);

i.e. not moving the physical page into the uncached address range, avoids the crash (although naturally the memory is then no longer uncached). This suggests that something is going awry in the physical uncached addressing scheme rather than in, for example, the mmap() calls. Not sure whether that helps much in finding a solution, though.

Wallacoloo commented 6 years ago

@juj It looks to me like you've confirmed issue #2 for the RPi 3.

In the Pi 1, DMA was blind to the L1 cache, so I had to write to (and probably read from?) a separate bus address that's L2-coherent (0x4000_0000 - 0x6000_0000). These mappings come from here

It wouldn't surprise me if these bus mappings were changed between different RPi revisions - especially as the Pi3 has more ram. Specifically, I found this page - it looks like possibly the following line will work:

 return virtToPhys(virt, pagemapfd) | 0xc0000000; // or virtToPhys(virt, pagemapfd) | bcm_host_get_sdram_address(), if you can find which header declares this function.

I'm a little unsure of whether this is just bypassing the L1 cache or writing directly to SDRAM though (I'd guess the former, because if the L2-cache is visible to all bus devices, having a dedicated memory range for bypassing it doesn't seem very practical).
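To make the address arithmetic under discussion concrete, here is a minimal sketch of how the top two bits of a bus address select a cache alias. The alias values follow the BCM2835 ARM Peripherals datasheet and the two constants quoted in this thread. The names PhysToBus, BusToPhys and the BUS_ALIAS_* constants are made up for illustration, and whether a given alias behaves as described on a particular Pi revision is an assumption to verify.

```cpp
#include <cassert>
#include <cstdint>

// The top two bits of a VideoCore bus address select a cache alias.
// Values per the BCM2835 datasheet; treat them as assumptions on other boards.
constexpr uint32_t BUS_ALIAS_MASK        = 0xC0000000u;
constexpr uint32_t BUS_ALIAS_L2_COHERENT = 0x40000000u; // alias used by the original Pi 1 code
constexpr uint32_t BUS_ALIAS_UNCACHED    = 0xC0000000u; // alias suggested above for the Pi 3

// Combines a physical address with one of the alias values above.
uint32_t PhysToBus(uint32_t phys, uint32_t alias)
{
  assert((alias & ~BUS_ALIAS_MASK) == 0); // alias must only use the top two bits
  return (phys & ~BUS_ALIAS_MASK) | alias;
}

// Strips the alias bits again, giving back the plain physical address.
uint32_t BusToPhys(uint32_t bus)
{
  return bus & ~BUS_ALIAS_MASK;
}
```

For instance, PhysToBus(0x01234000, BUS_ALIAS_UNCACHED) yields 0xC1234000, and BusToPhys turns that back into 0x01234000. The working code posted later in this thread strips the alias bits in the same way before passing the address to mmap.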

juj commented 6 years ago

> @juj It looks to me like you've confirmed issue #2 for the RPi 3.

Thanks for posting that, commented there about what I've found out so far.

> In the Pi 1, DMA was blind to the L1 cache, so I had to write to (and probably read from?) a separate bus address that's L2-coherent (0x4000_0000 - 0x6000_0000).

I am probably mistaken in my understanding of which cache level the DMA peripheral is coherent with, and which level it does not see. I thought the DMA controller did not see the L2 cache, but was coherent with the main ARM CPU's reads and writes to the L1 cache. Now that I search for where I got that understanding, I can't find the source, so it is probably wrong information.

Reading resources on the web about the Pi 3, I also came to understand that instead of

 return virtToPhys(virt, pagemapfd) | 0x40000000;

on Pi 1, I should do

 return virtToPhys(virt, pagemapfd) | 0xc0000000;

on Pi 3. I tried that out in the test snippet, but unfortunately it did not produce any different result: attempting to write to the resulting virtual address hangs the Pi 3 as well.

The line I suspect is this mmap. My reasoning is that once that mmap call returns, I should be able to look up the returned virtual address in /proc/self/pagemap and confirm the physical (bus) address that the newly mapped virtual address points to. However, /proc/self/pagemap states that the virtual page is not mapped to physical memory, and reports physical address 0 for this mmapped "uncached view". The mmap call itself did not fail, so I suspect it is returning a mapping that points to address zero, in which case writing to that memory would cause a crash.

Although I'm surprised that the whole Pi crashes, and not just the calling process with an access violation. Perhaps it's not an access violation, and the Pi is happily overwriting its own low memory addresses, hence the complete system failure.
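The /proc/self/pagemap check described above can be sketched as follows. This is an illustration rather than the code from the linked commit: the PagemapEntry struct and both function names are hypothetical. The bit layout (bit 63 present, bit 62 swapped, bits 0-54 page frame number) follows the Linux kernel's pagemap documentation, and recent kernels report the frame number as zero to callers without CAP_SYS_ADMIN.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// Decoded view of one 64-bit /proc/self/pagemap entry.
struct PagemapEntry
{
  bool present;  // bit 63: page is present in RAM
  bool swapped;  // bit 62: page is in swap
  uint64_t pfn;  // bits 0-54: page frame number, meaningful only if present and not swapped
};

PagemapEntry DecodePagemapEntry(uint64_t raw)
{
  PagemapEntry e;
  e.present = (raw >> 63) & 1;
  e.swapped = (raw >> 62) & 1;
  e.pfn = raw & ((1ULL << 55) - 1);
  return e;
}

// Reads the pagemap entry covering virtual address virt. Returns false on I/O error.
bool ReadPagemapEntry(const void *virt, PagemapEntry *out)
{
  assert(out != nullptr);
  long pageSize = sysconf(_SC_PAGESIZE);
  int fd = open("/proc/self/pagemap", O_RDONLY);
  if (fd < 0) return false;
  uint64_t raw = 0;
  // One 8-byte entry per virtual page, indexed by virtual page number.
  off_t offset = (off_t)((uintptr_t)virt / (uintptr_t)pageSize * sizeof(raw));
  ssize_t n = pread(fd, &raw, sizeof(raw), offset);
  close(fd);
  if (n != (ssize_t)sizeof(raw)) return false;
  *out = DecodePagemapEntry(raw);
  return true;
}
```

When an entry is present, the physical address is pfn times the page size plus the offset of the virtual address within its page. An entry that decodes as not present, as reported above, carries no usable frame number.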

My current plan of action is to abandon this mmap based approach, and attempt to utilize the mailbox interface to allocate pages of memory that would bypass the cache. From reading forum posts, that sounds like it should be a possible way to go from user space. I'll post back if I manage to find success with that.

juj commented 6 years ago

Using the mailbox interface, I was able to get DMA working. The code I ended up with is

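// Headers this snippet needs. FATAL_ERROR, ALIGN_UP, PAGE_SIZE and mem_fd are not defined in this snippet;
// mem_fd is the file descriptor passed to mmap() below (presumably an open descriptor for /dev/mem).
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

struct GpuMemory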
{
  uint32_t allocationHandle;
  void *virtualAddr;
  uintptr_t busAddress;
  uint32_t sizeBytes;
};

// Sends a pointer to the given buffer over to the VideoCore mailbox. See https://github.com/raspberrypi/firmware/wiki/Mailbox-property-interface
void SendMailbox(void *buffer)
{
  int vcio = open("/dev/vcio", 0);
  if (vcio < 0) FATAL_ERROR("Failed to open VideoCore kernel mailbox!");
  int ret = ioctl(vcio, _IOWR(/*MAJOR_NUM=*/100, 0, char *), buffer);
  close(vcio);
  if (ret < 0) FATAL_ERROR("SendMailbox failed in ioctl!");
}

// Defines the structure of a Mailbox message
template<int PayloadSize>
struct MailboxMessage
{
  MailboxMessage(uint32_t messageId):messageSize(sizeof(*this)), requestCode(0), messageId(messageId), messageSizeBytes(sizeof(uint32_t)*PayloadSize), dataSizeBytes(sizeof(uint32_t)*PayloadSize), messageEndSentinel(0) {}
  uint32_t messageSize;
  uint32_t requestCode;
  uint32_t messageId;
  uint32_t messageSizeBytes;
  uint32_t dataSizeBytes;
  union
  {
    uint32_t payload[PayloadSize];
    uint32_t result;
  };
  uint32_t messageEndSentinel;
};

// Message IDs for different mailbox GPU memory allocation messages
#define MEM_ALLOC_MESSAGE 0x3000c // This message is 3 u32s: numBytes, alignment and flags
#define MEM_FREE_MESSAGE 0x3000f // This message is 1 u32: handle
#define MEM_LOCK_MESSAGE 0x3000d // 1 u32: handle
#define MEM_UNLOCK_MESSAGE 0x3000e // 1 u32: handle

// Memory allocation flags
#define MEM_ALLOC_FLAG_DIRECT (1 << 2) // Allocate uncached memory that bypasses L1 and L2 cache on loads and stores

// Sends a mailbox message with 1xuint32 payload
uint32_t Mailbox(uint32_t messageId, uint32_t payload0)
{
  MailboxMessage<1> msg(messageId);
  msg.payload[0] = payload0;
  SendMailbox(&msg);
  return msg.result;
}

// Sends a mailbox message with 3xuint32 payload
uint32_t Mailbox(uint32_t messageId, uint32_t payload0, uint32_t payload1, uint32_t payload2)
{
  MailboxMessage<3> msg(messageId);
  msg.payload[0] = payload0;
  msg.payload[1] = payload1;
  msg.payload[2] = payload2;
  SendMailbox(&msg);
  return msg.result;
}

#define BUS_TO_PHYS(x) ((x) & ~0xC0000000)

// Allocates the given number of bytes in GPU side memory, and returns the virtual address and physical bus address of the allocated memory block.
// The virtual address holds an uncached view to the allocated memory, so writes and reads to that memory address bypass the L1 and L2 caches. Use
// this kind of memory to pass data blocks over to the DMA controller to process.
GpuMemory AllocateUncachedGpuMemory(uint32_t numBytes)
{
  GpuMemory mem;
  mem.sizeBytes = ALIGN_UP(numBytes, PAGE_SIZE);
  mem.allocationHandle = Mailbox(MEM_ALLOC_MESSAGE, /*size=*/mem.sizeBytes, /*alignment=*/PAGE_SIZE, /*flags=*/MEM_ALLOC_FLAG_DIRECT);
  mem.busAddress = Mailbox(MEM_LOCK_MESSAGE, mem.allocationHandle);
  mem.virtualAddr = mmap(0, mem.sizeBytes, PROT_READ | PROT_WRITE, MAP_SHARED, mem_fd, BUS_TO_PHYS(mem.busAddress));
  if (mem.virtualAddr == MAP_FAILED) FATAL_ERROR("Failed to mmap GPU memory!");
  return mem;
}

void FreeUncachedGpuMemory(GpuMemory mem)
{
  munmap(mem.virtualAddr, mem.sizeBytes);
  Mailbox(MEM_UNLOCK_MESSAGE, mem.allocationHandle);
  Mailbox(MEM_FREE_MESSAGE, mem.allocationHandle);
}

and it is successfully giving me virtual addresses in user space that bypass the cache, and DMA is working nicely now.
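As a reading aid for the MailboxMessage template above, the sketch below builds the same single-tag message as a flat array of 32-bit words, so the layout can be inspected without a Pi. BuildPropertyMessage is a hypothetical helper, not part of the posted code. The word order mirrors the struct above and the mailbox property interface wiki page it links to.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: lays out a single-tag mailbox property message word by word.
std::vector<uint32_t> BuildPropertyMessage(uint32_t tagId, const std::vector<uint32_t> &values)
{
  const uint32_t valueBytes = (uint32_t)(values.size() * sizeof(uint32_t));
  std::vector<uint32_t> msg;
  msg.push_back(0);          // [0] total message size in bytes, patched below
  msg.push_back(0);          // [1] request code, 0 = process request
  msg.push_back(tagId);      // [2] tag identifier, e.g. 0x3000c for the allocate message
  msg.push_back(valueBytes); // [3] size of the value buffer in bytes
  msg.push_back(valueBytes); // [4] request word; the struct above sets it to the payload size
  msg.insert(msg.end(), values.begin(), values.end()); // [5..] request values, overwritten by the response
  msg.push_back(0);          // end tag
  assert(msg.size() == values.size() + 6); // 5 header words + payload + end tag
  msg[0] = (uint32_t)(msg.size() * sizeof(uint32_t));
  return msg;
}
```

For the three-word allocate message this gives 9 words, or 36 bytes, which matches sizeof(MailboxMessage<3>). The union in the struct reflects that the firmware writes its response over the first payload word. According to the wiki page, the firmware also sets the request code word to 0x80000000 on success, which the posted code does not check.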