BelaPlatform / Bela

Bela: core code, IDE and lots of fun!

optimize memory performance for PRU/CPU #313

Open giuliomoro opened 7 years ago

giuliomoro commented 7 years ago

Accessing PRU RAM from the ARM is slow: it is currently accessed in chunks of 2 bytes at a time, which is VERY slow. Broadly speaking, we have two options:

  1. access PRU RAM in larger chunks. Bonus: re-organize the PRU memory so that it is easier to copy stuff in bulk (e.g.: [MCASP_DAC_0 MCASP_ADC_0 SPI_DAC_0 SPI_ADC_0 .... MCASP_DAC_1 MCASP_ADC_1 SPI_DAC_1 SPI_ADC_1]); see the sketch after this list
  2. write to system RAM from the PRU. Bonus: do int/float conversion and inter/deinterleaving on the PRU
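
For illustration only, here is a minimal sketch of what the "bonus" layout in option (1) could look like: if the PRU interleaves all the words belonging to one frame, the ARM side can move a whole block with a single bulk copy instead of many 2-byte accesses. The struct name, field names and channel counts are assumptions for the sketch, not the actual Bela memory map.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical per-frame layout, with all words for one frame adjacent. */
    typedef struct {
        uint16_t mcaspDac[2]; /* MCASP_DAC_n: audio out  */
        uint16_t mcaspAdc[2]; /* MCASP_ADC_n: audio in   */
        uint16_t spiDac[8];   /* SPI_DAC_n:   analog out */
        uint16_t spiAdc[8];   /* SPI_ADC_n:   analog in  */
    } PruFrame;

    /* One bulk copy per block instead of per-channel 2-byte reads. */
    static void copyBlockFromPru(PruFrame* dst, const PruFrame* src,
                                 unsigned int framesPerBlock)
    {
        memcpy(dst, src, framesPerBlock * sizeof(PruFrame));
    }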
giuliomoro commented 7 years ago

The proof of concept in https://github.com/giuliomoro/Bela/commit/9008e32184731c174daca7da771c618609d49f19 shows that 10% CPU is due to PRU memory transfers. There, larger blocks of memory are transferred (i.e.: all of SPI_DAC at the same time, all of MCASP_ADC at the same time, etc.). This saves 7% CPU at 16 samples per block. However, the 4 memcpy lines still add an overhead of 3%, which could be saved by implementing (2) (assuming it works).

Once the memory copies are factored out, it seems that only 0.8% CPU is devoted to format conversion (int -> float and vice versa) when using interleaved buffers without resampling (thus making part of the "bonus steps" for (2) useless; however, the option to handle (de)interleaving on the PRU is still appealing). More on performance: it seems that by continuing the loop as soon as the "interrupt" is "acknowledged" (i.e.: lastPRUBuffer = pru_buffer_comm[PRU_CURRENT_BUFFER];), the process still uses 4.2% CPU (probably due to the 4 context switches taking place when sleeping, so this could be cut down with interrupts).
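
For context, the format conversion being timed is essentially a per-sample scaling loop along the lines of the sketch below. Names and the exact scaling are illustrative, not the actual Bela code (the real code uses different scalings for audio and analog channels).

    #include <stdint.h>

    /* Sketch: convert a block of interleaved 16-bit words from the PRU
       buffer into the interleaved float buffer handed to render(). */
    static void pruToFloat(const uint16_t* pruBuf, float* out,
                           unsigned int frames, unsigned int channels)
    {
        for(unsigned int n = 0; n < frames * channels; ++n)
            out[n] = pruBuf[n] / 65536.f; // one int->float conversion and scale per sample
    }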

I think that without a refactoring of the PRU memory (bonus step of (1)), the implementation in https://github.com/giuliomoro/Bela/commit/9008e32184731c174daca7da771c618609d49f19 will be penalized at smaller block sizes.

Also note that the digital channels are not affected by these changes; surprisingly, they do not seem to affect the performance much (if at all).

giuliomoro commented 7 years ago

Performance using the rtdm driver, measured with https://github.com/BelaPlatform/Bela/commit/88297d69c4cbe291c12434a9655ae9891b271352 (note: in a later commit I added a sleep before waiting for the interrupt, which adds about 2%).

CPU usage with 16 samples per block: 1.3% for the bare-minimum while(!gShouldStop) read(...) loop. But in the real world you need to add more on top of that.

Some deeper analysis of the burden of copying memory. I made a fake loop without format conversion:

    while(!gShouldStop)
    {
        read(rtdm_fd, &value, sizeof(value));
        int pruBufferForArm = pru_buffer_comm[PRU_CURRENT_BUFFER] == 0 ? 1 : 0;
        pruMemory->copyFromPru(pruBufferForArm);
        // here you would normally have format conversion, render(), format conversion
        pruMemory->copyToPru(pruBufferForArm);
    }

This takes 6%. I then tried to artificially increase the memory copied around, while keeping the same blocksize. 1x gives 6% total. 2x gives 10.5% total. 3x gives 15% total.  4x gives 19.6% total. 5x gives 24% total. That is, a fixed increment of about 4.5% "per blocksize".

For reference, when using memcpy for ARM memory only, these figures become: 1x gives 1.6% total. 2x gives 1.7% total. 3x gives 1.9% total. 4x gives 2.0% total. 5x gives 2.1% total. That is, a fixed increment of about 0.1% "per blocksize".
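
The scaling test above amounts to something like the following sketch. The function, the factor parameter and the buffer names are made up for illustration; swapping pruBuf for an ordinary heap buffer gives the "ARM memory only" numbers.

    #include <stdint.h>
    #include <string.h>

    /* Sketch: copy `factor` times the per-block payload on every block,
       while the blocksize itself stays fixed at 16 frames. */
    static void fakeCopies(uint8_t* scratch, const uint8_t* pruBuf,
                           size_t blockBytes, int factor)
    {
        for(int i = 0; i < factor; ++i) // factor == 1 is the real workload
            memcpy(scratch + i * blockBytes, pruBuf, blockBytes);
    }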

Bottom line: we are still losing 4.4% CPU because we are reading from PRU memory. I think it is worth revisiting this in the future, trying to get the PRU to access the ARM RAM directly, or using a DMA, though the DMA option seems more complicated (the IRQ line for the DMA is needed by Linux for some critical system stuff) and would most likely slightly increase latency (we need to wait for the DMA transfer to finish before being able to process the block!).

mvduin commented 6 years ago

The cortex-A8 cannot read from DDR3 memory any faster than it can read from PRUSS memory. In fact, the access latency is likely to be higher for DDR3 memory.

The type of page mappings used by the cortex-A8 probably does have an effect on performance. Preferably they should be "normal non-cacheable" for this, rather than "device" or "strongly ordered". Using cached memory and explicit cache management might be faster than non-cached, but I doubt it when using big neon reads, plus cache management can only be done from kernelspace.
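
As a concrete (hypothetical) example of that mapping choice: if the PRU RAM is exposed to userspace through a kernel driver's mmap handler, the vm_page_prot it picks decides the memory type the CPU sees. The sketch below assumes a character-device driver and is not necessarily how the Bela rtdm driver does it; on 32-bit ARM, pgprot_writecombine() is normally implemented as Normal non-cacheable, while pgprot_noncached() gives the stricter strongly-ordered type.

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* Sketch of a driver mmap handler for the PRU RAM (address for illustration). */
    static int pru_mem_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long pruPhys = 0x4a300000; /* PRUSS base on AM335x (assumption) */
        /* assumption: on ARMv7 this yields a Normal non-cacheable mapping
           rather than a Device/strongly-ordered one */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start, pruPhys >> PAGE_SHIFT,
                               vma->vm_end - vma->vm_start, vma->vm_page_prot);
    }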

mvduin commented 6 years ago

I just dug up some timings I did a long time ago on a cortex-A8 based TI SoC closely related to the AM335x. A simple memcpy function using 16-byte neon loads and stores to copy data from normal non-cacheable memory to normal memory was 8 times faster than one copying from device memory to normal memory. In contrast, writing to device memory (using 16-byte writes) didn't seem to be any slower than writing to normal non-cacheable memory.

For reference, here are the timings I got then (in cycles/byte): https://goo.gl/oPfwuv

Columns are the memory type and L1/L2 cache behaviour of the source region; rows are the destination region. Two copy functions were tested: bmove was bytewise (ldrb/strb) while qmove used 16-byte chunks (vld1/vst1). L1 cache was cleared before each test; L2 cache was either cleared or preallocated depending on the test. "sync" refers to the strongly-ordered memory type.

Cache behaviours:

  — = not cacheable
  ra (read-allocate) = first read of a cacheline misses and allocates
  nwa (no-write-allocate) = all writes miss the cache and do not allocate
  wa (write-allocate) = first write of a cacheline misses and allocates (L2 only)
  hit = cache was preallocated and all reads/writes hit in cache (L2 only)
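
For reference, a qmove-style copy with 16-byte NEON chunks can be written with intrinsics roughly as follows. This is only a sketch, not the code that produced the table, and it assumes the length is a multiple of 16 bytes.

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 16 bytes (one q register) per iteration, like the vld1/vst1 copy above. */
    static void qmove(uint8_t* dst, const uint8_t* src, size_t len)
    {
        for(size_t i = 0; i < len; i += 16)
            vst1q_u8(dst + i, vld1q_u8(src + i)); // 128-bit NEON load + store
    }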

giuliomoro commented 5 years ago

This could come in handy https://stackoverflow.com/questions/34888683/arm-neon-memcpy-optimized-for-uncached-memory