[Closed] ramtej closed this issue 2 weeks ago
To me, it seems that the ESP-DSP "C" examples are merely ASM-optimized and do not use any DSP functions. Is that the case?
Since these extra instructions are all integer instructions, only fixed-point FFT uses them: https://github.com/espressif/esp-dsp/blob/71514173b58b960173b40c4ade9d15d372770a74/modules/fft/fixed/dsps_fft2r_sc16_aes3.S
Yes, that makes sense. This means that the DSP primitives have already been spotted in the wild and not only described in the above paper.
My approach now would be to fork or extend e.g. https://gitlab.com/teskje/microfft-rs so that, similar to the SHA HW acceleration, it uses the DSP primitives. Specifically, I would want to replace the radix-2 butterfly computation with the DSP HW functions.
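For reference, the radix-2 butterfly is a small computation; here is a hedged scalar sketch of what a hardware instruction would replace. The `Complex` type below is a minimal stand-in for illustration, not microfft's actual API:

```rust
// Minimal stand-in complex type (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex {
    re: f32,
    im: f32,
}

// One radix-2 decimation-in-time butterfly: given inputs a, b and the
// twiddle factor w, compute (a + w*b, a - w*b). This is the scalar work
// that a DSP butterfly instruction would perform in hardware.
fn butterfly(a: Complex, b: Complex, w: Complex) -> (Complex, Complex) {
    // t = w * b (complex multiplication by the twiddle factor)
    let t = Complex {
        re: w.re * b.re - w.im * b.im,
        im: w.re * b.im + w.im * b.re,
    };
    (
        Complex { re: a.re + t.re, im: a.im + t.im },
        Complex { re: a.re - t.re, im: a.im - t.im },
    )
}
```

With `w = 1 + 0i` this degenerates to the familiar sum/difference pair, which is a handy sanity check.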
Another (maybe stupid) idea could be to use the inline assembler asm!{} macro, especially since the S3 target supports, besides the DSP instructions, also some nice SIMD operations.
The following compiles and even runs on the Xtensa platform:
std::hint::black_box(unsafe {
    asm!("nop");
});
Now there is a lot between the above code and the universe that finally runs my code, especially LLVM. I have dug into the esp Rust branch a bit, and it looks like at least something was done in that direction:
pub enum InlineAsmArch {
    X86,
    X86_64,
    Arm,
    ..
    Xtensa,
    ..
}
Is that enough? I had naively just tried a random DSP mnemonic; the error message is "mnemonic unknown".
Does anyone have an idea on the topic?
Thanks, Jiri
@ramtej AFAIK, LLVM plain old doesn't support/recognize the hundreds of DSP instructions, so to use inline assembly it would require either adding them all, or using the escaped binary opcode (but that also means specifying registers by number)
I wish support could be added, but it's a non-trivial amount of search and replace in the llvm-project codebase.
@MabezDev showed me this "trick" a good while ago
@zRedShift Is the ".byte 0x00, 0x30, 0x00" sequence an inline assembly instruction specified as a sequence of bytes, serving as a workaround when the desired assembly instruction isn't supported by LLVM? Indeed, this is a neat trick; I'll give it a try.
Thanks!
There is a nice crate that does the '.byte' encoding for the RISC-V V extension instructions - https://github.com/cryptape/rvv-encoder. It would be exciting to have something similar for the Xtensa extension instructions.
unsafe {
    xtensa_asm::asm!(
        ..
        "ee.cmul.s16 q3,q2,q1,3",
        ..
    );
}
@ramtej awesome find. This will also help with the ESP32-P4's custom RISC-V DSP extensions, when it comes out.
I haven't investigated it yet (I was actually planning on just writing .S files and compiling/linking them), but I wonder: is there support for the rur.*/wsr.* etc. instructions in LLVM (chapter 1.6.10 in the ESP32-S3 technical reference, Processor Control Instructions)? They are necessary to manipulate the special registers that control things like the FFT width in fixed-point mode.
Yes, probably the easiest thing for now will be to just link in the .S files. I think with rur.*/wsr.* it will be similar to the other instructions. I need (i)FFT for my application, and therefore I am trying to get the maximum out of the S3. The DSP benchmarks are promising, but I need the functions at the Rust level. Maybe it makes sense to develop an esp-dsp-rs crate for the current Xtensa and the future RISC-V DSP extensions.
I've been warming up with some Xtensa LLVM backend contributions over the last few days. My end goal is fast (scalar or vector) DSP for the ESP32-S3 in Rust, since I'm working on real-time audio processing/encoding and need to maximize performance to cram as much processing as possible into the pipeline. This is obviously DCT/FFT/FIR etc. heavy, among other things.
So a few more scalar instruction PRs, and I will move on to adding the 128-bit registers and instructions, so that the inline assembly can be supported directly. It's been a dream of mine for about a year to do this; at the time, the ESP32-S3 technical manual still didn't include the extensions, and reverse engineering bfd/xtensa-modules.c was a major pain, and by the time they released the instructions in the reference, priorities had shifted. Now is the time to do the work.
Maybe even add auto-vectorization support in the (far) future, but it's a daunting task since the instructions are pipelined, the cost tables need to be populated, and there's a huge amount of user registers that control the runtime behavior of the instructions.
Just in case, cc @sstefan1 who is planning to merge the initial support for ESP32-S3 DSP instructions into Espressif's LLVM fork soon.
Don't want to step on anyone's toes, if @sstefan1 has already started work on this, I won't pursue, unless I can somehow assist?
Currently we have all ESP32-S3 DSP instructions implemented in LLVM. All instructions are available in clang through clang's builtins, which translate to llvm intrinsics and then to appropriate instructions.
For example:
__builtin_xtensa_ee_vld_128_ip(1, data, 0); --> ee.vld.128.ip q1, a9, 0
This work should be merged soon.
I need to investigate how that should be done in Rust, though. If anybody already knows, please let me know.
@sstefan1 Well, we don't currently have any Xtensa intrinsics support, but it would go into stdarch/core_arch here. It lives out of tree, so it will need to be forked by the esp-rs org, and the .gitmodules at esp-rs/rust should point to it. Then they can be added just like in clang. This can be done separately, whenever, and is not a blocker for initial support.
For inline assembly support, it's much simpler, since most of the base work has already been done by @MabezDev here. All that needs to be done is to add the qregs support and the user registers (FFT_WIDTH, QACC_H_0, etc.).
If you can get a branch/PR running on espressif/llvm-project, I can start work on initial support/testing in Rust for this. I'm already working on those files, adding Rust support for the clamps/minmax features based on this PR.
Ok, it looks to me like there is enough incentive and brainpower to tackle the ESP+Rust+DSP challenge. How are we going to coordinate this? Who does what?
@ramtej as soon as the esp32s3 changes land on espressif/llvm-project or one of its branches, I'll start working on the PR for esp-rs/rust.
I've merged the initial support for ESP32-S3 DSP instructions in LLVM. Keep in mind that builtins support is not yet very well tested; I will be doing testing in the following weeks.
One more note: llvm-objdump currently doesn't work correctly with DSP instructions. To check the generated assembly, it is best to use llc.
Here's my branch with experimental support for this in Rust. I ran into some issues/funky business (with the immediate addressing constant in ee.vld.128.ip, which I think is an issue with LLVM, since the constant is correct in the generated LLVM IR; I'll investigate later), but the core of it works.
@sstefan1
declare void @llvm.xtensa.ee.vld.128.ip(i32, i32, i32) nounwind

define void @test2(i32 %p) {
  tail call void @llvm.xtensa.ee.vld.128.ip(i32 5, i32 %p, i32 16)
  ret void
}
If I generate assembly with llc, I get the correct output:
// llc -O1 -mtriple=xtensa -mcpu=esp32s3 < xtensa-s3-ee-vld-128-ip.ll
.text
.file "<stdin>"
.global test2 # -- Begin function test2
.p2align 2
.type test2,@function
test2: # @test2
.cfi_startproc
# %bb.0:
entry a1, 32
.cfi_def_cfa_offset 32
ee.vld.128.ip q5, a2, 16
retw.n
.Lfunc_end0:
.size test2, .Lfunc_end0-test2
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits
But if I generate an object file with llc and run the GNU objdump (xtensa-esp32s3-elf-objdump), I get an issue:
// llc -O1 -mtriple=xtensa -mcpu=esp32s3 -filetype=obj < xtensa-s3-ee-vld-128-ip.ll > test.o
// xtensa-esp32s3-elf-objdump -D test.o
test.o: file format elf32-xtensa-le
Disassembly of section .text:
00000000 <test2>:
0: 004136 entry a1, 32
3: a39024 ee.vld.128.ip q5, a2, 0x100
6: f01d retw.n
We get 0x100 instead of 0x10, and this happens to all other numbers, only 0 is unaffected.
I will have to look into it. BTW, I'm not sure whether xtensa-esp32s3-elf-objdump disassembles DSP instructions correctly either; I had some problems while testing. I will check LLVM as well, but just mentioning that I had issues with disassembling too.
Sure, I didn't trust the objdump blindly; I made sure I tested it on one of my ESP32-S3 chips. I ran this memcpy test:
#[repr(align(16))]
pub struct AlignedArray<const N: usize>([u8; N]);

#[inline(never)]
pub unsafe fn aligned_memcpy_test<const N: usize>(dst: &mut AlignedArray<N>, src: &AlignedArray<N>) {
    let src_addr = src.0.as_ptr();
    let dst_addr = dst.0.as_mut_ptr();
    assert!(src_addr.is_aligned_to(16));
    assert!(dst_addr.is_aligned_to(16));
    assert_eq!(N % 32, 0);
    for _ in 0..N / 32 {
        core::arch::asm!(
            r#"
            ee.vld.128.ip q0, {src_addr}, 16
            ee.vld.128.ip q1, {src_addr}, 16
            ee.vst.128.ip q0, {dst_addr}, 16
            ee.vst.128.ip q1, {dst_addr}, 16
            "#,
            src_addr = in(reg) src_addr,
            dst_addr = in(reg) dst_addr,
        );
    }
}
src: [0, 1, 2, 3, 4, 5, 6, 7, ..., 0, 1, 2, 3, 4, 5, ...]
dst: [0, 1, 2, 3, ..., 14, 15, 0, 0, 0, ..., 0, 1, 2, 3, ..., 14, 15, ..., 0, 0, 0, ...]

So the loop is jumping with an offset of 320 instead of 32, just like in the objdump output.
Hi @zRedShift, I was on vacation and wasn't able to look at this earlier. The LLVM backend was encoding the imm16 offset as the actual immediate value, but it should actually encode the multiple of 16. So for 32 it should encode 0x02, for 48 it should encode 0x03, and so on. I will post a fix internally, and we should have it on the GitHub repo soon.
@sstefan1 Thank you. I suspected something like that but didn't have time to look into it over the last few weeks. Glad it's been resolved.
I remember also encountering that it was impossible to use loop/loopnez/loopgtz in inline assembly, but since that's not related to the DSP instructions, and hardware loops are already planned to be fixed/included in the future, I didn't investigate further.
Bumping this issue, as I was researching memcpy on the ESP32-S3. It turns out that an aligned memcpy with EE.VLD instructions can be 6 times faster than the regular memcpy, quoting https://github.com/project-x51/esp32-s3-memorycopy:
I (404) Memory Copy: Allocating 2 x 100kb in IRAM, alignment: 32 bytes
I (464) Memory Copy: 8-bit for loop copy IRAM->IRAM took 819922 CPU cycles = 28.59 MB/s
I (514) Memory Copy: 16-bit for loop copy IRAM->IRAM took 205776 CPU cycles = 113.90 MB/s
I (564) Memory Copy: 32-bit for loop copy IRAM->IRAM took 103383 CPU cycles = 226.71 MB/s
I (614) Memory Copy: 64-bit for loop copy IRAM->IRAM took 77682 CPU cycles = 301.71 MB/s
I (664) Memory Copy: memcpy IRAM->IRAM took 64323 CPU cycles = 364.37 MB/s
I (714) Memory Copy: async_memcpy IRAM->IRAM took 408520 CPU cycles = 57.37 MB/s
I (764) Memory Copy: PIE 128-bit (16 byte loop) IRAM->IRAM took 19498 CPU cycles = 1202.05 MB/s
I (814) Memory Copy: PIE 128-bit (32 byte loop) IRAM->IRAM took 13095 CPU cycles = 1789.81 MB/s
I (864) Memory Copy: DSP AES3 IRAM->IRAM took 15813 CPU cycles = 1482.17 MB/s
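On the units in that log: the "MB/s" figures appear to work out if one assumes a 240 MHz CPU clock, 100 KB = 102,400-byte buffers, and MB meaning MiB. Those assumptions are mine, not stated by the linked repo, but a small sanity-check helper reproduces the reported numbers:

```rust
// Convert a cycle count for copying `bytes` bytes into MiB/s, assuming
// the given CPU clock. Assumptions: 102,400-byte copies and a 240 MHz
// clock, which match the figures in the log above.
fn throughput_mib_s(bytes: u64, cycles: u64, clock_hz: u64) -> f64 {
    (bytes as f64) * (clock_hz as f64) / (cycles as f64) / (1024.0 * 1024.0)
}
```

For example, the 8-bit loop's 819,922 cycles come out to about 28.59 MiB/s, and the PIE 128-bit (32-byte loop)'s 13,095 cycles to about 1789.8 MiB/s, matching the log.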
It would be great if we could have this in Rust (or is it already done?).
Just tested the memcpy example above on the current esp-rs/rust, and it seems to work well.
Hi, I'm attempting to use the SIMD instructions with the latest 1.80 release. Most instructions work as intended, but I've encountered a number of misassemblies, especially among the arithmetic + load/store instructions. As an example:
asm!(
    "NOP",
    "NOP",
    "EE.VADDS.S8.LD.INCP q0, a15, q1, q2, q3",
    "NOP",
    "NOP",
)
Ends up as
420876c5: 0020f0 nop
420876c8: 0020f0 nop
420876cb: cf .byte 0xcf
420876cc: 0299 s32i.n a9, a2, 0
420876ce: f01c movi.n a0, 31
420876d0: f00020 subx8 a0, a0, a2
420876d3: f00020 subx8 a0, a0, a2
(as disassembled by xtensa-esp-elf-objdump). As you can see, the instruction bytes are quite wrong, and executing them does lead to IllegalInstruction exceptions.
So far I've observed this for instructions in the *.LD/ST.INCP group at least, but I would not be surprised if more were broken.
@Noxime The latest 1.82 toolchain includes LLVM 18, which I believe has more (all?) of these instructions implemented. Please retry and file a new issue if it's still occurring.
@ProfFan It would be great if we can have this in Rust (or is it already done?)
memcpy is a weak symbol in compiler-builtins, so you can override it (we already use the ROM memcpy, which might already do this, btw).
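A hedged sketch of what such an override's body could look like. The name `dsp_memcpy` is hypothetical, and the body here is a plain byte copy; on the S3, the copy loop is where the EE.VLD/EE.VST sequence from earlier in the thread would go:

```rust
// Hypothetical memcpy replacement body. Because memcpy in
// compiler-builtins is a weak symbol, a crate can supply its own strong
// #[no_mangle] extern "C" definition with this signature; here it is a
// free function so the sketch stays self-contained.
unsafe fn dsp_memcpy(dst: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    // Plain scalar copy; a real S3 version would use 128-bit EE.VLD/EE.VST
    // for the aligned bulk and fall back to bytes for head/tail.
    for i in 0..n {
        *dst.add(i) = *src.add(i);
    }
    dst
}
```

Safety caveat: like C's memcpy, this requires non-overlapping, valid regions of at least `n` bytes.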
Closing this for now.
I am currently benchmarking some DSP routines on the ESP32 S3 platform in Rust. Several issues have already arisen, see https://github.com/esp-rs/rust/issues/180.
Upon reading the 'ESP32-S3 Technical Reference Manual', it became apparent that the S3 platform implements some SIMD as well as DSP operations in hardware, such as EE.FFT.R2BF.S16 or EE.CMUL.S16. I would be willing to invest some time and implement a hardware-accelerated FFT. I am more of a mathematician and do not know the ESP32 architecture well enough, so I need some support. Would it therefore be possible for someone to guide me and show me where I need to start? I looked at the SHA HW acceleration and can understand most things, but not all.
Thanks, Jiri