animetosho / rapidyenc

SIMD accelerated yEnc en/decode C library

rapidyenc on RISC-V with RVV 1.0 (Armbian Ubuntu Noble, GCC 14) #5

Open sanderjo opened 3 months ago

sanderjo commented 3 months ago

Following your advice on https://github.com/sabnzbd/sabctools/issues/116#issuecomment-2117749041 ... work with rapidyenc ... and ...

Bingo! Vector (v-prefixed) instructions in the disassembled library:

sander@bananapif3:~/git/rapidyenc/build$ objdump -d librapidyenc.so | awk '{ print $3 }' | sort -u | grep -E "^v"
vadd.vi
vadd.vv
vcompress.vm
vcpop.m
viota.m
vle8.v
vmadc.vi
vmadc.vv
vmand.mm
vmandn.mm
vmerge.vxm
vmnor.mm
vmnot.m
vmor.mm
vmsbf.m
vmseq.vi
vmseq.vv
vmseq.vx
vmsltu.vx
vmv1r.v
vmv2r.v
vmv.s.x
vmv.v.i
vmv.v.x
vmv.x.s
vmxnor.mm
vmxor.mm
vor.vx
vrgather.vv
vse8.v
vsetivli
vsetvli
vslide1down.vx
vslide1up.vx
vslidedown.vx
vsll.vi
vsrl.vi
vsrl.vv
vsub.vv
vsub.vx
vwmulu.vx
vzext.vf2
sander@bananapif3:~/git/rapidyenc/build$ cmake ..
-- The C compiler identification is GNU 14.0.1
-- The CXX compiler identification is GNU 14.0.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for posix_memalign
-- Looking for posix_memalign - found
-- Performing Test COMPILER_SUPPORTS_RVV
-- Performing Test COMPILER_SUPPORTS_RVV - Success
-- Performing Test COMPILER_SUPPORTS_ZBKC
-- Performing Test COMPILER_SUPPORTS_ZBKC - Success
-- Configuring done (5.9s)
-- Generating done (0.1s)
-- Build files have been written to: /home/sander/git/rapidyenc/build
sander@bananapif3:~/git/rapidyenc/build$ cmake --build . --config Release
[  2%] Building CXX object CMakeFiles/rapidyenc.dir/src/platform.cc.o
[  5%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder.cc.o
[  7%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_sse2.cc.o
[ 10%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_ssse3.cc.o
[ 13%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_avx.cc.o
[ 15%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_avx2.cc.o
[ 18%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_vbmi2.cc.o
[ 21%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_neon.cc.o
[ 23%] Building CXX object CMakeFiles/rapidyenc.dir/src/encoder_rvv.cc.o
[ 26%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder.cc.o
[ 28%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_sse2.cc.o
[ 31%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_ssse3.cc.o
[ 34%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_avx.cc.o
[ 36%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_avx2.cc.o
[ 39%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_vbmi2.cc.o
[ 42%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_neon.cc.o
[ 44%] Building CXX object CMakeFiles/rapidyenc.dir/src/decoder_rvv.cc.o
[ 47%] Building CXX object CMakeFiles/rapidyenc.dir/src/crc.cc.o
[ 50%] Building CXX object CMakeFiles/rapidyenc.dir/src/crc_folding.cc.o
[ 52%] Building CXX object CMakeFiles/rapidyenc.dir/src/crc_folding_256.cc.o
[ 55%] Building CXX object CMakeFiles/rapidyenc.dir/src/crc_arm.cc.o
[ 57%] Building CXX object CMakeFiles/rapidyenc.dir/src/crc_arm_pmull.cc.o
[ 60%] Building CXX object CMakeFiles/rapidyenc.dir/src/crc_riscv.cc.o
[ 63%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/code/crc32c_sse4.cc.o
[ 65%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/code/multiword_64_64_cl_i386_mmx.cc.o
[ 68%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/code/multiword_64_64_gcc_amd64_asm.cc.o
[ 71%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/code/multiword_64_64_gcc_i386_mmx.cc.o
[ 73%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/code/multiword_64_64_intrinsic_i386_mmx.cc.o
[ 76%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/code/multiword_128_64_gcc_amd64_sse2.cc.o
[ 78%] Building CXX object CMakeFiles/rapidyenc.dir/crcutil-1.0/examples/interface.cc.o
/home/sander/git/rapidyenc/crcutil-1.0/examples/interface.cc: In static member function ‘static crcutil_interface::CRC* crcutil_interface::CRC::Create(crcutil_interface::UINT64, crcutil_interface::UINT64, size_t, bool, crcutil_interface::UINT64, crcutil_interface::UINT64, size_t, bool, const void**)’:
/home/sander/git/rapidyenc/crcutil-1.0/examples/interface.cc:232:23: warning: unused parameter ‘use_sse4_2’ [-Wunused-parameter]
  232 |                  bool use_sse4_2,
      |                  ~~~~~^~~~~~~~~~
[ 78%] Built target rapidyenc
[ 81%] Building CXX object CMakeFiles/rapidyenc_shared.dir/rapidyenc.cc.o
[ 84%] Linking CXX shared library librapidyenc.so
[ 84%] Built target rapidyenc_shared
[ 86%] Building CXX object CMakeFiles/rapidyenc_static.dir/rapidyenc.cc.o
[ 89%] Linking CXX static library rapidyenc_static/librapidyenc.a
[ 89%] Built target rapidyenc_static
[ 92%] Building C object CMakeFiles/rapidyenc_cli.dir/tool/cli.c.o
[ 94%] Linking CXX executable rapidyenc_cli
[ 94%] Built target rapidyenc_cli
[ 97%] Building CXX object CMakeFiles/rapidyenc_bench.dir/tool/bench.cc.o
[100%] Linking CXX executable rapidyenc_bench
[100%] Built target rapidyenc_bench
sander@bananapif3:~/git/rapidyenc/build$ 
sanderjo commented 3 months ago
sander@bananapif3:~/git/rapidyenc/build$ ./rapidyenc_bench 
Encode (unknown): 608.665 MB/s
Decode (unknown): 778.601 MB/s
CRC32 (generic): 418.045 MB/s
CRC32 256^n: 0.413567 Mop/s

For reference: on my laptop with 11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00GHz:

(base) sander@zwart2204:~/git/rapidyenc/build$ ./rapidyenc_bench 
Encode (VBMI2): 8118.54 MB/s
Decode (VBMI2): 14876.9 MB/s
CRC32 (VPCLMUL): 20696.9 MB/s
CRC32 256^n: 63.5324 Mop/s
animetosho commented 3 months ago

Thanks for testing.

I think they compare the K1 against a Cortex A55 - so that seems to be respectable. This page says it should be around a 1.3GHz A55.

Testing on a 1.8GHz Kryo Silver (A55 derivative):

Encode (NEON): 837.907 MB/s
Decode (NEON): 850.552 MB/s
CRC32 (generic): 5196.66 MB/s
CRC32 256^n: 9.89022 Mop/s

So roughly in the same ballpark. Not quite their "2x NEON" claim, though the RVV code probably has some optimisation opportunities.
(also, do you know the clockspeed? they don't seem to advertise that)

I think the CRC32 kernel displayed above is incorrect as it should be using ARM-CRC acceleration.
I don't think your CPU supports scalar crypto (Zbc/Zbkc), so your CRC32 result is likely for the generic implementation.
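
For context on what a "generic" CRC32 path means here: without carry-less multiply (Zbc/Zbkc on RISC-V, PMULL or the CRC instructions on ARM), the checksum has to be computed with plain integer operations. The sketch below is the textbook bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320, the same CRC used by zlib and yEnc). It is only an illustration: the function name `crc32_bitwise` is made up, and this is not rapidyenc's actual generic kernel, which (judging by the crcutil sources in the build log) is presumably a much faster multi-word table approach.

```c
#include <stddef.h>
#include <stdint.h>

/* Textbook bitwise CRC-32 (reflected polynomial 0xEDB88320).
 * Hypothetical helper for illustration only; not taken from rapidyenc. */
uint32_t crc32_bitwise(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;               /* standard initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];                          /* fold the next byte into the low bits */
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones if the low bit is set, otherwise zero */
            uint32_t mask = (uint32_t)0 - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;                              /* standard final inversion */
}
```

Calling `crc32_bitwise("123456789", 9)` returns 0xCBF43926, the standard CRC-32 check value. Each input byte costs eight dependent shift/xor steps here, which is why hardware carry-less multiply or dedicated CRC instructions make such a large difference.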

sanderjo commented 3 months ago

(also, do you know the clockspeed? they don't seem to advertise that)

CPU: Spacemit X60 (8) @ 1.600GHz


sander@bananapif3:~$ neofetch 
                                 sander@bananapif3 
                                 ----------------- 
      █ █ █ █ █ █ █ █ █ █ █      OS: Armbian (24.5.0-trunk) riscv64 
     ███████████████████████     Host: spacemit k1-x deb1 board 
   ▄▄██                   ██▄▄   Kernel: 6.1.15-legacy-k1 
   ▄▄██    ███████████    ██▄▄   Uptime: 2 hours, 13 mins 
   ▄▄██   ██         ██   ██▄▄   Packages: 1321 (dpkg) 
   ▄▄██   ██         ██   ██▄▄   Shell: bash 5.2.21 
   ▄▄██   ██         ██   ██▄▄   Resolution: 1920x1080 
   ▄▄██   █████████████   ██▄▄   Terminal: /dev/pts/1 
   ▄▄██   ██         ██   ██▄▄   CPU: Spacemit X60 (8) @ 1.600GHz 
   ▄▄██   ██         ██   ██▄▄   Memory: 224MiB / 3809MiB 
   ▄▄██   ██         ██   ██▄▄
   ▄▄██                   ██▄▄                           
     ███████████████████████                             
      █ █ █ █ █ █ █ █ █ █ █

sander@bananapif3:~$ 
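
With the clock speed known, the two boards can be put on a very rough per-clock footing. The sketch below simply divides the benchmark figures quoted earlier in this thread by the clock in GHz. The function name `mb_per_s_per_ghz` is made up for illustration, and per-GHz throughput is a crude assumption-laden metric (it ignores memory bandwidth, vector width and compiler differences), so treat it as a sanity check rather than a benchmark.

```c
/* Crude per-clock normalisation: throughput in MB/s divided by clock in GHz.
 * Hypothetical helper; the inputs are the figures posted in this thread. */
double mb_per_s_per_ghz(double mb_per_s, double clock_ghz) {
    return mb_per_s / clock_ghz;
}

/* Figures from the thread, kept together for reference. */
struct bench_result {
    const char *cpu;
    double clock_ghz;
    double encode_mb_s;
    double decode_mb_s;
};

static const struct bench_result thread_results[] = {
    { "Spacemit X60 (RVV)",  1.6, 608.665, 778.601 },
    { "Kryo Silver (NEON)",  1.8, 837.907, 850.552 },
};
```

On these numbers the X60 comes out at roughly 380 MB/s per GHz for encode and 487 for decode, against roughly 466 and 473 for the Kryo Silver, which fits the "same ballpark" reading above.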
sanderjo commented 3 months ago

I think the CRC32 kernel displayed above is incorrect as it should be using ARM-CRC acceleration. I don't think your CPU supports scalar crypto (Zbc/Zbkc), so your CRC32 result is likely for the generic implementation.

There are some other Z extensions, but neither Zbc nor Zbkc:

sander@bananapif3:~$ cat /proc/cpuinfo 
processor   : 0
hart        : 0
model name  : Spacemit(R) X60
isa     : rv64imafdcv_sscofpmf_sstc_svpbmt_zicbom_zicboz_zicbop_zihintpause
mmu     : sv39
mvendorid   : 0x710
marchid     : 0x8000000058000001
mimpid      : 0x1000000049772200
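
Note that the "COMPILER_SUPPORTS_ZBKC - Success" line in the cmake output only shows that the compiler can emit those instructions, not that the CPU implements them. The isa line above is what tells us about the hardware. As a minimal sketch of checking it programmatically, assuming the extensions after the base string are separated by underscores as in this output, the hypothetical helper `isa_has_ext` below looks for a multi-letter extension as a whole token. It is not how rapidyenc does its runtime detection.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if the multi-letter extension `ext` (lower case, e.g. "zbc")
 * appears as a complete underscore-separated token in `isa`.
 * Hypothetical helper: assumes `isa` is the value of the "isa" field from
 * /proc/cpuinfo with any trailing newline already stripped. */
bool isa_has_ext(const char *isa, const char *ext) {
    size_t n = strlen(ext);
    const char *p = strchr(isa, '_');     /* skip the base part, e.g. "rv64imafdcv" */

    while (p != NULL) {
        p++;                              /* step past the underscore */
        size_t tok = strcspn(p, "_");     /* length of the current token */
        if (tok == n && strncmp(p, ext, n) == 0)
            return true;                  /* exact token match, not just a prefix */
        p = strchr(p, '_');               /* advance to the next separator, if any */
    }
    return false;
}
```

For the isa string shown above, `isa_has_ext(isa, "zicbom")` is true while `isa_has_ext(isa, "zbc")` and `isa_has_ext(isa, "zbkc")` are both false, matching the observation that the X60 lacks carry-less multiply.
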
animetosho commented 3 months ago

Thanks for the info!

By the way, if there's something else you want to test with your new board, there is RVV code in ParPar as well (which is also used as a base for par2cmdline-turbo).
If you want to test, you'll need the dev branch of the code: run cmake in the test/bench folder, build, then run ./bench-gf16 -fmuladdmp