Matthies / RubiChess

Another chess engine
GNU General Public License v3.0

INTRINSICS optimization #450

Closed. Negatil07 closed this issue 5 months ago.

Negatil07 commented 6 months ago

Please add intrinsics optimization?

Matthies commented 6 months ago

What do you mean? RubiChess already uses a lot of intrinsics in the NNUE code. Any specific suggestions what to improve?

Negatil07 commented 6 months ago

> What do you mean? RubiChess already uses a lot of intrinsics in the NNUE code. Any specific suggestions what to improve?

I mean, how can I add intrinsics to the Makefile (SMPFLAGS = -DSMP -DSMP_STATS -DUSE_INTRINSICS)?

Environment: Termux terminal on Android (Nougat), aarch64.

Correct me if I'm wrong... this is what I did.

# Makefile to compile RubiChess in a standard GNU/Makefile environment


MAKEFLAGS += --no-print-directory

# Some variables for my private build system

ANDROIDEMU = ~/Android/Sdk/emulator/emulator


SDE = ~/sde/sde

# ifeq ($(OS), Windows_NT) EXEEXT=.exe else EXEEXT= endif

CPUTEST = cputest
ARCH = native
PROFDIR = OPT
STRIP = strip

sse2 = no
ssse3 = no
popcnt = no
bmi1 = no
lzcnt = no
avx2 = no
bmi2 = no
avx512 = no
neon = no
arm64 = no
dotprod = no
zlib = no
debug = no
bits = 64

UNAME_S=$(shell uname -s)
UNAME_M=$(shell uname -m)

ifeq ($(UNAME_S),Darwin)

Use clang compiler for MacOS



ifeq ($(UNAME_S),Linux)
    COMP=clang
endif

ifeq ($(COMP),)
    COMP=gcc
endif

ifeq ($(EXE),)
    EXE=RubiChess
endif

CXXFLAGS=-std=c++11 -Wall -pedantic -Wextra -Wshadow
CFLAGS=-Wall -pedantic -Wextra -Wshadow -Wno-implicit-function-declaration

ifeq ($(debug),yes)
    CXXFLAGS += -g -O0
    CFLAGS += -g -O0
else
    CXXFLAGS += -O3 -flto=thin
    CFLAGS += -O3
endif

ifneq (, $(findstring MINGW64,$(UNAME_S)))

Always do a static built with Mingw

LDFLAGS += -static

endif ifneq (, $(findstring MINGW32,$(UNAME_S)))

Always do a static built with Mingw

LDFLAGS += -static
bits = 32

endif ifeq ($(COMP),$(filter $(COMP),gcc)) CXX=g++

Workaround OB giving wrong CC=g++

PGOEXTRACXXFLAGS='-fprofile-use=$(PROFDIR) -fno-peel-loops -fno-tracer -Wno-coverage-mismatch -fprofile-correction'


ifeq ($(COMP),$(filter $(COMP), clang ndk icx)) CXX=clang++ MYCC=clang ifeq ($(COMP),$(filter $(COMP), icx)) CXX=icpx MYCC=icx endif LDFLAGS += $(shell type lld 1>/dev/null 2>/dev/null && echo "-fuse-ld=lld") INSTRUMENTEDEXTRACXXFLAGS='-fprofile-instr-generate=$(EXE).clangprof-raw' INSTRUMENTEDEXTRALDFLAGS= PGOEXTRACXXFLAGS='-fprofile-instr-use=$(EXE).profdata' PGOEXTRALDFLAGS= PROFMERGE=llvm-profdata merge -output=$(EXE).profdata $(EXE).clangprof-raw endif

ifeq ($(UNAME_M),armv6l)

FIXME: Find better flags

CXXFLAGS += -mcpu=native
bits = 32


ifeq ($(UNAME_M),armv7l)

FIXME: Hardcoded -mfpu=neon for Neon support

CXXFLAGS += -mthumb -march=armv7-a -mfpu=neon
bits = 32

endif #

ifeq ($(UNAME_M),aarch64)

# ARM64... any better flags that work with gcc and clang??

CXXFLAGS += -mcpu=native


bits = 64


ifeq ($(UNAME_M),x86_64) ARCHFAMILY=x86 endif

ifeq ($(COMP),clang) ARCHFAMILY=android STRIP=llvm-strip EXEEXT= ifeq ($(ARCH),armv8) CXX=clang++ MYCC=clang CXXFLAGS += -march=armv8-a+fp+simd+crc+aes+sha2 -DUSE_POPCNT -DUSE_NEON=8 ARCHFAMILY=aarch64 CPUFLAGS = "armv8" endif LDFLAGS += -static-libstdc++ endif

ifeq ($(COMP),ndk) ARCHFAMILY=android STRIP=llvm-strip AR=llvm-ar EXEEXT= ifeq ($(ARCH),armv7) CXX=armv7a-linux-androideabi21-clang++ MYCC=armv7a-linux-androideabi21-clang CXXFLAGS += -mthumb -march=armv7-a -mfloat-abi=softfp -mfpu=neon endif ifneq (,$(findstring armv8,$(ARCH))) CXX=aarch64-linux-android21-clang++ MYCC=aarch64-linux-android21-clang ifneq (,$(findstring -dotprod,$(ARCH))) CXXFLAGS += -mcpu=cortex-a53+dotprod endif endif ifeq ($(ARCH),x86-32) CXX=i686-linux-android21-clang++ MYCC=i686-linux-android21-clang CPUFLAGS = "ssse3 sse2" bits = 32 endif ifeq ($(ARCH),x86-64) CXX=x86_64-linux-android21-clang++ MYCC=x86_64-linux-android21-clang CPUFLAGS = "popcnt lzcnt ssse3 sse2" endif LDFLAGS += -static-libstdc++ endif

SMPFLAGS := $(SMPFLAGS) -pthread

ZLIBDIR=zlib CXXFLAGS += -Izlib LDFLAGS += -Lzlib zlib = yes

ifeq (,$(CPUFLAGS)) ifeq ($(ARCH),native) ifeq ($(wildcard ./$(CPUTEST)),) $(shell $(CXX) $(CXXFLAGS) $(LDFLAGS) -DCPUTEST $(CPUTEST).cpp -o $(CPUTEST)) endif CPUFLAGS = "$(shell ./$(CPUTEST))" endif

Common supported x86 and ARM architectures

ifneq (,$(findstring armv8,$(ARCH)))
    CPUFLAGS = "armv8"
    bits = 64
endif
ifeq ($(findstring x86-32,$(ARCH)),x86-32)
    bits = 32
endif
ifeq ($(findstring x86-64,$(ARCH)),x86-64)
    bits = 64
endif
ifneq (,$(findstring -sse2,$(ARCH)))
    CPUFLAGS = "sse2"
endif
ifneq (,$(findstring -avx512,$(ARCH)))
    CPUFLAGS = "avx512 bmi2 avx2 bmi1 lzcnt popcnt ssse3 sse2"
endif
ifneq (,$(findstring -bmi2,$(ARCH)))
    CPUFLAGS = "bmi2 avx2 bmi1 lzcnt popcnt ssse3 sse2"
endif
ifneq (,$(findstring -avx2,$(ARCH)))
    CPUFLAGS = "avx2 bmi1 lzcnt popcnt ssse3 sse2"
endif
ifneq (,$(findstring -modern,$(ARCH)))
    CPUFLAGS = "popcnt ssse3 sse2"
endif
ifneq (,$(findstring -ssse3,$(ARCH)))
    CPUFLAGS = "ssse3 sse2"
endif
ifneq (,$(findstring -sse3-popcnt,$(ARCH)))
    CPUFLAGS = "popcnt sse2"
endif
ifneq (,$(findstring armv6,$(ARCH)))
    CPUFLAGS = ""
    bits = 32
endif
ifneq (,$(findstring armv7,$(ARCH)))
    CPUFLAGS = "neon"
    bits = 32
endif

ifneq (,$(findstring armv8,$(ARCH)))

CPUFLAGS = "neon arm64"

bits = 64


ifneq (,$(findstring -dotprod,$(ARCH))) CPUFLAGS += " dotprod" endif endif

ifneq (,$(findstring armv8,$(CPUFLAGS)))
    armv8 = yes
endif
ifneq (,$(findstring avx512,$(CPUFLAGS)))
    avx512 = yes
endif
ifneq (,$(findstring bmi2,$(CPUFLAGS)))
    bmi2 = yes
endif
ifneq (,$(findstring avx2,$(CPUFLAGS)))
    avx2 = yes
endif
ifneq (,$(findstring bmi1,$(CPUFLAGS)))
    bmi1 = yes
endif
ifneq (,$(findstring lzcnt,$(CPUFLAGS)))
    lzcnt = yes
endif
ifneq (,$(findstring popcnt,$(CPUFLAGS)))
    popcnt = yes
endif
ifneq (,$(findstring ssse3,$(CPUFLAGS)))
    ssse3 = yes
endif
ifneq (,$(findstring sse2,$(CPUFLAGS)))
    sse2 = yes
endif
ifneq (,$(findstring neon,$(CPUFLAGS)))
    neon = yes
endif
ifneq (,$(findstring arm64,$(CPUFLAGS)))
    arm64 = yes
endif
ifneq (,$(findstring dotprod,$(CPUFLAGS)))
    dotprod = yes
endif

ifeq ($(neon),no)
    ifeq (,$(findstring armv,$(UNAME_M)))
        ARCHFLAGS = -m$(bits)
    endif
endif
ifeq ($(bits),64)
    CXXFLAGS += -DIS_64BIT
endif

ifeq ($(armv8),yes)
    ARCHFLAGS += -DUSE_ARM64 -DUSE_PTHREADS
endif

ifeq ($(avx512),yes)
    ARCHFLAGS += -DUSE_AVX512 -mavx512f -mavx512bw
endif
ifeq ($(bmi2),yes)
    ARCHFLAGS += -DUSE_BMI2 -mbmi2
endif
ifeq ($(avx2),yes)
    ARCHFLAGS += -DUSE_AVX2 -mavx2
endif
ifeq ($(bmi1)$(lzcnt),yesyes)
    ARCHFLAGS += -DUSE_BMI1 -mbmi -mlzcnt
endif
ifeq ($(popcnt),yes)
    ARCHFLAGS += -DUSE_POPCNT -mpopcnt -msse3
endif
ifeq ($(ssse3),yes)
    ARCHFLAGS += -DUSE_SSSE3 -mssse3
endif
ifeq ($(sse2),yes)
    ARCHFLAGS += -DUSE_SSE2 -msse2
endif
ifeq ($(neon),yes)
    ARCHFLAGS += -DUSE_NEON
endif
ifeq ($(arm64),yes)
    ARCHFLAGS += -DUSE_ARM64
endif
ifeq ($(dotprod),yes)
    ARCHFLAGS += -DUSE_DOTPROD
endif
ifeq ($(zlib),yes)
    CXXFLAGS += -DUSE_ZLIB
    LDFLAGS += -lz
endif

DEPS = RubiChess.h

GITVER = $(shell 2>/dev/null git show --name-only --abbrev-commit --date=format:%Y%m%d | grep -i "date:" | grep -o -E '[0-9]+') GITID = $(shell 2>/dev/null git show --name-only --abbrev-commit | grep -i -o -E "ommit[[:blank:]]+[0-9a-f]{6}" | grep -o -E '[0-9a-f]+') ifneq ($(GITVER),) GITDEFINE = -D GITVER=\"$(GITVER)\" endif ifneq ($(GITID),) GITDEFINE += -D GITID=\"$(GITID)\" endif

RUBINET = $(shell grep "NNUEDEFAULT " RubiChess.h | awk '{print $$3}') RUBINETHASH = $(shell echo $(RUBINET) | awk -F'-' '{print $$2}') NETURL = $(eval WGETCMD := $(shell if hash wget 2>/dev/null; then echo "wget -qO-"; elif hash curl 2>/dev/null; then echo "curl -skL"; fi)) ifneq ($(PROXY),) WGETCMD += -e https_proxy=$(PROXY) endif

NETBIN = net.nnue ifneq ($(EVALFILE),) ifeq ($(EVALFILE),default) EMBEDFILE = $(RUBINET) else EMBEDFILE = $(EVALFILE) endif NETDEF = -DNNUEINCLUDED=$(EMBEDFILE) NETOBJ = net.o endif

ifeq ($(GITVER),) MAJORVERSION = $(shell grep "#define VERNUMLEGACY " RubiChess.h | awk '{print $$3}') else MAJORVERSION = $(GITVER) endif MINORVERSION = "" VERSION=$(MAJORVERSION)$(MINORVERSION)

.PHONY: clean profile-build gcc-profile-make clang-profile-make net arch compile profilebench instrumentedcompile pgo profile-build pgo-rename release_x86 release_arm32 release_arm64 release

default: net
	@$(MAKE) -j1 pgo MESSAGE='Compiling pgo build ...'
ifneq ($(debug),yes)
	@$(STRIP) $(EXE)$(EXEEXT)
endif

build: net arch
	@$(MAKE) compile MESSAGE='Compiling standard build ...'

arch: libclean
	@echo
	@echo "Compiler: $(COMP)"
	@echo "Arch: $(ARCH)"
	@echo "Bits: $(bits)"
	@echo "CPU features:"
	@echo "============="
	@echo "armv8 : $(armv8)"
	@echo "avx512 : $(avx512)"
	@echo "bmi2 : $(bmi2)"
	@echo "avx2 : $(avx2)"
	@echo "bmi1 : $(bmi1)"
	@echo "lzcnt : $(lzcnt)"
	@echo "popcnt : $(popcnt)"
	@echo "ssse3 : $(ssse3)"
	@echo "sse2 : $(sse2)"
	@echo "neon : $(neon)"
	@echo "arm64 : $(arm64)"
	@echo "dotprod: $(dotprod)"
	@echo "zlib : $(zlib)"
	@echo "debug : $(debug)"

net: ifeq ($(EVALFILE),$(filter $(EVALFILE),default)) ifeq ($(RUBINET),) echo "Network not found in header" else @if test -f $(RUBINET); then echo "$(RUBINET) already exists."; else echo "Downloading $(RUBINET)..."; $(WGETCMD) $(NETURL)$(RUBINET) > $(RUBINET); fi; $(eval shasum_command := $(shell if hash shasum 2>/dev/null; then echo "shasum -a 256 "; elif hash sha256sum 2>/dev/null; then echo "sha256sum "; fi)) @if [ "$(RUBINETHASH)" != $(shasum_command) $(RUBINET) | cut -c1-10 ]; then echo "Failed download or $(RUBINET) corrupted, please delete!"; exit 1; else echo "$(RUBINET) has correct hash."; fi endif endif ifneq ($(EVALFILE),) @echo Embedding networkfile $(EMBEDFILE) @cp $(EMBEDFILE) $(NETBIN) @ld -r -b binary $(NETBIN) -o $(NETOBJ) endif

$(ZLIBDIR)/libz.a: @echo Compiling zlib... @cd $(ZLIBDIR); $(MYCC) $(CFLAGS) $(ARCHFLAGS) -w -c .c; $(AR) rcs libz.a .o; cd ..


objclean: @$(RM) *.o $(AVX512EXE) $(BMI2EXE) $(AVX2EXE) $(DEFAULTEXE) $(SSSE3EXE) $(SSE2POPCNTEXE) $(LEGACYEXE) $(CPUTEST) $(NETBIN) || @echo $(RM) not available.

libclean: @$(RM) zlib/.o zlib/.a || @echo $(RM) not available.

profileclean: libclean @$(RM) -rf $(PROFDIR) || @echo $(RM) not available. @$(RM) .clangprof-raw .profdata || @echo $(RM) not available.

clean: objclean profileclean

profilebench: ifneq ($(COMP),ndk) @echo "Running bench to generate profiling data..." && ./$(EXE) -bench 1>/dev/null && ([ $$? -eq 0 ] && echo " Profiling successful!") \ || (echo " Profiling failed!" && [ "$(SDE)" != "" ] && echo " Trying to use SDE..." && $(SDE) -icx -- ./$(EXE) -bench 1>/dev/null && [ $$? -eq 0 ] && echo " Profiling successful!") \ || (echo " SDE not available or profiling with SDE failed! Profiling with native build..." && $(MAKE) profileclean && $(MAKE) compile ARCH=native EXTRACXXFLAGS=$(INSTRUMENTEDEXTRACXXFLAGS) EXTRALDFLAGS=$(INSTRUMENTEDEXTRALDFLAGS) MESSAGE='Compiling instrumented build ...' && ./$(EXE) -bench 1>/dev/null) else @([ "$(ANDROIDDEVICE)" != "" ] && adb disconnect 1>/dev/null && adb connect $(ANDROIDDEVICE) && adb root 1>/dev/null && adb shell "rm -rf /data/RubiChess;mkdir /data/RubiChess" \ && echo "Running bench on $(ANDROIDDEVICE) to generate profiling data..." \ && adb push $(EXE) $(RUBINET) /data/RubiChess 1>/dev/null \ && adb shell "cd /data/RubiChess && ./$(EXE) -bench 1>/dev/null" \ && adb pull /data/RubiChess/$(EXE).clangprof-raw . 1>/dev/null \ && echo " Profiling successful!") \ || ([ "$(ANDROIDEMU)" != "" ] && adb disconnect 1>/dev/null && ($(ANDROIDEMU) -avd $(ARCH) -no-snapshot-load -no-qt 1>/dev/null 2>&1 &) \ && echo "Wait 60 seconds for emulator startup..." && sleep 60 && adb root 1>/dev/null && adb shell "rm -rf /data/RubiChess;mkdir /data/RubiChess" \ && echo "Running bench on emulator $(ARCH) to generate profiling data..." \ && adb push $(EXE) $(RUBINET) /data/RubiChess 1>/dev/null \ && adb shell "cd /data/RubiChess && ./$(EXE) -bench 1>/dev/null" \ && adb pull /data/RubiChess/$(EXE).clangprof-raw . 1>/dev/null \ && adb emu kill 1>/dev/null 2>&1 \ && echo " Profiling successful!") \ || echo " Profiling failed!" @[ "$(ANDROIDEMU)" != "" ] && (adb emu kill 1>/dev/null 2>&1 || adb disconnect) endif


pgo: arch instrumentedcompile profilebench
	@$(PROFMERGE)
	@$(RM) ./$(EXE)
	@$(MAKE) compile EXTRACXXFLAGS=$(PGOEXTRACXXFLAGS) EXTRALDFLAGS=$(PGOEXTRALDFLAGS) MESSAGE='Compiling optimized build ...'
	@$(MAKE) profileclean
ifneq ($(debug),yes)
	@$(STRIP) $(EXE)$(EXEEXT)
endif
	@echo Binary $(EXE) created successfully.

profile-build: net
	@$(MAKE) -j1 pgo

pgo-rename: pgo
	@mv $(EXE) $(EXE)-$(VERSION)$(ARCH)
	@echo Successfully created $(EXE)-$(VERSION)$(ARCH)

release_x86:
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-avx512
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-bmi2
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-avx2
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-modern
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-ssse3
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-sse3-popcnt
	@$(MAKE) pgo-rename ARCH=x86-$(bits)-sse2
	@$(MAKE) pgo-rename ARCH=x86-$(bits)


@$(MAKE) pgo-rename ARCH=armv8

# release_arm32: @$(MAKE) pgo-rename ARCH=armv7

release_android: net @$(MAKE) pgo-rename ARCH=armv8-dotprod COMP=$(COMP) @$(MAKE) pgo-rename ARCH=armv8 COMP=$(COMP)

@$(MAKE) pgo-rename ARCH=armv7 COMP=$(COMP)

#@$(MAKE) pgo-rename  ARCH=x86-64 COMP=$(COMP)
#@$(MAKE) pgo-rename  ARCH=x86-32 COMP=$(COMP)


release: net

@$(MAKE) release_$(ARCHFAMILY)

help:
	@echo ""
	@echo "Compile RubiChess with following command:"
	@echo "make target [ARCH=arch] [COMP=compiler] [EVALFILE=networkfile|default]"
	@echo "Supported targets:"
	@echo "build standard build (use if profile-build fails for some reason)"
	@echo "profile-build profiling optimized build, the default build target"
	@echo "release build all pgo optimized binaries CPU family"
	@echo ""
	@echo "ARCH should only be set when building a binary for a hardware different from the host"
	@echo "COMP can be gcc (default) or clang (usually faster but has some more dependencies)"
	@echo "Setting EVALFILE will build binaries with network included"
	@echo ""


Matthies commented 6 months ago

The format of your (modified) Makefile got scrambled, so it is hard to find what you have changed. What I understand is that you are trying to compile on an Android box using Termux, and the compilation should use armv8 intrinsics and even the dotprod extension. I have rewritten some parts of the Makefile regarding Android, but I never tested a native compilation; I always cross-compile and run the profiling on an emulator or an Android box connected with adb.

What happens when you just run

make profile-build ARCH=armv8

If profiling doesn't work, you can try

make build ARCH=armv8

This needs the latest master; there was a bug in the make build part that was fixed yesterday.

For ARM devices supporting the armv8.2-dotprod extension you should use

make profile-build ARCH=armv8-dotprod

To embed the network file, append EVALFILE=default, but this is also untested on Android.
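The choice between ARCH=armv8 and ARCH=armv8-dotprod could be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names pick_arch and suggest_build are hypothetical and not part of RubiChess, and the sketch assumes that the Linux kernel reports the dot product extension as the word asimddp in the Features line of /proc/cpuinfo on aarch64.

```shell
#!/bin/sh
# Hypothetical helper (not part of RubiChess): choose the ARCH value for make
# from a CPU feature list such as the "Features" line of /proc/cpuinfo.
# Assumption: the dot product extension appears there as the word "asimddp".
pick_arch() {
    # Pad with spaces so that only the whole word "asimddp" matches.
    case " $1 " in
        *" asimddp "*) echo "armv8-dotprod" ;;
        *)             echo "armv8" ;;
    esac
}

# Print the suggested build command instead of running it.
suggest_build() {
    echo "make profile-build ARCH=$(pick_arch "$1")"
}

# Possible usage on the device itself:
#   features=$(grep -m1 -i '^Features' /proc/cpuinfo | cut -d: -f2)
#   suggest_build "$features"
```

The helper only prints a command, so nothing is built until the printed line is run by hand; EVALFILE=default can be appended to it as described above.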

Negatil07 commented 6 months ago

This is what I used, with the network embedded:

make -j2 profile-build ARCH=armv8 COMP=clang EVALFILE=default

Negatil07 commented 6 months ago

[Three screenshots attached: Screenshot_20240123-165347, Screenshot_20240123-165353, Screenshot_20240123-165400]

Negatil07 commented 6 months ago

RubiChess Makefile


Matthies commented 6 months ago

Please have a look at and what I wrote in new "Option 1". Compiling a PGO build via Termux works perfectly without touching my original Makefile, and intrinsics are used! For RubiChess ARM compilation the important defines are -DUSE_NEON, -DUSE_ARM64 and, if dotprod is supported, -DUSE_DOTPROD. Your defines SMPFLAGS = -DSMP -DSMP_STATS -DUSE_INTRINSICS may be useful in other engines' Makefiles, but they are useless in RubiChess.

Attached is a build log of make release COMP=clang EVALFILE=default on my branch termux, which has one very small addition: it builds not only the armv8 binary but also an armv8-dotprod binary if the CPU supports it.
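As a rough illustration of which defines matter on ARM, the sketch below maps a list of CPU flags to the three defines named above, loosely mirroring the ARCHFLAGS section of the posted Makefile. The function name arm_defines is hypothetical, and the sketch is a simplification rather than the Makefile's actual logic.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): translate a CPU flag list into the
# ARM defines mentioned in this thread. Unknown flags are ignored.
arm_defines() {
    defs=""
    for flag in $1; do
        case "$flag" in
            neon)    defs="$defs -DUSE_NEON" ;;
            arm64)   defs="$defs -DUSE_ARM64" ;;
            dotprod) defs="$defs -DUSE_DOTPROD" ;;
            # Anything else, e.g. "intrinsics", has no counterpart here.
        esac
    done
    # Strip the leading space before printing.
    echo "${defs# }"
}

# Possible usage:
#   arm_defines "neon arm64 dotprod"
```

A flag list without any of the three known words yields an empty string, which matches the point that a define like -DUSE_INTRINSICS changes nothing in a RubiChess build.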
