LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow; all other backends have now been ported here.
GNU General Public License v3.0
2.43k stars · 528 forks

build.sh depends on bash / LS15.0 vs OpenCL on M1 #1523

Open gcp opened 3 years ago

gcp commented 3 years ago

Applies to v0.26.3 and current git master.

Building with ./build.sh generates an lc0 that will fail during OpenCL tuning. If a tuning already exists, the program generates bogus output:

Selected device: Apple M1
with OpenCL 1.2 capability.
Loaded existing SGEMM tuning for batch size 16.
Wavefront/Warp size: 32

Max workgroup size: 256
Max workgroup dimensions: 256 256 256
info depth 1 seldepth 2 time 8052 nodes 3 score cp 24 nps 10 tbhits 0 pv d2d4 g8f6
info depth 2 seldepth 3 time 8351 nodes 5 score cp 23 nps 8 tbhits 0 pv d2d4 g8f6 c2c4
info depth 2 seldepth 4 time 8511 nodes 11 score cp 21 nps 14 tbhits 0 pv e2e4 e7e5 g1f3 b8c6
info depth 2 seldepth 4 time 8512 nodes 12 score cp 22 nps 16 tbhits 0 pv d2d4 g8f6 c2c4 e7e6
info depth 3 seldepth 5 time 8757 nodes 19 score cp 73 nps 19 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 b1d2
info depth 3 seldepth 5 time 8827 nodes 22 score cp 81 nps 20 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6
info depth 3 seldepth 6 time 9151 nodes 38 score cp -2 nps 27 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6 d8h4
info depth 3 seldepth 6 time 9350 nodes 51 score cp -2 nps 32 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 b1d2 f8a3
Assertion failed: (0.0f <= p && p <= 1.0f), function SetP, file ../../src/mcts/node.cc, line 165.
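For context on the assertion: SetP stores a move prior, and the priors come out of a softmax over the policy head, so each one must be a probability in [0, 1]. A corrupted backend result breaks exactly that invariant. An illustrative sketch (Python, not lc0's actual C++):

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max so exp() cannot overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

priors = softmax([1.2, -0.5, 3.0])
# the invariant the SetP assertion enforces on every prior:
assert all(0.0 <= p <= 1.0 for p in priors)
```

If the GPU backend returns garbage for the policy head, the values reaching SetP are no longer softmax outputs, and the assertion fires.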

Surprisingly, brew install -s lc0 does work. After a lot of head-scratching, I found that

meson -Dgtest=false build/release
cd build/release
ninja -v

generates a working executable. And indeed, if we check the homebrew formula: https://github.com/Homebrew/homebrew-core/commit/68a8bfe6f9faa8fedcaff019ae3028f694b06907#diff-aed5de394cd8aac8dfcc5a575306beea0f6b1e705e6daf353f6b6be08eba8ea9R27

Unfortunately there's no comment about why that was added. But the above results are 100% reproducible for me.

gsobala commented 3 years ago

Completely unable to reproduce this using the latest git master on a MacBook Air M1 / Big Sur 11.2.1, so what network were you tuning / running?

(Homebrew excluding gtest may be more because of a bug in gtest causing it to fail on Macs rather than anything else #1439)

gcp commented 3 years ago

Happens with all networks I tested:

256x20-t40-1541.pb.gz  
J104.1-30  
LS15.0.net 
42850.net              
J94-100
gcp commented 3 years ago

I realize this thing sounds crazy but it is very reproducible. At first I thought the M1 OpenCL drivers just didn't work (wouldn't be too surprising) until I read elsewhere that they do!

I tried emptying ccache, didn't help. I tried a completely fresh checkout of lc0 (in a new directory), didn't help.

It's only OpenCL that is affected, vecLib/Accelerate works.

gsobala commented 3 years ago
(venv) george@Georges-Air release % file lc0
lc0: Mach-O 64-bit executable arm64
(venv) george@Georges-Air release % md5 lc0
MD5 (lc0) = f554ba74075e80891f7546a3377dcb6a
(venv) george@Georges-Air release % ./lc0 -w ~/pgn/J94-100 
       _
|   _ | |
|_ |_ |_| v0.28.0-dev+git.b1bf3f3 built Feb 12 2021
go nodes 5000
Loading weights file from: /Users/george/pgn/J94-100
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 (Dec 21 2020 17:26:51)
Platform profile: FULL_PROFILE
Platform name:    Apple
Platform vendor:  Apple
Device ID:      0
Device name:    Apple M1
Device type:    GPU
Device vendor:  Apple
Device driver:  1.2 1.0
Device speed:   1000 MHZ
Device cores:   8 CU
Device score:   112
Selected platform: Apple
Selected device: Apple M1
with OpenCL 1.2 capability.
Loaded existing SGEMM tuning for batch size 16.
Wavefront/Warp size: 32

Max workgroup size: 256
Max workgroup dimensions: 256 256 256
info depth 1 seldepth 2 time 2819 nodes 6 score cp 10 nps 32 tbhits 0 pv d2d4 g8f6
info depth 2 seldepth 3 time 3136 nodes 18 score cp 7 nps 35 tbhits 0 pv g2g3 d7d5 g1f3
info depth 2 seldepth 4 time 3539 nodes 28 score cp 8 nps 30 tbhits 0 pv g2g3 g7g6 c2c4 c7c6
info depth 3 seldepth 4 time 3566 nodes 53 score cp 8 nps 56 tbhits 0 pv g1f3 d7d5 d2d4 g8f6
info depth 3 seldepth 5 time 4260 nodes 73 score cp 8 nps 44 tbhits 0 pv g1f3 d7d5 g2g3 c8g4 c2c4
info depth 3 seldepth 5 time 4317 nodes 150 score cp 9 nps 89 tbhits 0 pv c2c4 e7e5 g2g3 d7d5 c4d5
info depth 3 seldepth 6 time 4676 nodes 166 score cp 8 nps 81 tbhits 0 pv g1f3 d7d5 g2g3 c8g4 c2c4 g4f3
info depth 4 seldepth 6 time 5102 nodes 223 score cp 11 nps 90 tbhits 0 pv d2d4 d7d5 c2c4 c7c6 b1c3 g8f6
info depth 4 seldepth 7 time 5709 nodes 275 score cp 10 nps 89 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 e2e3 d7d5 b1c3
info depth 4 seldepth 8 time 6400 nodes 340 score cp 10 nps 90 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 e2e3 d7d5 b1c3
info depth 4 seldepth 9 time 8412 nodes 679 score cp 10 nps 117 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7
info depth 5 seldepth 9 time 9061 nodes 779 score cp 10 nps 121 tbhits 0 pv d2d4 d7d5 c2c4 e7e6 b1c3 g8f6 c4d5 e6d5
info depth 5 seldepth 10 time 10695 nodes 915 score cp 10 nps 113 tbhits 0 pv d2d4 d7d5 c2c4 e7e6 b1c3 g8f6 c4d5 e6d5
info depth 5 seldepth 11 time 12518 nodes 1285 score cp 10 nps 130 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2
info depth 5 seldepth 11 time 17588 nodes 1842 score cp 10 nps 123 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6
info depth 5 seldepth 12 time 18067 nodes 2057 score cp 10 nps 133 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6
info depth 5 seldepth 12 time 23119 nodes 3149 score cp 10 nps 153 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6 d2f4
info depth 6 seldepth 12 time 24265 nodes 3368 score cp 10 nps 155 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6 d2f4
info depth 6 seldepth 13 time 26434 nodes 3624 score cp 10 nps 152 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6 d2f4
info depth 6 seldepth 14 time 30262 nodes 4337 score cp 10 nps 156 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6 d2f4
bestmove d2d4 ponder g8f6
quit
(venv) george@Georges-Air release % 

Your speed in the example you gave is very slow. What net was that? Are you sure you built an ARM64 binary?

gcp commented 3 years ago

Yes, it's an ARM64 binary. The speed is low because of the 8-second delay before it starts. The working executable reaches speeds similar to yours.

gsobala commented 3 years ago

I don't get an 8 second delay. Which net? Can we have a look at your meson-log.txt in ../build/release/meson-logs from the failing compile?

gcp commented 3 years ago

I don't get an 8 second delay.

Well, it's 3 seconds in your log, but that did give me an idea...

Yes, it's an ARM64 binary.

Hohoho, the broken one isn't!

So enabling tests causes the build output to be an x86_64 executable, for some reason.
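One way to catch this regardless of what the build log claims is to inspect the Mach-O header directly. A hedged Python sketch (the magic and cputype constants are from Apple's mach-o/loader.h; this handles only thin 64-bit binaries, not fat/universal ones):

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF        # thin 64-bit Mach-O, little-endian on disk
CPU_TYPE_X86_64 = 0x01000007
CPU_TYPE_ARM64 = 0x0100000C

def macho_arch(path):
    """Return 'arm64', 'x86_64', or None for anything else."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        return None
    magic, cputype = struct.unpack("<II", header)
    if magic != MH_MAGIC_64:
        return None
    return {CPU_TYPE_ARM64: "arm64", CPU_TYPE_X86_64: "x86_64"}.get(cputype)
```

On an M1, `macho_arch("build/release/lc0")` returning "x86_64" means the whole run goes through Rosetta, OpenCL tuning included.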

gcp commented 3 years ago

meson-log.txt (broken)

meson-log.txt (correct)

gsobala commented 3 years ago

Can I check whether

meson build/release
cd build/release
ninja -v

also fails and produces an x86 binary?

Incidentally I had problems in November on M1 unintentionally ending up with x86 binaries, which turned out to be due to an x86 version of ninja on the system.

My compiles above were done without homebrew (I had compiled cmake and ninja from source) but I have just tried with homebrew 3.0 and everything seems fine, arm64 binaries produced. Can I therefore also check that you are on homebrew 3.0 and not an earlier version that is not as M1 aware?

borg323 commented 3 years ago

Comparing the two meson logs, I see this check is not executed in the working one, so that -march=native is not added to the compiler options. Then I noticed that this is not a release build, which may explain what happens. @gcp can you confirm whether this is the actual issue and not gtest?

gcp commented 3 years ago

also fails and produces an x86 binary?

That generates an arm64 binary. But:

morbo@MacBook-Air:~/git/lc0/build/release % file lc0
lc0: Mach-O 64-bit executable arm64
morbo@MacBook-Air:~/git/lc0/build/release % ./lc0 -w ~/LS15.0.net
       _
|   _ | |
|_ |_ |_| v0.28.0-dev+git.b1bf3f3 built Feb 12 2021
go infinite
Loading weights file from: /Users/morbo/LS15.0.net
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 (Dec 21 2020 17:26:51)
Platform profile: FULL_PROFILE
Platform name:    Apple
Platform vendor:  Apple
Device ID:      0
Device name:    Apple M1
Device type:    GPU
Device vendor:  Apple
Device driver:  1.2 1.0
Device speed:   1000 MHZ
Device cores:   8 CU
Device score:   112
Selected platform: Apple
Selected device: Apple M1
with OpenCL 1.2 capability.
Loaded existing SGEMM tuning for batch size 16.
Wavefront/Warp size: 32

Max workgroup size: 256
Max workgroup dimensions: 256 256 256
info depth 1 seldepth 2 time 9247 nodes 3 score cp 24 nps 10 tbhits 0 pv d2d4 g8f6
info depth 2 seldepth 3 time 9752 nodes 5 score cp 23 nps 6 tbhits 0 pv d2d4 g8f6 c2c4
info depth 2 seldepth 4 time 9917 nodes 11 score cp 21 nps 11 tbhits 0 pv e2e4 e7e5 g1f3 b8c6
info depth 2 seldepth 4 time 9923 nodes 12 score cp 22 nps 12 tbhits 0 pv d2d4 g8f6 c2c4 e7e6
info depth 3 seldepth 5 time 10150 nodes 19 score cp 73 nps 15 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 c1h6
info depth 3 seldepth 5 time 10234 nodes 22 score cp 81 nps 17 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6
info depth 3 seldepth 6 time 10569 nodes 38 score cp -2 nps 23 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6 d8h4
info depth 3 seldepth 6 time 10759 nodes 51 score cp -2 nps 28 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 c1h6 f8a3
Assertion failed: (0.0f <= p && p <= 1.0f), function SetP, file ../../src/mcts/node.cc, line 165.
zsh: abort      ./lc0 -w ~/LS15.0.net

Crazy!

gcp commented 3 years ago

Can I therefore also check that you are on homebrew 3.0 and not an earlier version that is not as M1 aware?

It's

Homebrew 3.0.1
Homebrew/homebrew-core (git revision c941c; last commit 2021-02-12)
Homebrew/homebrew-cask (git revision 53f32; last commit 2021-02-12)

I have the recommended setup, with x86 homebrew in /usr/local (and behind an "ibrew" alias) and arm64 homebrew in /opt/homebrew.

When diffing the meson files, I did notice one of them got an extra /usr/local prefix added somewhere.

gcp commented 3 years ago

Now I did a build with

meson -Dgtest=false --buildtype release build/release
cd build/release
ninja -v

and I get the output:

morbo@MacBook-Air:~/git/lc0/build/release % ./lc0 -w ~/LS15.0.net
       _
|   _ | |
|_ |_ |_| v0.28.0-dev+git.b1bf3f3 built Feb 12 2021
go infinite
Loading weights file from: /Users/morbo/LS15.0.net
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 (Dec 21 2020 17:26:51)
Platform profile: FULL_PROFILE
Platform name:    Apple
Platform vendor:  Apple
Device ID:      0
Device name:    Apple M1
Device type:    GPU
Device vendor:  Apple
Device driver:  1.2 1.0
Device speed:   1000 MHZ
Device cores:   8 CU
Device score:   112
Selected platform: Apple
Selected device: Apple M1
with OpenCL 1.2 capability.
Loaded existing SGEMM tuning for batch size 16.
Wavefront/Warp size: 32

Max workgroup size: 256
Max workgroup dimensions: 256 256 256
info depth 1 seldepth 2 time 1850 nodes 3 score cp 24 nps 18 tbhits 0 pv d2d4 g8f6
info depth 2 seldepth 3 time 2082 nodes 5 score cp 23 nps 12 tbhits 0 pv d2d4 g8f6 c2c4
info depth 2 seldepth 4 time 2225 nodes 11 score cp 21 nps 20 tbhits 0 pv e2e4 e7e5 g1f3 b8c6
info depth 2 seldepth 4 time 2245 nodes 12 score cp 22 nps 21 tbhits 0 pv d2d4 g8f6 c2c4 e7e6
info depth 3 seldepth 5 time 2466 nodes 17 score cp 73 nps 21 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 b1d2
info depth 3 seldepth 5 time 2466 nodes 21 score cp 81 nps 27 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6
info depth 4 seldepth 6 time 2745 nodes 32 score cp -2 nps 30 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6 g8e7
info depth 3 seldepth 6 time 2746 nodes 35 score cp -2 nps 33 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1a6 g8e7
info depth 3 seldepth 6 time 2767 nodes 38 score cp 7 nps 35 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 b1d2 f8a3
info depth 3 seldepth 7 time 3168 nodes 51 score cp 20 nps 34 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 b1d2 f8a3 b2a3
info depth 4 seldepth 7 time 4459 nodes 178 score cp 27 nps 64 tbhits 0 pv g2g3 e7e5 f1h3 c7c5 g3g4
info depth 4 seldepth 8 time 4847 nodes 235 score cp 27 nps 74 tbhits 0 pv g2g3 e7e5 f1h3 c7c5 g3g4
info depth 5 seldepth 9 time 5720 nodes 323 score cp 2 nps 80 tbhits 0 pv g2g3 g8f6 f1h3 f6d5 h3g4 h7h6
info depth 4 seldepth 9 time 5999 nodes 332 score cp 2 nps 77 tbhits 0 pv g2g3 g8f6 f1h3 f6d5 h3g4 h7h6
info depth 5 seldepth 9 time 6459 nodes 404 score cp 2 nps 84 tbhits 0 pv g2g3 g8f6 f1h3 f6d5 h3g4 h7h6
info depth 4 seldepth 9 time 6871 nodes 435 score cp 2 nps 83 tbhits 0 pv g2g3 g8f6 f1h3 f6d5 h3g4 h7h6
info depth 5 seldepth 9 time 7497 nodes 499 score cp 9 nps 85 tbhits 0 pv e2e3 c7c5 g2g3 g7g6 g3g4
info depth 5 seldepth 10 time 10458 nodes 830 score cp 2 nps 94 tbhits 0 pv e2e3 c7c5 g2g3 g7g6 g3g4
info depth 5 seldepth 10 time 13429 nodes 1107 score cp 83 nps 94 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 10 time 15185 nodes 1364 score cp 63 nps 101 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 10 time 20238 nodes 1948 score cp 0 nps 105 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 11 time 22798 nodes 2347 score cp 0 nps 111 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 11 time 27803 nodes 2866 score cp 0 nps 109 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 11 time 32814 nodes 3496 score cp 0 nps 112 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 12 time 33554 nodes 3555 score cp 0 nps 111 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 12 time 38563 nodes 4309 score cp 0 nps 116 tbhits 0 pv c2c4 e7e5 g2g3 d8h4 g3g4 d7d5 g4g5 a7a6 b1a3
info depth 6 seldepth 12 time 40512 nodes 4555 score cp 15 nps 117 tbhits 0 pv d2d4 g8f6 e2e3 e7e6 g1e2 c7c5 h2h3 g7g6
info depth 6 seldepth 12 time 45594 nodes 4863 score cp 0 nps 110 tbhits 0 pv d2d4 g8f6 e2e3 e7e6 g1e2 c7c5 h2h3 g7g6
info depth 6 seldepth 12 time 49980 nodes 5900 score cp 20 nps 122 tbhits 0 pv f2f3 d7d5 e1f2 b8a6 c2c4 c8d7 g2g3 a6c5

Which, while not crashing, doesn't look correct to me either!

This is really, really strange; I've never seen anything like it. The closest guess I have is that changing random build options causes some data structure to be laid out differently, and we're dealing with a memory overwrite or misalignment problem.

borg323 commented 3 years ago

Is this last binary an arm or an x86 one?

gcp commented 3 years ago

This was an ARM one.

If I use build.sh, I get x86. Using meson directly, I get ARM.

I figured out why, I think:

build.sh contains:

#!/usr/bin/env bash

and on my machine:

% which bash
/usr/local/bin/bash

build.sh should probably just use sh, not bash.
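The failure mode here is that `#!/usr/bin/env bash` launches the first `bash` found on PATH, so a Rosetta x86_64 bash in /usr/local/bin shadows the system one. A toy Python sketch of that lookup (illustrative; `env` performs the same walk in C):

```python
import os

def resolve_like_env(cmd, path):
    """Return the first executable named `cmd` on the given PATH string,
    mimicking how '#!/usr/bin/env bash' chooses its interpreter."""
    for d in path.split(os.pathsep):
        candidate = os.path.join(d, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# With Homebrew's /usr/local/bin ahead of /bin, its (x86_64) bash wins;
# a hardcoded '#!/bin/sh' shebang avoids the PATH lookup entirely.
```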

gcp commented 3 years ago

Okay, and it's only the LS15 network that produces strange results with the ARM binary, the others are fine. So I think we can ignore that, perhaps? Is it supposed to work with git master lc0?

The conclusion is that the problem was an old x86 bash in /usr/local/bin, which was being picked up over the system bash when running the build.sh script. That's partly my fault, but I would still recommend using the system default shell.

gcp commented 3 years ago

What's really deceptive here is that build.sh prints

Host machine cpu family: aarch64
Host machine cpu: arm64

but you'll get an x86 binary.

Comparing the two meson logs, I see this check is not executed in the working one, so that -march=native is not added to the compiler options.

This turns out to be because clang rejects that option when compiling for ARM64.
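Meson's flag check boils down to "does compiling a trivial file with this flag succeed". A hedged Python sketch of that kind of probe (not meson's actual implementation; the compiler path and flag are parameters):

```python
import os
import subprocess
import tempfile

def compiler_accepts(cc, flag):
    """Try to compile an empty program with `flag`. clang targeting arm64
    macOS rejects -march=native this way, so the flag gets left out."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write("int main(void) { return 0; }\n")
        src = f.name
    try:
        result = subprocess.run([cc, flag, "-c", src, "-o", os.devnull],
                                capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(src)
```

So when the compiler is a native arm64 clang the probe fails and -march=native is dropped, while the Rosetta x86_64 toolchain accepts it, which is why the flag shows up only in the broken log.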

gcp commented 3 years ago

And this is a log showing that there are some funny things going on with LS15.0 that made debugging this harder than needed: LS15_OpenCL_M1.txt

So LS15 works after a tuning run, but not when resuming from a stored tuning.

borg323 commented 3 years ago

Can you do one more test running lc0 with --backend=check? This may help pinpoint whether the errors are in value, policy, or both.

gcp commented 3 years ago
Creating backend [eigen]...
Using Eigen version 3.3.9
Eigen max batch size is 256.
Check mode: check only with relative tolerance 1.0e-04, absolute tolerance 1.0e-05.
Check rate: 20%.
info depth 1 seldepth 2 time 2179 nodes 3 score cp 24 nps 19 tbhits 0 pv d2d4 g8f6
*** ERROR check failed for a batch of 32 both value and policy incorrect.
gsobala commented 3 years ago

I think we need to separate this out into two issues.

Firstly, there is the accidental production of an x86 binary on M1 under certain starting conditions. @gcp: why is your default bash in /usr/local/bin? The OS default is in /bin. Where did the /usr/local/bin/bash come from? Is it perhaps an x86 binary? If so, I suspect that is the explanation.

Secondly I confirm GCP's observation that the arm64 binary performs strangely under OpenCL on M1 with net LS15.0 but appears to be fine with the 60000 and 70000 series and J92 / J94.

The first run of LS15 without an existing OpenCL tuning generates a tuning and looks fine, but does generate errors with --backend=check e.g.

george@Georges-Air release % ./lc0 -w ~/pgn/LS15-20x256SE-jj-9-75000000.pb.gz --backend=check
       _
|   _ | |
|_ |_ |_| v0.28.0-dev+git.b1bf3f3 built Feb 12 2021
go nodes 5000
Loading weights file from: /Users/george/pgn/LS15-20x256SE-jj-9-75000000.pb.gz
Creating backend [check]...
Working backend set to opencl.
Reference backend set to eigen.
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 (Dec 21 2020 17:26:51)
Platform profile: FULL_PROFILE
Platform name:    Apple
Platform vendor:  Apple
Device ID:      0
Device name:    Apple M1
Device type:    GPU
Device vendor:  Apple
Device driver:  1.2 1.0
Device speed:   1000 MHZ
Device cores:   8 CU
Device score:   112
Selected platform: Apple
Selected device: Apple M1
with OpenCL 1.2 capability.
Started OpenCL SGEMM tuner with batch size 16.
Will try 578 valid configurations.
(1/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 85.0 us (6318.6 GFLOPS)
(2/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 57.7 us (9301.1 GFLOPS)
(6/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 57.5 us (9331.1 GFLOPS)
(8/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 56.8 us (9450.7 GFLOPS)
(9/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 51.9 us (10346.7 GFLOPS)
(73/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 51.0 us (10532.6 GFLOPS)
(116/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=1 47.2 us (11368.1 GFLOPS)
(193/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=2 44.6 us (12048.7 GFLOPS)
(226/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=2 43.8 us (12250.0 GFLOPS)
(284/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4 43.6 us (12308.9 GFLOPS)
Wavefront/Warp size: 32

Max workgroup size: 256
Max workgroup dimensions: 256 256 256
Creating backend [eigen]...
Using Eigen version 3.3.7
Eigen max batch size is 256.
Check mode: check only with relative tolerance 1.0e-04, absolute tolerance 1.0e-05.
Check rate: 20%.
info depth 1 seldepth 2 time 15684 nodes 3 score cp 24 nps 20 tbhits 0 pv d2d4 g8f6
info depth 2 seldepth 3 time 15908 nodes 5 score cp 23 nps 13 tbhits 0 pv d2d4 g8f6 c2c4
info depth 2 seldepth 4 time 16008 nodes 11 score cp 21 nps 23 tbhits 0 pv e2e4 e7e5 g1f3 b8c6
info depth 2 seldepth 4 time 16094 nodes 12 score cp 22 nps 21 tbhits 0 pv d2d4 g8f6 c2c4 e7e6
info depth 3 seldepth 5 time 16549 nodes 17 score cp 22 nps 16 tbhits 0 pv d2d4 g8f6 c2c4 e7e6
info depth 2 seldepth 5 time 16620 nodes 19 score cp 22 nps 17 tbhits 0 pv d2d4 g8f6 c2c4 e7e6
*** ERROR check failed for a batch of 14 policy incorrect (but value ok).
info depth 3 seldepth 5 time 16641 nodes 24 score cp 20 nps 21 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3
info depth 3 seldepth 6 time 16670 nodes 27 score cp 20 nps 23 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3
info depth 3 seldepth 7 time 16981 nodes 28 score cp 20 nps 19 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3
info depth 3 seldepth 8 time 17090 nodes 32 score cp 21 nps 20 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3
Check passed for a batch of 5.
info depth 3 seldepth 9 time 17352 nodes 36 score cp 20 nps 19 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3
Check passed for a batch of 7.
info depth 4 seldepth 10 time 17551 nodes 38 score cp 21 nps 18 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3
info depth 4 seldepth 12 time 17736 nodes 48 score cp 19 nps 21 tbhits 0 pv c2c4 e7e5 g2g3 g8f6 f1g2 f8c5 d2d3 b8c6 b1c3 e8g8 a2a3 a7a5
info depth 4 seldepth 13 time 17808 nodes 49 score cp 19 nps 21 tbhits 0 pv c2c4 e7e5 g2g3 g8f6 f1g2 f8c5 d2d3 b8c6 b1c3 e8g8 a2a3 a7a5 e2e3
*** ERROR check failed for a batch of 31 policy incorrect (but value ok).
info depth 5 seldepth 14 time 17910 nodes 65 score cp 19 nps 27 tbhits 0 pv c2c4 e7e5 g2g3 g8f6 f1g2 f8c5 d2d3 b8c6 b1c3 e8g8 a2a3 a7a5 e2e3 c5a7
info depth 5 seldepth 14 time 18147 nodes 75 score cp 20 nps 28 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2
info depth 5 seldepth 14 time 18187 nodes 84 score cp 21 nps 31 tbhits 0 pv e2e4 e7e5 g1f3 b8c6 f1b5 g8f6 e1g1
info depth 5 seldepth 14 time 18394 nodes 101 score cp 21 nps 35 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 g2g3 f8b4
info depth 6 seldepth 14 time 18834 nodes 131 score cp 21 nps 39 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 f1g2 d7d5
*** ERROR check failed for a batch of 16 policy incorrect (but value ok).

A further run of lc0 and LS15 with an existing OpenCL tuning is much more haywire with more extensive check errors just as GCP noted above.

gsobala commented 3 years ago

Actually I note I do get occasional *** ERROR check failed for a batch of xx policy incorrect (but value ok). with other nets using established tunings, but the output seems sane.

borg323 commented 3 years ago

We suspect the occasional policy errors are due to a deficiency in the check backend: it compares the full policy output, including invalid moves. When the check backend was introduced, the policy output from the backends was still softmaxed, so the accuracy limits were more meaningful.
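For reference, a tolerance check of the shape the log reports ("relative tolerance 1.0e-04, absolute tolerance 1.0e-05") can be sketched as follows; this is illustrative Python, not lc0's actual comparison code:

```python
def close_enough(ref, test, rel_tol=1e-4, abs_tol=1e-5):
    # Pass if within the absolute tolerance or within the relative
    # tolerance of the reference value, whichever is looser.
    return abs(ref - test) <= max(abs_tol, rel_tol * abs(ref))

def policy_ok(ref_policy, test_policy):
    # The batch entry fails if any single policy entry is off, which is why
    # comparing raw outputs for invalid moves can trigger false alarms.
    return all(close_enough(r, t) for r, t in zip(ref_policy, test_policy))
```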

gcp commented 3 years ago

Firstly, there is the accidental production of an x86 binary on M1 under certain starting conditions. @gcp :why is your default bash in /usr/local/bin ? The OS default is in /bin . Where did the /usr/local/bin/bash come from? Is it perhaps an x86 binary? If so, I suspect that is the explanation.

This is the default location for homebrew (x64). The ARM64 homebrew installs under /opt/homebrew. It is indeed an Intel binary. It gets run under Rosetta, which causes all spawned sub-processes to then also run under Rosetta.

The problem exists because the homebrew binaries get priority over the built-in ones; bash got installed at some point under x64 homebrew but not under ARM64, so the x64 one ended up being the active one.

I never noticed this because I don't use bash. As explained, since the build.sh script probably doesn't (need to) depend on bash-isms, it might be better for it to refer to /bin/sh to avoid such problems. It will also keep working when bash is dropped (it is already deprecated on macOS, but a very old binary is still shipped).

gsobala commented 3 years ago

Of course, OpenCL is deprecated under the latest macOS as well.

gsobala commented 3 years ago

OK, some further info on the failure with LS15. Bear in mind that I know nothing about OpenCL or GPU programming in general. However, it is clear that just reading the tuning from file fails, whereas generating a new tuning works. Therefore something executed in tune_sgemm() is critical to success. So I set up a test load_sgemm_tuners() whereby tune_sgemm() was called first, the result discarded, and then the saved tuning from file was returned. I then played around with tune_sgemm() to see what was critical (bailing out and returning a dummy value at various places in the routine).

It seems that a single call to

queue.enqueueWriteBuffer(aBuffer, CL_FALSE, 0, at_size * sizeof(float),
                         at.data());

is enough to 'fix' the problem. Now maybe an OpenCL guru can fix this.

gsobala commented 3 years ago

Tested a few other nets; the only one to also fail both policy and value using the check backend and produce nonsensical output is 20x256SE-jj-9-53420000.pb.gz:

Net | Check result | Frequency
11258-112x9-se.pb.gz | Pass |  
11258-128x10-se.pb.gz | Pass |  
20x256SE-jj-9-53420000.pb.gz | Double fail | Very frequent
703810.pb | Policy fail | Occasional
J64-210 | Policy fail | Rare
J92-330 | Pass |  
J94-100 | Pass |  
LS15-20x256SE-jj-9-75000000.pb.gz | Double fail | Very frequent
badgyal-7.pb.gz | Pass |  
maia-1900.pb.gz | Policy fail | Frequent
weights_run1_66511.pb.gz | Policy fail | Occasional
weights_run1_67512.pb.gz | Policy fail | Occasional