apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

Only Apple's Tensorflow: Segmentation fault 11 / Abort trap 6 when running KataGo's training loop #258

Open MarkTakken opened 3 years ago

MarkTakken commented 3 years ago

Hi. I have upgraded the source code of KataGo (a Go-playing program, github.com/lightvector/KataGo) to TensorFlow 2, and the resulting code (github.com/MarkTakken/KataGoTF2MacOS) works fine with the official TensorFlow 2.4.0. However, when I try to run the training loop with Apple's TensorFlow, I get a "Segmentation fault: 11" or occasionally an "Abort trap: 6" error. You can reproduce the error by running the following in the terminal (it runs smoothly with the official TensorFlow but crashes with Apple's TensorFlow):

git clone https://github.com/MarkTakken/KataGoTF2MacOS.git
cd KataGoTF2MacOS
python/selfplay/train.sh TestRun testruntraining b6c96 128 trainonly

I have also found that the error is thrown in line 725 of train.py, that is, when the actual training begins. As this appears to be an issue solely with tensorflow_macos, I would greatly appreciate it if you could help identify and fix the problem. Thank you in advance.
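In case it helps narrow down where the crash happens, here is a minimal sketch of how a Python-level traceback could be captured at the moment of the segfault or abort, using the standard-library `faulthandler` module. The helper name `enable_crash_tracebacks` is hypothetical and is not part of train.py; the assumption is that it would be called once near the top of the script, before training starts.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Make CPython dump the Python traceback of every thread when the
    process receives a fatal signal such as SIGSEGV or SIGABRT.

    Hypothetical helper for illustration; not part of train.py.
    """
    if stream is None:
        stream = sys.stderr
    # The stream must expose fileno() and stay open for the whole run.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # ... the existing training code would run after this point ...
```

The same effect should be available without editing any code by setting the environment variable `PYTHONFAULTHANDLER=1` before launching the training script. The traceback only shows which Python call was active when the crash occurred; it does not show the native stack inside the library.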

MarkTakken commented 3 years ago

P.S. The test data (produced from selfplay) that this trains on is at TestRun/shuffleddata/current/train/data0.tfrecord.