Training script crashes after hitting a checkpoint

LeelaChessZero / lczero-training

For code etc relating to the network training process.

147 stars 119 forks source link

Hi, I tried discussing this on Discord, but I didn't see a clear response. The issue that I am facing currently is that after every checkpoint, or in some cases after every 1000 cases an assertion error is thrown. message.txt I have tried various fixes to this problem (changing the parser, changing CUDA version, changing Python version, changing TF version etc). But it continues to persist. I referred the following link and tried to fix it, but it still didn't work: https://github.com/tensorflow/tensorflow/issues/35100 Please help me rectify this issue.

My current PC configuration: 8 GB RAM Intel i7-4790K NVIDIA RTX 2070 SUPER 1TB SSD

My current requirement setup: CUDA 11.3 CUDNN 8.2.1 Python 3.9.5 TF-Nightly GPU (2.7.0 dev)

Thank you.

LeelaChessZero / lczero-training

Training script crashes after hitting a checkpoint #159