LeelaChessZero / lczero-training

For code etc relating to the network training process.

Pipeline script for SL training #155

Closed teck45 closed 2 years ago

teck45 commented 3 years ago

This script feeds training data to train.py. You set the training window and how much data to shift on every train.py launch. At the same time, the script moves the test data, using the same-named folders generated by slsplit.sh; it can be edited to move test data by size instead, independent of folder names. Because whole folders are moved rather than individual files, the script is very fast. It relies on the data naming used by the storage server.

To run in a cycle on the FIFO principle (first in, first out), at the end of a cycle the script renames every folder inside the working folder (LCDATA) by prepending 1 to its name, so that those folders move out first. Each folder is later renamed back to its original name (name -> 1name -> name).

Because the script contains some if checks to keep iterating across the two folder namings, the folders should be named as follows: LCDATA and TESTLCDATA for the working folders (where train.py looks), STORAGE and TESTSTORAGE for the folders new data is moved from, and RESERVE and TESTRESERVE for the folders that old data from LCDATA and TESTLCDATA is moved to. The names can be edited, but there is no need to do so; just use this naming scheme. All 6 folders should be inside one folder.

The script reads the number of steps done from the text file trainstepslog.txt inside the tf folder and writes the updated number back there. To stop the script smoothly (and keep leelalogs clean), write the stop word there ("stop" without the quotes); the script will stop smoothly and overwrite stop with the stepsdone number. The script has been tested and runs fast without errors; it is easy to use and saves a lot of time!
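The window shift described above can be pictured with a minimal sketch. It is not the script from this pull request: the function names move_first and shift_window are hypothetical, and it assumes that the same number of folders leaves LCDATA as enters it, and that "first" means first in the shell's sorted glob order.

```shell
# Hypothetical sketch of one window shift; not the script from this PR.
# move_first COUNT SRC DST TESTSRC TESTDST
# Moves the first COUNT folders (sorted glob order) from SRC to DST and, when a
# folder with the same name exists in TESTSRC, moves that one to TESTDST.
move_first () {
  local count="$1" src="$2" dst="$3" testsrc="$4" testdst="$5"
  local moved=0 path name
  for path in "$src"/*/; do
    [ -d "$path" ] || continue            # glob matched nothing
    [ "$moved" -lt "$count" ] || break    # window size reached
    name=$(basename "$path")
    mv "$src/$name" "$dst/"
    if [ -d "$testsrc/$name" ]; then
      mv "$testsrc/$name" "$testdst/"
    fi
    moved=$((moved + 1))
  done
}

# shift_window ROOT COUNT
# ROOT is the folder that holds all 6 folders. Old data leaves first, then new
# data enters, so freshly moved folders are never picked as "old".
shift_window () {
  local root="$1" count="$2"
  move_first "$count" "$root/LCDATA" "$root/RESERVE" "$root/TESTLCDATA" "$root/TESTRESERVE"
  move_first "$count" "$root/STORAGE" "$root/LCDATA" "$root/TESTSTORAGE" "$root/TESTLCDATA"
}
```

Under these assumptions, calling shift_window with the parent folder and a count of 2 would move two folders out to RESERVE and two in from STORAGE before the next train.py launch. Only renames within one filesystem are involved, which is why moving folders is fast.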

teck45 commented 3 years ago

If the data handled by the script is needed somewhere else, the folders may need to be renamed back (1name -> name). This can be done with the script below: it takes a folder path as its argument and renames (1name -> name) all folders inside that folder. PS: within the pipeline script itself it is completely fine to stop and resume; the script manages the cycle and the renaming automatically. This helper is only for the case where the data is to be used for other purposes.

shell script

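# Strip the leading "1" from every folder name inside the given folder (1name -> name).
function delfirstchar () {
  local INPUTDIR="$1" # folder whose subfolders are renamed if needed (first char 1 is removed)
  local DIR A
  for DIR in "$INPUTDIR"/*/; do
    [[ -d "$DIR" ]] || continue # glob matched nothing
    A=$(basename "$DIR")
    # String comparison: -eq would compare arithmetically and misread non-numeric characters.
    # The length check skips a folder named just "1", which would otherwise get an empty name.
    if [[ "${A:0:1}" == "1" && ${#A} -gt 1 ]]; then
      mv "${DIR%/}" "$INPUTDIR/${A:1}"
    fi
  done
}
delfirstchar "$1"

teck45 commented 3 years ago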

To stop the script, write stop into the trainstepslog.txt file inside the tf folder. This takes one second with the following command:

echo stop >"/train/dev/lczero-training/tf/trainstepslog.txt"

The script will stop smoothly with a clean leelalogs folder and overwrite the stop word with the stepsdone integer, so the training steps stay in order :)
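For readers who want to picture the stop mechanism, here is a minimal sketch of such a check. It is an illustration, not the code from the pipeline script: the function name stop_requested is hypothetical, and it assumes the pipeline knows the current step count at the moment it reads the file.

```shell
# Hypothetical sketch of the stop-word check; not the code from this PR.
# stop_requested LOGFILE STEPSDONE
# Returns 0 when LOGFILE contains the stop word, after overwriting it with
# STEPSDONE so the step count is preserved. Returns 1 otherwise.
stop_requested () {
  local logfile="$1" stepsdone="$2" content
  content=$(tr -d '[:space:]' < "$logfile")   # ignore the trailing newline from echo
  if [ "$content" = "stop" ]; then
    echo "$stepsdone" > "$logfile"
    return 0
  fi
  return 1
}

# Possible use between two train.py launches (path and variable are assumptions):
#   if stop_requested "/train/dev/lczero-training/tf/trainstepslog.txt" "$STEPSDONE"; then
#     exit 0
#   fi
```

Checking the file only between train.py launches is what would let the current run finish cleanly before the script exits.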