LeelaChessZero / lczero-training

For code etc relating to the network training process.

slsplit.sh #153

Closed teck45 closed 2 years ago

teck45 commented 3 years ago

A simple script for separating training and test data. It takes a folder containing the untarred training-data folders from the server, creates directories with the same names in the test folder, and moves a given percentage of the chunks there (a Monte Carlo-style method :) ).
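For readers who want to see the idea in code, here is a minimal Python sketch of the approach described above. It is not the script from this PR: the function name `split_chunks`, its parameters, and the choice of an independent random draw per chunk (one reading of "Monte Carlo related") are assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def split_chunks(train_root, test_root, percent, seed=None):
    """Move roughly `percent` percent of the chunk files found in each
    subfolder of train_root into a same-named subfolder of test_root.

    Hypothetical helper for illustration; returns the number of files moved.
    """
    rng = random.Random(seed)
    train_root, test_root = Path(train_root), Path(test_root)
    moved = 0
    # Materialize the folder list first so moving files cannot disturb iteration.
    for folder in sorted(p for p in train_root.iterdir() if p.is_dir()):
        print("folder_in_process", folder)
        dest = test_root / folder.name
        dest.mkdir(parents=True, exist_ok=True)
        for chunk in sorted(p for p in folder.iterdir() if p.is_file()):
            # Assumption: each chunk is selected independently with
            # probability percent/100, so the moved share is only approximate.
            if rng.random() * 100 < percent:
                shutil.move(str(chunk), str(dest / chunk.name))
                moved += 1
    return moved
```

Because each chunk is drawn independently, the fraction moved per folder fluctuates around the requested percentage rather than matching it exactly. A variant that needs an exact count could pick a fixed-size sample with `random.sample` instead.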

teck45 commented 3 years ago

Before (train data du):
402M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-0917
311M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1017
328M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1517
514M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1817
478M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1917

After (train data du):
361M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-0917
281M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1017
294M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1517
462M /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1817

Test data du:
41M /content/drive/MyDrive/1pipelinescript/SPLITOUTPUT/training-run1-test60-20210316-0917
31M /content/drive/MyDrive/1pipelinescript/SPLITOUTPUT/training-run1-test60-20210316-1017
30M /content/drive/MyDrive/1pipelinescript/SPLITOUTPUT/training-run1-test60-20210316-1117
34M /content/drive/MyDrive/1pipelinescript/SPLITOUTPUT/training-run1-test60-20210316-1517

Script output:
TRAIN TEST DATA SPLITTING SCRIPT
Randomly moving 10 percent of chunks from /content/drive/MyDrive/1pipelinescript/LCDATA to /content/drive/MyDrive/1pipelinescript/folder2
folder_in_process /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-0917
folder_in_process /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1017
folder_in_process /content/drive/MyDrive/1pipelinescript/LCDATA/training-run1-test60-20210316-1117

Tilps commented 3 years ago

This is probably vastly slower than a script wrapping https://github.com/LeelaChessZero/lczero-training/blob/master/scripts/initsplit.py

teck45 commented 3 years ago

> This is probably vastly slower than a script wrapping https://github.com/LeelaChessZero/lczero-training/blob/master/scripts/initsplit.py

The script is simple and ready to use. It takes just two inputs, plus a third parameter for the percentage. (I can change it to use $1 and $2 in a second, so the script would be started as ./split.sh path/1/ path/2/.) On Colab it processed 1 GB of data per minute (100 MB of chunks moved to the test folder). On a good computer it should be much faster: on an NVMe SSD it took 12 seconds to process 1 GB, with 100 MB of data moved to the test folders. I suggest trying my script, it's really nice! It's exactly what is needed for SL runs! PS: I also have a pipeline script for SL runs that works together with this script, since it uploads and slides same-named folders for the train and test-chunks folders. The pipeline script is in the final testing stage :)
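The command-line interface described above (two paths plus a percentage parameter) might be parsed as in the following Python sketch. The function name `build_parser`, the argument names, and the default of 10 percent (taken from the "Randomly moving 10 percent" line in the sample output) are assumptions, not part of the actual script.

```python
import argparse


def build_parser():
    """Build a parser for: <train_dir> <test_dir> [percent].

    Hypothetical interface mirroring the two paths and the percentage
    parameter mentioned in the comment.
    """
    parser = argparse.ArgumentParser(
        description="Move a percentage of chunks from train folders to test folders."
    )
    parser.add_argument("train_dir", help="folder holding the untarred training folders")
    parser.add_argument("test_dir", help="folder that receives the same-named test folders")
    # Assumed default of 10 percent; the real script may have no default.
    parser.add_argument("percent", type=float, nargs="?", default=10.0,
                        help="percentage of chunks to move (default: 10)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Randomly moving {args.percent:g} percent of chunks "
          f"from {args.train_dir} to {args.test_dir}")
```

Making the percentage an optional third positional argument keeps the two-path invocation from the comment working unchanged while still letting the split ratio be overridden.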