fairy-stockfish / variant-nnue-pytorch

chess variant NNUE training code for Fairy-Stockfish
https://github.com/fairy-stockfish/variant-nnue-pytorch/wiki/Introduction

Unknown error when trying to Train #28

Closed ACosmicCake closed 1 year ago

ACosmicCake commented 1 year ago

Hi,

For some reason I cannot get the code to work, and it is not showing any clear error message. Could you help me figure out what is going on?

I am trying to create an NNUE file for a 10x10 variant.

Could it be that pytorch-lightning and pytorch are incompatible? I am not sure why the assertion would fail :(

There are also two extra pieces; would I have to add them to the code manually?

Sorry if these questions are simple; I'm trying my best to learn.

Thank you so much for your dedication; the chess engine world is grateful for all this amazing work.

ERROR

(siege) C:\Users\Kosmic\Desktop\variant-nnue-pytorch-master>python train.py --smart-fen-skipping --random-fen-skipping 3 --batch-size 16384 --threads 20 --num-workers 20 --gpus 1 C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil9depth.bin C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil12depth.bin
Feature set: HalfKAv2^
Num real features: 150000
Num virtual features: 1600
Num features: 151600
Training with C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil9depth.bin validating with C:\Users\Kosmic\Desktop\Variant\Validation-data\1mil12depth.bin
Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 3
limiting torch to 20 threads.
Using log dir logs/
C:\Users\Kosmic\anaconda3\envs\siege\Lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:487: LightningDeprecationWarning: Argument `period` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5. Please use `every_n_epochs` instead. rank_zero_deprecation(
C:\Users\Kosmic\anaconda3\envs\siege\Lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:432: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None). rank_zero_warn(
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Using c++ data loader
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Ranger optimizer loaded. Gradient Centralization usage = False
Assertion failed: bits <= 6, file C:/Users/Kosmic/Desktop/variant-nnue-pytorch-master/lib/nnue_training_data_formats.h, line 662
Assertion failed: bits <= 6, file C:/Users/Kosmic/Desktop/variant-nnue-pytorch-master/lib/nnue_training_data_formats.h, line 662

  | Name         | Type                          | Params
----------------------------------------------------------
0 | input        | DoubleFeatureTransformerSlice | 78.8 M
1 | layer_stacks | LayerStacks                   | 152 K
----------------------------------------------------------
79.0 M    Trainable params
0         Non-trainable params
79.0 M    Total params
315.939   Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]C:\Users\Kosmic\anaconda3\envs\siege\Lib\site-packages\pytorch_lightning\trainer\data_loading.py:105: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance. rank_zero_warn(
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]
(siege) C:\Users\Kosmic\Desktop\variant-nnue-pytorch-master>

Environment packages:

absl-py 1.4.0 pypi_0 pypi
aiohttp 3.8.5 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
annotated-types 0.5.0 pypi_0 pypi
ansicon 1.89.0 pypi_0 pypi
anyio 3.7.1 pypi_0 pypi
arrow 1.2.3 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
backoff 2.2.1 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
bzip2 1.0.8 he774522_0
ca-certificates 2023.7.22 h56e8100_0 conda-forge
cachetools 5.3.1 pypi_0 pypi
certifi 2022.12.7 pypi_0 pypi
charset-normalizer 2.1.1 pypi_0 pypi
click 8.1.7 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
contourpy 1.1.0 pypi_0 pypi
croniter 1.4.1 pypi_0 pypi
cuda-version 11.8 h70ddcb2_2 conda-forge
cudatoolkit 11.8.0 h09e9e62_12 conda-forge
cupy 12.2.0 py311h77068d7_0 conda-forge
cycler 0.11.0 pypi_0 pypi
dateutils 0.6.12 pypi_0 pypi
deepdiff 6.3.1 pypi_0 pypi
fastapi 0.103.0 pypi_0 pypi
fastrlock 0.8.2 py311h12c1d0e_0 conda-forge
filelock 3.9.0 pypi_0 pypi
fonttools 4.42.1 pypi_0 pypi
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.6.0 pypi_0 pypi
future 0.18.3 pypi_0 pypi
google-auth 2.22.0 pypi_0 pypi
google-auth-oauthlib 1.0.0 pypi_0 pypi
grpcio 1.57.0 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
inquirer 3.1.3 pypi_0 pypi
intel-openmp 2023.2.0 h57928b3_49496 conda-forge
itsdangerous 2.1.2 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jinxed 1.2.0 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
libblas 3.9.0 17_win64_mkl conda-forge
libcblas 3.9.0 17_win64_mkl conda-forge
libffi 3.4.4 hd77b12b_0
libhwloc 2.9.1 h51c2c0f_0 conda-forge
libiconv 1.17 h8ffe710_0 conda-forge
liblapack 3.9.0 17_win64_mkl conda-forge
libxml2 2.10.4 h0ad7f3c_1
lightning 2.0.7 pypi_0 pypi
lightning-cloud 0.5.37 pypi_0 pypi
lightning-utilities 0.9.0 pypi_0 pypi
markdown 3.4.4 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
matplotlib 3.7.2 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2022.1.0 h6a75c08_874 conda-forge
mpmath 1.2.1 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
networkx 3.0 pypi_0 pypi
numpy 1.24.1 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
openssl 3.1.2 hcfcfb64_0 conda-forge
ordered-set 4.1.0 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pillow 9.3.0 pypi_0 pypi
pip 23.2.1 py311haa95532_0
protobuf 4.24.2 pypi_0 pypi
psutil 5.9.5 pypi_0 pypi
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
pyasn1 0.5.0 pypi_0 pypi
pyasn1-modules 0.3.0 pypi_0 pypi
pydantic 2.1.1 pypi_0 pypi
pydantic-core 2.4.0 pypi_0 pypi
pydeprecate 0.3.1 pypi_0 pypi
pygments 2.16.1 pypi_0 pypi
pyjwt 2.8.0 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
python 3.11.4 he1021f5_0
python-chess 0.31.4 pypi_0 pypi
python-dateutil 2.8.2 pypi_0 pypi
python-editor 1.0.4 pypi_0 pypi
python-multipart 0.0.6 pypi_0 pypi
python_abi 3.11 2_cp311 conda-forge
pytorch-lightning 1.4.9 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readchar 4.0.5 pypi_0 pypi
requests 2.28.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rich 13.5.2 pypi_0 pypi
rsa 4.9 pypi_0 pypi
setuptools 68.0.0 py311haa95532_0
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
soupsieve 2.4.1 pypi_0 pypi
sqlite 3.41.2 h2bbff1b_0
starlette 0.27.0 pypi_0 pypi
starsessions 1.3.0 pypi_0 pypi
sympy 1.11.1 pypi_0 pypi
tbb 2021.9.0 h91493d7_0 conda-forge
tensorboard 2.14.0 pypi_0 pypi
tensorboard-data-server 0.7.1 pypi_0 pypi
tk 8.6.12 h2bbff1b_0
torch 2.0.1+cu118 pypi_0 pypi
torchaudio 2.0.2+cu118 pypi_0 pypi
torchmetrics 0.7.0 pypi_0 pypi
torchvision 0.15.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
traitlets 5.9.0 pypi_0 pypi
typing-extensions 4.7.1 pypi_0 pypi
tzdata 2023c h04d1e81_0
ucrt 10.0.22621.0 h57928b3_0 conda-forge
urllib3 1.26.13 pypi_0 pypi
uvicorn 0.23.2 pypi_0 pypi
vc 14.2 h21ff451_1
vc14_runtime 14.36.32532 hfdfe4a8_17 conda-forge
vs2015_runtime 14.36.32532 h05e6639_17 conda-forge
wcwidth 0.2.6 pypi_0 pypi
websocket-client 1.6.2 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
werkzeug 2.3.7 pypi_0 pypi
wheel 0.38.4 py311haa95532_0
xz 5.4.2 h8cc25b3_0
yarl 1.9.2 pypi_0 pypi
zlib 1.2.13 h8cc25b3_0

ianfab commented 1 year ago

https://github.com/fairy-stockfish/variant-nnue-pytorch/wiki/FAQ#assertion-failed-bits--6
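
The linked FAQ entry addresses the `Assertion failed: bits <= 6` lines in the log above. The limit presumably reflects square indices being packed into 6 bits in the training data format, which covers at most 2^6 = 64 squares: enough for 8x8, but a 10x10 board has 100 squares and needs 7 bits, so the data loader rejects the data until it is reconfigured and rebuilt for the larger board, as the FAQ describes. A minimal Python sketch of that arithmetic, taking only the 6-bit limit from the assertion message itself:

```python
import math

def bits_needed(num_squares: int) -> int:
    # Bits required to encode a square index in 0 .. num_squares - 1.
    return max(1, math.ceil(math.log2(num_squares)))

for files, ranks in [(8, 8), (10, 10)]:
    squares = files * ranks
    bits = bits_needed(squares)
    status = "passes" if bits <= 6 else "fails"  # the loader asserts bits <= 6
    print(f"{files}x{ranks} board: {squares} squares -> {bits} bits ({status})")
```

An 8x8 board yields exactly 6 bits and passes; the 10x10 board in this issue yields 7 and trips the assertion.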

ACosmicCake commented 1 year ago

Hi, thank you so much for the quick reply! You are amazing. It has started training.

But if I may, I would like to ask about something I am confused about.

So to generate training data we can use either the classical eval or NNUE, and it is recommended to use NNUE for better training data evaluations.

But if there is no NNUE for the custom variant, I would have to use the classical eval first, and then train on the generated data with the pytorch trainer to create an NNUE file.

And if I then use that NNUE eval file to generate more training data and do the whole process again, wouldn't that be a perpetual cycle that doesn't improve on itself, since it ultimately derives from the classical eval's data rather than purely from NNUE play?

Thank you so much for your time. <3

ianfab commented 1 year ago

Yes, in principle a loop of "take best eval -> generate training data -> train -> get better eval" works, and that is basically what we are doing. There are diminishing returns though, so progress slows down and eventually stops.
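
To make that loop concrete, here is a rough sketch in Python. Everything in it is a hypothetical placeholder: the function names stand in for Fairy-Stockfish's data generation and this repo's train.py rather than any real API, and the Elo numbers are simulated purely to illustrate the diminishing returns mentioned above.

```python
def generate_training_data(eval_source: str, num_positions: int) -> str:
    """Hypothetical stand-in for self-play data generation with
    Fairy-Stockfish, labelling positions with `eval_source`
    (the classical eval or a previously trained .nnue file)."""
    return f"data_from_{eval_source}.bin"

def train_network(data_path: str, iteration: int) -> str:
    """Hypothetical stand-in for training with this repo's train.py;
    returns the path of the newly trained network."""
    return f"iter{iteration}.nnue"

def measure_elo_gain(candidate: str, baseline: str) -> float:
    """Hypothetical strength test (e.g. an engine match). Simulated
    here with shrinking gains to mimic diminishing returns."""
    measure_elo_gain.calls = getattr(measure_elo_gain, "calls", 0) + 1
    return 50.0 / measure_elo_gain.calls - 10.0  # fake numbers: +40, +15, +6.7, ...

best_eval = "classical"      # iteration 0: no NNUE exists for the variant yet
iteration = 0
while True:
    iteration += 1
    data = generate_training_data(best_eval, 1_000_000)
    candidate = train_network(data, iteration)
    gain = measure_elo_gain(candidate, best_eval)
    print(f"iteration {iteration}: {candidate} vs {best_eval}: {gain:+.1f} Elo")
    if gain <= 0:            # progress has stopped; keep the previous best
        break
    best_eval = candidate    # the stronger eval labels the next dataset
```

The answer to the question above is in the last line: each new dataset is labelled by the current best evaluation, not by the original classical eval, so the nets are not forever tied to the classical eval's judgments; the loop genuinely improves until the gains flatten out.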

ACosmicCake commented 1 year ago

Thank you so much!