net_to_model.py - ValueError: Number of filters in YAML doesn't match the network

namebrandon commented 5 years ago

Using t40.yml from the branch on GitHub with network 40800 (I tried others, same result). There are no games in the input_test / input_train directories, as I'm just trying to generate the model at this point. If I pass in a T30 network with the same t40.yml, I get much further (scroll down).

input

(base) D:\Chess\Training\lczero-training\tf>python net_to_model.py --cfg=../configs/t40.yml 40800

output

dataset:
  input_test: D:\\Chess\\Training\\lczero-training\\games\test\b001
  input_train: D:\\Chess\\Training\\lczero-training\\games\\train\b001
  num_chunks: 500000
  train_ratio: 0.9
gpu: 0
model:
  filters: 256
  policy_channels: 80
  residual_blocks: 20
  se_ratio: 8
name: 256x20-t40
training:
  batch_size: 4096
  checkpoint_steps: 10000
  lr_boundaries:
  - 100
  lr_values:
  - 0.02
  - 0.02
  max_grad_norm: 2
  num_batch_splits: 8
  path: D:\\Chess\\Training\\lczero-training\\networks
  policy_loss_weight: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_steps: 25
  test_steps: 125
  total_steps: 250
  train_avg_report_steps: 25
  value_loss_weight: 1.0
  warmup_steps: 125

Traceback (most recent call last):
  File "net_to_model.py", line 25, in <module>
    raise ValueError("Number of filters in YAML doesn't match the network")
ValueError: Number of filters in YAML doesn't match the network

YAML for reference

%YAML 1.2
---
name: '256x20-t40'                  # ideally no spaces
gpu: 0                                 # gpu id to process on

dataset: 
  num_chunks: 500000                   # newest nof chunks to parse
  train_ratio: 0.90                    # trainingset ratio
  # For separated test and train data.
  input_train: 'D:\\Chess\\Training\\lczero-training\\games\\train\b001' # supports glob
  input_test: 'D:\\Chess\\Training\\lczero-training\\games\test\b001'  # supports glob
  # For a one-shot run with all data in one directory.
  #input: '/work/lc0/data/'

training:
    swa: true
    swa_steps: 25
    swa_max_n: 10
    max_grad_norm: 2
    batch_size: 4096                   # training batch
    num_batch_splits: 8
    test_steps: 125                    # eval test set values after this many steps
    train_avg_report_steps: 25        # training reports its average values after this many steps.
    total_steps: 250                  # terminate after these steps
    warmup_steps: 125
    checkpoint_steps: 10000          # optional frequency for checkpointing before finish
    shuffle_size: 500000               # size of the shuffle buffer
    lr_values:                         # list of learning rates
        - 0.02
        - 0.02
    lr_boundaries:                     # list of boundaries
        - 100
    policy_loss_weight: 1.0            # weight of policy loss
    value_loss_weight: 1.0            # weight of value loss
    path: 'D:\\Chess\\Training\\lczero-training\\networks'          # network storage dir

model:
  filters: 256
  residual_blocks: 20
  se_ratio: 8
  policy_channels: 80
...

T30 attempt

D:\Chess\Training\lczero-training\tf>python net_to_model.py --cfg=../configs/t40.yml 32890

output

dataset:
  input_test: D:\\Chess\\Training\\lczero-training\\games\test\\b001
  input_train: D:\\Chess\\Training\\lczero-training\\games\\train\\b001
  num_chunks: 500000
  train_ratio: 0.9
gpu: 0
model:
  filters: 256
  policy_channels: 80
  residual_blocks: 20
  se_ratio: 8
name: 256x20-t40
training:
  batch_size: 4096
  checkpoint_steps: 10000
  lr_boundaries:
  - 100
  lr_values:
  - 0.02
  - 0.02
  max_grad_norm: 2
  num_batch_splits: 8
  path: D:\\Chess\\Training\\lczero-training\\networks
  policy_loss_weight: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_steps: 25
  test_steps: 125
  total_steps: 250
  train_avg_report_steps: 25
  value_loss_weight: 1.0
  warmup_steps: 125

2019-02-16 13:31:07.643530: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2019-02-16 13:31:07.918932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2019-02-16 13:31:07.927027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-16 13:31:08.403139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-16 13:31:08.407630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-16 13:31:08.410846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-16 13:31:08.414238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10137 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From D:\Chess\Training\lczero-training\tf\tfprocess.py:144: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Traceback (most recent call last):
  File "net_to_model.py", line 39, in <module>
    tfp.replace_weights(weights)
  File "D:\Chess\Training\lczero-training\tf\tfprocess.py", line 302, in replace_weights
    new_weight = tf.constant(new_weights[e], shape=weights.shape)
  File "D:\Users\brandon\Miniconda3\lib\site-packages\tensorflow\python\framework\constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "D:\Users\brandon\Miniconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 497, in make_tensor_proto
    (shape_size, nparray.size))
ValueError: Too many elements provided. Needed at most 256, but received 589824

Ttl commented 5 years ago

The newest code has some backwards-incompatible changes, and T30 nets can't be trained with the current code. You need to use code from before the SE commit.
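
The element counts in the T30 traceback are consistent with a weight-layout mismatch of this kind. As a rough check (an interpretation, not something stated in the thread): 589824 is the size of one 3x3 convolution kernel with 256 input and 256 output filters, while 256 is the size of a single per-filter vector. The filter count comes from the YAML above; the 3x3 kernel size is an assumption.

```python
# Sanity-check the two element counts from the traceback.
filters = 256  # from `model.filters` in the YAML above
kernel = 3     # assumed 3x3 convolution (not stated in the thread)

# A conv kernel holds kernel * kernel * in_filters * out_filters values.
conv_elements = kernel * kernel * filters * filters
# A per-filter vector (for example a bias) holds one value per filter.
per_filter_elements = filters

print(conv_elements)        # the "received" count in the traceback
print(per_filter_elements)  # the "needed at most" count in the traceback
```

If this reading is right, a convolution weight from the T30 file was being assigned to a slot that expected a 256-element vector, which is what a changed weight ordering would produce.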

To restore a T40 net you need to specify policy: classical and value: classical in the YAML, since the defaults have changed to the convolutional policy head and the WDL value head.
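
As a sketch of that change, the snippet below adds the two settings to an already loaded config dictionary. It assumes both keys belong under the `model` section, which the comment above does not state, so check where tfprocess.py reads them in your checkout. The helper name `use_classical_heads` is hypothetical.

```python
import copy


def use_classical_heads(cfg):
    """Return a copy of cfg with classical policy and value heads selected.

    Assumption: the training code reads both keys from the `model` section.
    """
    patched = copy.deepcopy(cfg)
    model = patched.setdefault("model", {})
    model["policy"] = "classical"
    model["value"] = "classical"
    return patched


# The `model` section from the t40.yml shown earlier in this issue.
cfg = {
    "model": {
        "filters": 256,
        "residual_blocks": 20,
        "se_ratio": 8,
        "policy_channels": 80,
    }
}

patched = use_classical_heads(cfg)
print(patched["model"])
```

In the YAML file itself this would amount to two extra lines, `policy: 'classical'` and `value: 'classical'`, placed in that same section under the assumption above.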

namebrandon commented 5 years ago

Thank you, super helpful! Is there a specific branch for pre-SE code, or any idea when that commit was? I was poking around trying to find the code from just before se_ratio: showed up, but I was still seeing errors even when I thought I had found the right version.
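
One way to look for that commit, as a sketch: git's pickaxe search (`git log -S<string>`) lists commits that changed the number of occurrences of a string, so the oldest hit for `se_ratio` is a reasonable candidate for the SE commit. The helper names below are hypothetical, and the search only narrows things down; it does not confirm which commit is the last one compatible with T30 nets.

```python
import subprocess


def pickaxe_command(term):
    """Build a `git log` command listing commits that add or remove `term`."""
    # --reverse prints the oldest matching commit first.
    return ["git", "log", f"-S{term}", "--oneline", "--reverse"]


def find_commits(repo_dir, term="se_ratio"):
    """Run the pickaxe search inside repo_dir and return the output lines."""
    result = subprocess.run(
        pickaxe_command(term),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Example (needs a local clone of lczero-training):
# for line in find_commits(r"D:\Chess\Training\lczero-training"):
#     print(line)
```

Checking out the parent of the first commit listed would then give a tree from before `se_ratio` appeared.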

LeelaChessZero / lczero-training, issue #65