Zeta36 / chess-alpha-zero

Chess reinforcement learning by AlphaGo Zero methods.
MIT License

GPU error #73

Open fredzfm opened 6 years ago

fredzfm commented 6 years ago

I tried to run it with a GPU and got the following error. Can anyone help me with this?

(Python36) D:\chess\chess-alpha-zero>python src/chess_zero/run.py self
2018-10-10 11:45:55,139@chess_zero.manager INFO # config type: mini
Using TensorFlow backend.
2018-10-10 11:45:59,436@chess_zero.agent.model_chess DEBUG # loading model from D:\chess\chess-alpha-zero\data\model\model_best_config.json
2018-10-10 11:45:59.478648: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-10-10 11:45:59.695745: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-10-10 11:45:59.790370: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-10-10 11:45:59.795932: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-10-10 11:48:20.448740: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-10 11:48:20.451530: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929]      0 1
2018-10-10 11:48:20.453788: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0:   N N
2018-10-10 11:48:20.455816: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 1:   N N
2018-10-10 11:48:20.458363: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8795 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-10 11:48:20.834375: I C:\users\nwani_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8795 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1567, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/chess_zero/run.py", line 20, in <module>
    manager.start()
  File "src\chess_zero\manager.py", line 64, in start
    return self_play.start(config)
  File "src\chess_zero\worker\self_play.py", line 25, in start
    return SelfPlayWorker(config).start()
  File "src\chess_zero\worker\self_play.py", line 45, in __init__
    self.current_model = self.load_model()
  File "src\chess_zero\worker\self_play.py", line 85, in load_model
    if self.config.opts.new or not load_best_model_weight(model):
  File "src\chess_zero\lib\model_helper.py", line 15, in load_best_model_weight
    return model.load(model.config.resource.model_best_config_path, model.config.resource.model_best_weight_path)
  File "src\chess_zero\agent\model_chess.py", line 145, in load
    self.model = Model.from_config(json.load(f))
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\engine\network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\engine\network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\engine\base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\layers\normalization.py", line 206, in call
    training=training)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 3123, in in_train_phase
    x = switch(training, x, alt)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 3058, in switch
    else_expression_fn)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\util\deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2072, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 1913, in BuildCondBranch
    original_result = fn()
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\layers\normalization.py", line 167, in normalize_inference
    epsilon=self.epsilon)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 1908, in batch_normalization
    mean = tf.reshape(mean, (-1))
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 6112, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1734, in __init__
    control_input_ops)
  File "C:\Users\isszfm\AppData\Local\Continuum\anaconda3\envs\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1570, in _create_c_op
    raise ValueError(str(e))
ValueError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

brianprichardson commented 6 years ago

It has been a while, but for a "self" play run, try without any weight file; it should create one to start with. The best weights are for "uci" play.

fredzfm commented 6 years ago

Thanks brianprichardson.

You mean it never runs with a GPU? I have tried "self" without any JSON model or weight file and still got the error.

fredzfm commented 6 years ago

tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].
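
A note on reading this error, offered as an interpretation of the traceback rather than a diagnosis confirmed in this thread: the failing frame is the Keras backend call tf.reshape(mean, (-1)). In Python, (-1) is just the integer -1, not a one-element tuple, so the shape argument is a scalar (rank 0). That is consistent with the reported input shapes [1,256,1,1], [] where the second, empty shape belongs to the shape argument. The plain-Python sketch below only shows the int-versus-tuple distinction; shape_rank is a hypothetical helper and not part of TensorFlow or Keras.

```python
def shape_rank(shape):
    """Rank of a shape argument: 0 for a bare int, 1 for a flat list or tuple.

    Hypothetical helper for illustration only; not a TensorFlow or Keras API.
    """
    if isinstance(shape, int):
        return 0
    if isinstance(shape, (list, tuple)) and all(isinstance(d, int) for d in shape):
        return 1
    raise TypeError(f"unsupported shape argument: {shape!r}")


scalar_shape = (-1)   # parentheses only group: this is the int -1
vector_shape = (-1,)  # the trailing comma makes a one-element tuple

print(type(scalar_shape).__name__, shape_rank(scalar_shape))
print(type(vector_shape).__name__, shape_rank(vector_shape))
```

If this reading is right, whether the call is accepted may depend on the installed Keras and TensorFlow versions as well as on the model files.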

brianprichardson commented 6 years ago

Depending on the situation, the weights (.h5) and model (.json) files must match the net architecture in the config file (typically mini.py). The stronger ones that I uploaded do not match the current config files.

IIRC, when running "self", if there are no .h5 and .json files they will be created first. You can add self.model.summary() at the end of def build(self) in class ChessModel (model_chess.py, in the agent dir) to see if it is creating a new model from the specs in the mini.py file.
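
The mismatch check can also be approximated without loading TensorFlow by inspecting the model .json file directly. The sketch below assumes the file holds a Keras functional-model config, i.e. a dict with a "layers" list whose entries carry "class_name" and "config" keys. That layout, the function name summarize_config, and the stand-in example dict are all assumptions for illustration, so adjust them if your file differs.

```python
import json
from collections import Counter


def summarize_config(config):
    """Summarize a Keras-style model config dict (layout is an assumption)."""
    layers = config.get("layers", [])
    # Count layers by class name, e.g. Conv2D, BatchNormalization.
    counts = Counter(layer.get("class_name", "?") for layer in layers)
    # Collect the distinct filter counts used by Conv2D layers.
    conv_filters = sorted({
        layer["config"]["filters"]
        for layer in layers
        if layer.get("class_name") == "Conv2D"
        and "filters" in layer.get("config", {})
    })
    return {"layer_counts": dict(counts), "conv_filters": conv_filters}


# Tiny stand-in config so the sketch runs on its own (not the real model file).
example = {
    "name": "chess_model",
    "layers": [
        {"class_name": "InputLayer", "config": {"name": "input_1"}},
        {"class_name": "Conv2D", "config": {"name": "input_conv", "filters": 256}},
        {"class_name": "BatchNormalization", "config": {"name": "input_batchnorm"}},
        {"class_name": "Conv2D", "config": {"name": "policy_conv", "filters": 2}},
    ],
}

print(summarize_config(example))

# With a real file, something like:
# with open("data/model/model_best_config.json") as f:
#     print(summarize_config(json.load(f)))
```

Comparing the reported filter counts and layer counts against the values in the config file (typically mini.py) gives a quick hint as to whether the saved model was built from a different architecture.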

For running "uci" it just tries to read the best files. Other params in the config file can still be set, but most are ignored for uci, like playouts is 1,200 (sort of like fixed number of nodes).

The first output you posted shows it is trying to run with the GPU. As slow as it is, it will be far too slow to run without a GPU, and your 1080 Ti is a very good one.

I would try a clean download and just try to run with "uci" and enter the "uci" and "isready" (remember to wait for the readyok), and then "go". You should get a bestmove after some time. If that works, then your packages and gpu are all working ok and we can work from there.

What are you trying to do, in general? Self-play training is extremely slow and takes a lot of disk space for the intermediate input plane files. That's why I have a tweaked version that takes pgn input and trains directly from that.

reikdas commented 6 years ago

This issue might be related to #75 and #76.

> I would try a clean download and just try to run with "uci" and enter the "uci" and "isready" (remember to wait for the readyok), and then "go". You should get a bestmove after some time. If that works, then your packages and gpu are all working ok and we can work from there.

What is the command to run this? python src/chess_zero/run.py uci --isready does not work.

brianprichardson commented 6 years ago

First only do: python src/chess_zero/run.py uci

Then, after it loads, enter:
uci [wait for uciok]
isready [wait for readyok]
go [should see some bestmove output but may take some time with cpu and gpu busy]
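
The same sequence can be scripted. The following is a minimal sketch, not part of the repository: it starts an engine as a subprocess, sends uci, isready, and go, and waits for uciok, readyok, and a bestmove line in turn. The names wait_for, uci_smoke_test, and FAKE_ENGINE are hypothetical, and a tiny stand-in engine is used so the sketch runs without the project installed.

```python
import subprocess
import sys


def wait_for(stream, token):
    """Read lines from the engine until one starts with `token`."""
    for line in stream:
        line = line.strip()
        if line.startswith(token):
            return line
    raise RuntimeError(f"engine closed its output before sending {token!r}")


def uci_smoke_test(cmd):
    """Run the uci / isready / go sequence and return the bestmove line."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode for stdin and stdout
        bufsize=1,                # line-buffered writes to the engine
    )

    def send(command):
        proc.stdin.write(command + "\n")
        proc.stdin.flush()

    try:
        send("uci")
        wait_for(proc.stdout, "uciok")
        send("isready")
        wait_for(proc.stdout, "readyok")
        send("go")
        return wait_for(proc.stdout, "bestmove")
    finally:
        # Stop the engine whether or not the handshake succeeded.
        proc.kill()
        proc.wait()


# Stand-in engine that only answers the three commands used above.
FAKE_ENGINE = r'''
import sys
replies = {"uci": "uciok", "isready": "readyok", "go": "bestmove e2e4"}
for line in sys.stdin:
    print(replies.get(line.strip(), ""), flush=True)
'''

print(uci_smoke_test([sys.executable, "-c", FAKE_ENGINE]))
```

To try it against the project, the command list would presumably be [sys.executable, "src/chess_zero/run.py", "uci"], run from the repository root. The sketch has no timeout, so a slow search simply blocks until a bestmove line arrives.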

reikdas commented 6 years ago

I get the following error log when I issue isready:

Using TensorFlow backend.
2018-11-11 18:52:28.546655: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-11 18:52:28.667415: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-11 18:52:28.667979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: 
name: GeForce GTX 1070 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.2655
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.51GiB
2018-11-11 18:52:28.667993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-11 18:52:29.920908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-11 18:52:29.920943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 
2018-11-11 18:52:29.920952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N 
2018-11-11 18:52:29.921135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7243 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1626, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/chess_zero/run.py", line 20, in <module>
    manager.start()
  File "src/chess_zero/manager.py", line 76, in start
    return uci.start(config)
  File "src/chess_zero/play_game/uci.py", line 31, in start
    me_player = get_player(config)
  File "src/chess_zero/play_game/uci.py", line 67, in get_player
    if not load_best_model_weight(model):
  File "src/chess_zero/lib/model_helper.py", line 15, in load_best_model_weight
    return model.load(model.config.resource.model_best_config_path, model.config.resource.model_best_weight_path)
  File "src/chess_zero/agent/model_chess.py", line 145, in load
    self.model = Model.from_config(json.load(f))
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/layers/normalization.py", line 206, in call
    training=training)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3123, in in_train_phase
    x = switch(training, x, alt)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3058, in switch
    else_expression_fn)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2087, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1920, in BuildCondBranch
    original_result = fn()
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/layers/normalization.py", line 167, in normalize_inference
    epsilon=self.epsilon)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1908, in batch_normalization
    mean = tf.reshape(mean, (-1))
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6296, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1790, in __init__
    control_input_ops)
  File "/home/pratyush/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1629, in _create_c_op
    raise ValueError(str(e))
ValueError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

adangert commented 5 years ago

I got the same errors: ValueError: Shape must be rank 1 but is rank 0 for 'input_batchnorm/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,256,1,1], [].

brianprichardson commented 5 years ago

See #75; there is a link to a fork with a working version.