LeelaChessZero / lczero-training

For code etc relating to the network training process.

ZeroDivisionError in squeeze_excitation function due to SE_ratio being set to zero #224

Open ChinChangYang opened 6 months ago

ChinChangYang commented 6 months ago

Description

I encountered a ZeroDivisionError while running the net_to_model.py script in the lczero-training project. The error is raised in the squeeze_excitation function in tfprocess.py, at the assertion that the number of channels is evenly divisible by self.SE_ratio. The modulo in that assertion divides by zero, which means self.SE_ratio is zero at that point.

Steps to Reproduce

  1. Clone the lczero-training repository with submodules.
  2. Install necessary Python packages: numpy, tensorflow, protobuf.
  3. Download specific network weights and configuration files.
  4. Initialize and run the training setup as per the provided instructions.
  5. The error occurs during the execution of the net_to_model.py script, specifically when the squeeze_excitation function is called.

Expected Behavior

The squeeze_excitation function should run without errors, applying the squeeze-and-excitation operations to the input tensor using a non-zero SE_ratio.

Actual Behavior

Execution fails with a ZeroDivisionError, which shows that self.SE_ratio is zero. The traceback points to the squeeze_excitation function in tfprocess.py.

Environment

https://colab.research.google.com/drive/1a3lkH1IUG-P_Y7scNjenmTmRdJ0RF_5R?usp=sharing

Additional Context

The error suggests that SE_ratio is either misconfigured or never initialized to a usable value. The squeeze-excitation operation depends on this parameter, which should be a positive integer that divides the number of channels without remainder. This could be either a code bug or a configuration issue.
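To make the constraint on the ratio concrete, here is a minimal sketch of an up-front check that fails with a readable message instead of a ZeroDivisionError. The helper name validate_se_ratio and its arguments are hypothetical and are not part of lczero-training; it only illustrates the "positive integer that divides the channel count" requirement.

```python
def validate_se_ratio(channels: int, se_ratio: int) -> int:
    """Return the SE bottleneck width, or raise a descriptive ValueError.

    Hypothetical helper for illustration only; not lczero-training code.
    """
    # Reject zero, negative, and non-integer ratios before any division.
    if not isinstance(se_ratio, int) or se_ratio <= 0:
        raise ValueError(
            f"SE ratio must be a positive integer, got {se_ratio!r}")
    # The ratio must divide the channel count without remainder.
    if channels % se_ratio != 0:
        raise ValueError(
            f"channels ({channels}) is not divisible by SE ratio ({se_ratio})")
    return channels // se_ratio


print(validate_se_ratio(256, 8))  # 32
try:
    validate_se_ratio(256, 0)
except ValueError as err:
    print(err)  # SE ratio must be a positive integer, got 0
```

A check like this would turn a zero ratio into an error that names the offending value, which is easier to trace back to the configuration or the loaded network.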

Here's the relevant portion of the error message for quick reference:

Traceback (most recent call last):
  File "/content/lczero-training/tf/net_to_model.py", line 28, in <module>
    tfp.init_net()
  File "/content/lczero-training/tf/tfprocess.py", line 383, in init_net
    outputs = self.construct_net(input_var)
  File "/content/lczero-training/tf/tfprocess.py", line 1529, in construct_net
    flow = self.create_residual_body(inputs)
  File "/content/lczero-training/tf/tfprocess.py", line 1424, in create_residual_body
    flow = self.residual_block(flow,
  File "/content/lczero-training/tf/tfprocess.py", line 1248, in residual_block
    out2 = self.squeeze_excitation(self.batch_norm(conv2,
  File "/content/lczero-training/tf/tfprocess.py", line 1196, in squeeze_excitation
    assert channels % self.SE_ratio == 0
ZeroDivisionError: integer division or modulo by zero
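The traceback ends on the assert line itself, not on a later division. A small standalone sketch (the function check_divisible is made up for this illustration and is not taken from tfprocess.py) shows why: the modulo inside the assertion is evaluated first, so a zero ratio raises ZeroDivisionError before the assertion can pass or fail.

```python
def check_divisible(channels: int, se_ratio: int) -> str:
    """Report which outcome the divisibility assertion produces."""
    try:
        # Same shape as the assertion in the traceback above.
        assert channels % se_ratio == 0
    except ZeroDivisionError:
        # The modulo itself failed; the assert never got a value to test.
        return "ZeroDivisionError"
    except AssertionError:
        # The modulo succeeded but left a remainder.
        return "AssertionError"
    return "ok"


print(check_divisible(256, 0))  # ZeroDivisionError
print(check_divisible(256, 3))  # AssertionError
print(check_divisible(256, 8))  # ok
```

These results assume Python is run without the -O flag, since -O strips assert statements entirely.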

I would appreciate any insights into this issue or suggestions on how to properly configure the SE_ratio to avoid this error.


borg323 commented 6 months ago

This goes back to #58, when support for non-SE CNNs was dropped.