dingo-gw / dingo

Dingo: Deep inference for gravitational-wave observations
MIT License
55 stars 19 forks source link

Fix bug related to random seeds #233

Closed max-dax closed 10 months ago

max-dax commented 10 months ago

This fixes a bug, where each worker was initialised with the same random seed due to a change made in bilby.

This bug likely lead to identical samples for extrinsic parameters, which corrupted the neural network training in the following way. The network could just memorise the parameters from the previous workers, without looking at the strain data. With N data loader workers, this resulted in a pattern where the loss would rapidly decrease over N iterations, as the network memorised the parameters. In iteration N+1, the loss steeply increased (as all workers would generate new random numbers), and then rapidly decrease again until iteration 2N.

stephengreen commented 10 months ago

It looks good to me, but needs to be tested @nihargupte-ph

nihargupte-ph commented 10 months ago

Looks fine to me. I ran a model starting at epoch 200 through the debugger. I am not sure why but the loss is starting higher than I expected (12). Nonetheless, I find that the loss decreases after 32 iterations (the number of workers) but does not again increase. I attach a log file here. log.txt

stephengreen commented 10 months ago

Looks fine to me. I ran a model starting at epoch 200 through the debugger. I am not sure why but the loss is starting higher than I expected (12). Nonetheless, I find that the loss decreases after 32 iterations (the number of workers) but does not again increase. I attach a log file here. log.txt

The reason for this could be that the random seed for each worker is reset to the same value at the beginning of each epoch. This might be the case if the DataLoader creates new workers each epoch, although I don't know why the seed would be the same, unless Bilby is setting it somewhere. In any case, this would in principle allow the network to overfit to the training data from epoch to epoch, and could result in the behavior you are seeing.

stephengreen commented 10 months ago

Docs are failing to build. Could someone @nihargupte-ph @max-dax investigate this? We will need docs to build before we can merge.

nihargupte-ph commented 10 months ago

Docs are failing to build. Could someone @nihargupte-ph @max-dax investigate this? We will need docs to build before we can merge.

So I think it's because the right version of docutils isn't not being used in the CI script. This line,

pip install --force-reinstall "docutils<0.17"

should be replaced with

python -m pip install --force-reinstall "docutils==0.18.1"

Because the docs are failing to build with 0.17:

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/dingo-gw/envs/233/bin/sphinx-apidoc", line 5, in <module>
    from sphinx.ext.apidoc import main
  File "/home/docs/checkouts/readthedocs.org/user_builds/dingo-gw/envs/233/lib/python3.10/site-packages/sphinx/ext/apidoc.py", line 28, in <module>
    from sphinx.cmd.quickstart import EXTENSIONS
  File "/home/docs/checkouts/readthedocs.org/user_builds/dingo-gw/envs/233/lib/python3.10/site-packages/sphinx/cmd/quickstart.py", line 34, in <module>
    from sphinx.util.console import (  # type: ignore[attr-defined]
  File "/home/docs/checkouts/readthedocs.org/user_builds/dingo-gw/envs/233/lib/python3.10/site-packages/sphinx/util/__init__.py", line 24, in <module>
    from sphinx.util.nodes import (  # noqa: F401
  File "/home/docs/checkouts/readthedocs.org/user_builds/dingo-gw/envs/233/lib/python3.10/site-packages/sphinx/util/nodes.py", line 12, in <module>
    from sphinx import addnodes
  File "/home/docs/checkouts/readthedocs.org/user_builds/dingo-gw/envs/233/lib/python3.10/site-packages/sphinx/addnodes.py", line 18, in <module>
    'meta': (nodes.meta, 'docutils.nodes.meta'),  # type: ignore[attr-defined]
AttributeError: module 'docutils.nodes' has no attribute 'meta'
stephengreen commented 10 months ago

should be replaced with

python -m pip install --force-reinstall "docutils==0.18.1"

I just removed this line and it compiles fine. Thanks.