NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0
423 stars 40 forks

How can I confirm that the deterministic environment variable is working? #2

Closed: cklsoft closed this 4 years ago

cklsoft commented 4 years ago

I've used the newest NGC container and specified os.environ['TF_DETERMINISTIC_OPS'] = '1' at the beginning of my main entry point. But I don't know whether the environment variable is working or not. I set the TF log level to debug and didn't find any determinism-related log messages.

duncanriach commented 4 years ago

Hi @cklsoft, there is currently nothing printed to the logs by TensorFlow that confirms that the environment variable has been acted upon. I have made a note to potentially add this in the future.

cklsoft commented 4 years ago

> there is currently nothing printed to the logs by TensorFlow that confirms that the environment variable has been acted upon. I have made a note to potentially add this in the future.

Will TF_DETERMINISTIC_OPS increment the number of graph nodes?

duncanriach commented 4 years ago

That's a great idea.

Yes. If you count the graph nodes when running with TF_DETERMINISTIC_OPS unset (or set to '0' or 'false'), and then count them again with it set to '1' or 'true', you should see the number of graph nodes increase with the current implementation (NGC 19.06, NGC 19.07, and stock TensorFlow 1.14).

Note that TF_DETERMINISTIC_OPS is sticky within the Python process: it's queried and then cached by TensorFlow the first time it's used. So, to operate without it, you need to run from scratch in a fresh process.
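That sticky behavior can be illustrated with a small pure-Python model. The function name `deterministic_ops_enabled` and the caching detail below are illustrative stand-ins, not TensorFlow's actual implementation:

```python
import os

# Minimal model of the "sticky" env-var behavior described above:
# the variable is read once and the result is cached, so later
# changes to os.environ have no effect in the same process.
_cached = None

def deterministic_ops_enabled():
    global _cached
    if _cached is None:  # queried and cached on first use
        _cached = os.environ.get("TF_DETERMINISTIC_OPS", "0") in ("1", "true")
    return _cached

os.environ["TF_DETERMINISTIC_OPS"] = "1"
print(deterministic_ops_enabled())  # True: first query caches the value
os.environ["TF_DETERMINISTIC_OPS"] = "0"
print(deterministic_ops_enabled())  # still True: cached, change ignored
```

This is why the variable must be set before TensorFlow first consults it, ideally before the process starts or at the very top of the entry script.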

duncanriach commented 4 years ago

The ultimate test is whether your weights at the end of training change from run to run.

For Keras models, you can call the following at the end of training, and make sure it produces the same result on two consecutive runs:

def summarize_keras_weights(model):
  # Sum every element of every weight array into a single scalar
  weights = model.get_weights()
  summary = sum(map(lambda x: x.sum(), weights))
  print("Summary of weights: %.13f" % summary)

If you're not using Keras, it would look something like this:

def summarize_weights(session):
  # Unwrap session wrappers that expose the underlying raw session
  if hasattr(session, 'raw_session'): session = session.raw_session()
  # Fetch all trainable variables and sum them into a single scalar
  weights = session.run(tf.trainable_variables())
  summary = sum(map(lambda x: x.sum(), weights))
  print("Summary of weights: %.13f" % summary)

It's also good to confirm that your weights are the same on both runs before training starts.

Please note that while the above code is based on code I've used, the code as given above has not been tested. It may contain bugs and/or may not work on more recent versions of TensorFlow or Keras.
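For comparing the summaries across runs, one option is to persist the printed value and check it on the next run. Here is a sketch; the helper name `check_reproducibility` and the file-based scheme are illustrative additions, untested like the snippets above:

```python
import os
import tempfile

def check_reproducibility(summary, path):
    """Compare this run's weight summary against a previous run's.

    `summary` is the float computed by the snippets above; the helper
    name and file-based comparison are illustrative, not part of the
    original code.
    """
    formatted = "%.13f" % summary
    if os.path.exists(path):
        previous = open(path).read().strip()
        print("Match with previous run:", previous == formatted)
    else:
        print("No previous summary; recording this run.")
    with open(path, "w") as f:
        f.write(formatted)

# Simulate two consecutive runs that produced the same summary
path = os.path.join(tempfile.mkdtemp(), "weight_summary.txt")
check_reproducibility(1.2345678901234, path)  # first run: records the value
check_reproducibility(1.2345678901234, path)  # second run: compares, matches
```

Comparing the full 13-decimal string, rather than rounding, is deliberate: determinism means bit-identical results, so even a last-digit difference indicates nondeterminism.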

duncanriach commented 4 years ago

This question has been answered, and there is nothing else to be done here. Closing.