Thanks for reporting this and analyzing the issue. If we find a solution we will post it here. In case you find it, feel free to contribute it back following this: https://github.com/GT4SD/gt4sd-core/blob/main/CONTRIBUTING.md
Hi @itaijj, thanks for the interest in GT4SD. Have you tried fixing all torch random seeds? PaccMannGP uses torch, not TF.
I also tried fixing the torch random seeds before that and it didn't work. I see that you also use Gaussian process modules from scikit-optimize; do you know if they can be made deterministic somehow too? https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html
skopt.gp_minimize has a random_state input, so yes! Maybe try fixing this parameter on top of your current changes. Let us know whether it works.
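For illustration, a minimal sketch of what fixing that parameter could look like; the objective function and search-space bounds here are made-up placeholders, not PaccMannGP internals:

```python
from skopt import gp_minimize

def objective(x):
    # placeholder objective; in PaccMannGP this would be the property score
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

result = gp_minimize(
    objective,
    dimensions=[(-5.0, 5.0), (-5.0, 5.0)],  # example search-space bounds
    n_calls=20,
    random_state=42,  # fixes the GP optimizer's RNG
)
print(result.x)  # identical across runs for a fixed random_state
```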
I see that it does use the random state you give it, so any idea why it doesn't work? https://github.com/scikit-optimize/scikit-optimize/blob/a2369ddbc332d16d8ff173b12404b03fea472492/skopt/optimizer/optimizer.py#L281
The sklearn/skopt handling of the random state seems sane. The seed is passed down to the initializer in the code you linked.
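As a side note, a small sketch of that standard handling (generic sklearn code, not the linked skopt internals): sklearn, which skopt builds on, normalizes random_state through check_random_state, so an integer seed yields a reproducible generator:

```python
from sklearn.utils import check_random_state

rng = check_random_state(42)  # an int is turned into np.random.RandomState(42)
print(rng.uniform())          # same value on every run
```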
A separate source of stochasticity is in the SMILES generator, which uses SamplingSearch from paccmann_chemistry during decoding, see here. The source code in paccmann_chemistry is here:
But this should be affected by setting the global torch seed. As a next step I recommend verifying whether the remaining randomness comes from the skopt component or from the model (SMILES generation).
Remember that skopt is called after the SMILES generator in every optimization step, so if the SMILES generator is still stochastic, fixing the seed in skopt won't have any effect.
There's also random jitter being added to the latent point here.
The model is being set in eval mode here.
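To illustrate the distinction (a minimal generic torch sketch, not GT4SD code): eval() only disables training-time layers such as dropout; sampling from the decoder's output distribution still consumes the torch RNG, so it needs a seed of its own:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # without this, multinomial sampling differs per run
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5))
model.eval()  # dropout becomes a no-op; the RNG state is untouched

with torch.no_grad():
    probs = model(torch.randn(1, 8)).softmax(dim=-1)
    token = torch.multinomial(probs, num_samples=1)  # reproducible only if seeded
print(token)
```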
I see, thanks. I currently try to set the global seeds like this before creating and generating with the PaccMannGP model:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```
I still get different results each time. Any other ideas on what to set?
After debugging the code for a while, I think it saves each step's generated_smiles in an unordered set, https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/implementation.py#L242 and then eventually takes the first X required elements from it, and this is why it's random. Because the time it takes is not deterministic, I think the output is not deterministic and a different number of molecules is generated each time.
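A self-contained sketch of that failure mode (the SMILES strings are made up): Python's string hashing is salted per process, so iteration order over a set of strings varies between interpreter runs unless PYTHONHASHSEED is fixed before startup, and taking the "first N" elements is therefore effectively random:

```python
import itertools

# Run this script twice: the printed order will usually differ between
# interpreter runs because string hashes are salted per process.
smiles_set = {"CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCl"}
first_two = list(itertools.islice(smiles_set, 2))
print(first_two)
```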
The core.py is just the general wrapper for executing any GT4SD algorithm; it should not induce stochasticity. Have you tried replacing the SamplingSearch with a GreedySearch instead?
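For context, a minimal sketch of the difference between the two decoding strategies (generic torch code, not the paccmann_chemistry classes): greedy decoding takes the argmax and is deterministic, while sampling draws from the token distribution and depends on the RNG state:

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.5, 0.1]])  # stand-in decoder output

greedy_token = logits.argmax(dim=-1)  # always token 0, no randomness involved
sampled_token = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
print(greedy_token, sampled_token)  # the sampled token varies without a fixed seed
```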
I found out that even when the smiles_set contains the same molecules, we get randomness because we take num_items from the unordered set!
Interesting, have you tried adding a sorted() to the list returned here? https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/implementation.py#L270
It fixes it, but I don't want to change the library code.
I can bypass it by using the same num_samples and maximum_number_of_sampling_steps and then sorting the generated molecules in my own code.
I think you're right in flagging this as a "bug". We shouldn't rely on those two parameters to be identical.
The problem with sorted(list(...)) is that it induces a bias when the whole list is longer than the final one. But we can fix this:

```python
mols = sorted(list([s for s in smiles_set if s]))
np.random.shuffle(mols)
```

If the seed is fixed, the shuffling will be deterministic. It would be great if you could submit a PR for this. Otherwise I will try to do it timely.
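A self-contained sketch of why this works (made-up SMILES strings; the seed call is assumed to happen once, as in the fix above): sorting first removes the set-iteration nondeterminism, and with a fixed NumPy seed the subsequent shuffle is reproducible while still avoiding the length bias:

```python
import numpy as np

smiles_set = {"CCO", "c1ccccc1", "CC(=O)O", "", "CCN"}

np.random.seed(42)  # assumed global seed, set once
mols = sorted(s for s in smiles_set if s)  # canonical order, drop empty strings
np.random.shuffle(mols)
print(mols)  # identical across runs for a fixed seed
```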
I actually see that in paccmann_rl you sort the molecules in the set, so there is no similar problem: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/conditional_generation/paccmann_rl/implementation.py#L140 How does it not induce a bias there?
It does induce a bias too: in this case the molecules are sorted in reverse order by length, so short molecules are ignored.
But note that this is a legacy code snippet that is not executed when running PaccMannRL. Good catch, actually; we can remove this function. The function that is executed is this one, which is called from generate_batch(): https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/conditional_generation/paccmann_rl/implementation.py#L70
Submitted a PR in #226 to fix this @itaijj
Describe the bug
PaccMannGP does not have deterministic generator behaviour, although there is code that sets a seed here: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/core.py#L191 It seems that this seed is only used for the optimizer, but not for other sources of randomness in the code: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/implementation.py#L237C36-L237C36
I also tried the following after some googling, without success; it seems the seed should also be transferred to some network layer operation which I can't change when the model is fixed.

```python
import os
import random

import numpy as np
import tensorflow as tf

def seed_everything(seed=1234):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

def set_global_determinism(seed=1234):
    seed_everything(seed=seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)
```