GT4SD / gt4sd-core

GT4SD, an open-source library to accelerate hypothesis generation in the scientific discovery process.
https://gt4sd.github.io/gt4sd-core/
MIT License

Bug with random seed in PaccMann generator #224

Closed: itaijj closed this issue 1 year ago

itaijj commented 1 year ago

Describe the bug

PaccMannGP does not have deterministic generator behaviour, although there is code that sets a seed here: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/core.py#L191

It seems that this seed is only used for the optimizer, but not for the other sources of randomness in the code: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/implementation.py#L237C36-L237C36

I also tried the following after some googling, without success; it seems the seed would also need to be propagated to some network layer operation that I can't change when the model is fixed:

```python
import os  # needed for the environment variables below
import random

import numpy as np
import tensorflow as tf

def seed_everything(seed=1234):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

def set_global_determinism(seed=1234):
    seed_everything(seed=seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)
```

drugilsberg commented 1 year ago

Thanks for reporting this and analyzing the issue. If we find a solution, we will post it here. If you find one first, feel free to contribute it back following this guide: https://github.com/GT4SD/gt4sd-core/blob/main/CONTRIBUTING.md

jannisborn commented 1 year ago

Hi @itaijj, thanks for the interest in GT4SD. Have you tried fixing all torch random seeds? PaccMannGP uses torch, not TF.

itaijj commented 1 year ago

I also tried the torch random seeds before that, and it didn't work. I see that you also use Gaussian process modules from scikit-optimize; do you know if they can be made deterministic too? https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html

jannisborn commented 1 year ago

skopt.gp_minimize has a random_state input, so yes!

Maybe try fixing this parameter on top of your current changes. Let us know whether it works.
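For example, a minimal sketch of fixing random_state (the objective and search space here are made up for illustration, not the GT4SD latent-space setup):

```python
from skopt import gp_minimize

def objective(x):
    # toy 1-D objective standing in for the latent-space scoring function
    return (x[0] - 0.3) ** 2

result = gp_minimize(
    objective,
    dimensions=[(-1.0, 1.0)],  # one continuous dimension
    n_calls=15,
    random_state=42,  # fixes the optimizer's internal sampling
)
print(result.x)  # identical across runs with the same random_state
```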

itaijj commented 1 year ago

I see that it does use the random state you pass it, so any idea why it doesn't work? https://github.com/scikit-optimize/scikit-optimize/blob/a2369ddbc332d16d8ff173b12404b03fea472492/skopt/optimizer/optimizer.py#L281

jannisborn commented 1 year ago

The sklearn/skopt handling of the random state seems sane. The seed is passed down to the initializer in the code you linked.

A separate source of stochasticity is the SMILES generator, which uses SamplingSearch from paccmann_chemistry during decoding, see here. The source code in paccmann_chemistry is here. But this should be affected by setting the global torch seed. As a next step, I recommend verifying whether the remaining randomness comes from the skopt component or from the model (SMILES generation). Remember that skopt is called after the SMILES generator in every optimization step, so if the SMILES generator is still stochastic, fixing the seed in skopt won't have any effect.
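One way to run that check, as a hypothetical sketch (decode_smiles_stub is a made-up stand-in for the sampling-based decoding, not the paccmann_chemistry API):

```python
import torch

# Hypothetical stand-in for a sampling-based decoder; like SamplingSearch,
# it draws tokens from torch's global RNG.
def decode_smiles_stub(latent: torch.Tensor) -> str:
    probs = torch.softmax(latent, dim=-1)
    idx = torch.multinomial(probs, num_samples=5, replacement=True)
    return "".join("CNOS="[int(i)] for i in idx)

def run_once(seed: int = 0) -> str:
    torch.manual_seed(seed)  # re-seed immediately before decoding
    latent = torch.randn(5)
    return decode_smiles_stub(latent)

# If both runs match, the decoder is deterministic under a fixed torch seed
# and the remaining randomness must come from elsewhere (e.g. skopt).
assert run_once(0) == run_once(0)
print(run_once(0))
```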

There's also random jitter being added to the latent point here.
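For illustration only, jitter of this kind typically looks like the hypothetical sketch below (names are illustrative, not the actual GT4SD code); since torch.randn_like draws from the global torch RNG, torch.manual_seed must be fixed for it to be reproducible:

```python
import torch

def jitter_latent(latent_point: torch.Tensor, scale: float = 1e-3) -> torch.Tensor:
    # Hypothetical illustration: add small Gaussian noise to a latent point.
    # Reproducible only if torch.manual_seed(...) was fixed beforehand.
    return latent_point + scale * torch.randn_like(latent_point)

torch.manual_seed(0)
print(jitter_latent(torch.zeros(4)))
```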

jannisborn commented 1 year ago

The model is set to eval mode here.
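For context, eval mode disables training-time stochasticity such as dropout. A minimal sketch, unrelated to the PaccMann architecture:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
model.eval()  # dropout becomes a no-op, so repeated forward passes agree
x = torch.ones(1, 4)
assert torch.equal(model(x), model(x))
```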

itaijj commented 1 year ago

I see, thanks. I currently set the global seeds like this before creating the PaccMannGP model and generating:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

I still get different results each time. Any other idea what to set?

itaijj commented 1 year ago

After debugging the code for a while, I think it saves the SMILES generated at each step in an unordered set: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/implementation.py#L242 and then it eventually takes the first X required elements from it, and this is why it's random.
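A minimal demonstration of why taking the "first" elements of a set is non-deterministic across runs; note also that setting PYTHONHASHSEED from inside an already-running interpreter, as in the seed_everything snippet above, has no effect, since string-hash randomization is fixed at interpreter startup:

```python
# Run this script twice without fixing PYTHONHASHSEED before launch and the
# "first" items taken from the set will usually differ between runs.
smiles_set = {"CCO", "c1ccccc1", "CC(=O)O", "CCN"}
print(list(smiles_set)[:2])
```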

jannisborn commented 1 year ago

> because the time it takes is not deterministic, I think the output is not deterministic and generates a different number of molecules each time

core.py is just the general wrapper for executing any GT4SD algorithm; it should not induce stochasticity. Have you tried replacing the SamplingSearch with a GreedySearch instead?

itaijj commented 1 year ago

I found out that even when the smiles_set contains the same molecules, we get randomness because we take num_items from the unordered set!

jannisborn commented 1 year ago

Interesting, have you tried adding a sorted() to the list returned here: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/controlled_sampling/paccmann_gp/implementation.py#L270

itaijj commented 1 year ago

It fixes it, but I don't want to change the library code.

itaijj commented 1 year ago

I can work around it by using the same num_samples and maximum_number_of_sampling_steps and then sorting the generated molecules in my own code.
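A minimal sketch of that workaround; sample_with_paccmann_gp is a hypothetical stand-in, not the actual GT4SD API:

```python
# Hypothetical stand-in for a PaccMannGP run configured so that num_samples
# equals maximum_number_of_sampling_steps, i.e. the full set is returned and
# the library never truncates the unordered set internally.
def sample_with_paccmann_gp(num_samples: int) -> set:
    return {"CCO", "c1ccccc1", "CC(=O)O"}

# Impose a deterministic order client-side.
molecules = sorted(sample_with_paccmann_gp(num_samples=3))
print(molecules)
```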

jannisborn commented 1 year ago

I think you're right in flagging this as a "bug". We shouldn't rely on those two parameters being identical.

The problem with sorted(list(...)) is that it induces a bias when the whole list is longer than the final one. But we can fix this:

```python
mols = sorted(list([s for s in smiles_set if s]))
np.random.shuffle(mols)
```

If the seed is fixed, the shuffling will be deterministic. It would be great if you could submit a PR for this; otherwise I will try to do it soon.
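A self-contained sketch of why this fixes the bias; smiles_set and the seed value are made up for the example:

```python
import numpy as np

smiles_set = {"CCO", "c1ccccc1", "CC(=O)O", ""}  # toy stand-in, with an empty entry

np.random.seed(42)  # the globally fixed seed discussed above

# Sorting removes the set's arbitrary iteration order; shuffling afterwards
# makes truncation to the first n items unbiased yet reproducible.
mols = sorted(s for s in smiles_set if s)
np.random.shuffle(mols)

print(mols[:2])  # identical on every run with the same seed
```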

itaijj commented 1 year ago

I actually see that in paccmann_rl you sort the molecules in the set, so there is no similar problem: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/conditional_generation/paccmann_rl/implementation.py#L140

How does it not induce a bias there?

jannisborn commented 1 year ago

It does induce a bias too; in this case the molecules are sorted in reverse order by length, so short molecules are ignored.

But note that this is a legacy code snippet that is not executed when running PaccMannRL. Good catch, actually; we can remove this function. The function that is executed is this: https://github.com/GT4SD/gt4sd-core/blob/ea9c299cea6ea6c63696b4e9ee49ceb16fb58507/src/gt4sd/algorithms/conditional_generation/paccmann_rl/implementation.py#L70 which is called from generate_batch().

jannisborn commented 1 year ago

Submitted PR #226 to fix this, @itaijj.