SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/

Spot-the-diff drift detection: inconsistent results across multiple runs #390

Open gunja8 opened 3 years ago

gunja8 commented 3 years ago

Is there a way to stabilise the results of the spot-the-diff drift detection algorithm? In each run with the same configuration and data, the resulting diffs and p-values are different.

arnaudvl commented 3 years ago

Hi @gunja8. You are right that the results might differ slightly, as the random seeds are not fixed everywhere in the spot-the-diff detector (e.g. here). We will have a look into resolving this.

arnaudvl commented 3 years ago

Could you actually verify that it isn't solved by simply setting np.random.seed() in your runtime?

ascillitoe commented 3 years ago

*setting np.random.seed(seed) where seed is an int of your choosing.

We might want to keep this issue open regardless, as we could probably do with thinking about replacing all the randomised operations in alibi-detect with the newer np.random.Generator (perhaps with a detector kwarg to set the seed); see the sketch below for the general idea.

Related to https://github.com/SeldonIO/alibi/issues/209.
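
As an illustration of that pattern, here is a minimal sketch assuming a hypothetical seed kwarg (this is not the current alibi-detect API): the detector owns its own np.random.Generator rather than relying on the global np.random state.

import numpy as np

class ExampleDetector:
    """Hypothetical detector illustrating a per-instance Generator (not alibi-detect code)."""
    def __init__(self, seed=None):
        # Each detector instance owns its own RNG, seeded via a kwarg.
        self.rng = np.random.default_rng(seed)

    def initial_diffs(self, x_ref, n_diffs=1):
        # Same style of draw as the initial_diffs snippet below, but reproducible.
        return self.rng.normal(size=(n_diffs,) + x_ref.shape[1:]) * x_ref.std(0)

x_ref = np.random.default_rng(1).normal(size=(100, 8))
d1, d2 = ExampleDetector(seed=0), ExampleDetector(seed=0)
assert np.allclose(d1.initial_diffs(x_ref), d2.initial_diffs(x_ref))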

gunja8 commented 3 years ago

Thanks @arnaudvl and @ascillitoe for the suggestions. I have incorporated np.random.seed() while inputting initial_diffs:

n_diff = 1
np.random.seed(0)
initial_diff_mine = np.random.normal(size=(n_diff,) + x_ref.shape[1:]) * x_ref.std(0)

But the results are still inconsistent, i.e. the output deviates slightly on each run.

Could you please suggest any other change/update to the code that might solve the issue?

ascillitoe commented 3 years ago

Hi @gunja8, I've double-checked this and it looks like there is a minor issue here, as you said. I've run some tests on the cd_spot_the_diff_mnist_wine.ipynb example notebook, and tried to set all the important random seeds, e.g.

import os
import random

import numpy as np
import tensorflow as tf

# Python std lib random seed
random.seed(0)
# Numpy, tensorflow
np.random.seed(0)
tf.random.set_seed(0)
# Additional seeds potentially required when using a gpu
# (see https://www.youtube.com/watch?v=TB07_mUMt0U&t=1804s)
os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'
os.environ['TF_DETERMINISTIC_OPS'] = 'true'
os.environ['PYTHONHASHSEED'] = str(0)

Even with all of the above we still don't get deterministic results between predict calls, but curiously I do get deterministic results between notebook runs. E.g. in cell 4 of the notebook I always get p-value: 6.911625415497573e-09 when predict is called for the first time, p-value: 7.490401031041775e-08 when it is called for the second time, etc.

@RobertSamoilescu @arnaudvl any thoughts on this? I wonder if retrain_from_scratch is working as expected? i.e. is tf.keras.models.clone_model introducing some randomness?

arnaudvl commented 3 years ago

@ascillitoe I believe tf.keras.models.clone_model creates a model with newly initialized weights (docs). Could this be the cause, or are you also controlling the random seed for this behaviour?
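
For reference, a small standalone check of that hypothesis (the model below is just a stand-in, not the detector's actual network): clone_model re-initialises the weights, so successive clones differ unless the TF seed is reset before each call.

import numpy as np
import tensorflow as tf

# Stand-in model; the real detector's network is irrelevant to the point here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
])

# clone_model rebuilds the architecture with freshly initialised weights,
# so two clones generally differ from each other.
clone_a = tf.keras.models.clone_model(model)
clone_b = tf.keras.models.clone_model(model)
print(np.allclose(clone_a.get_weights()[0], clone_b.get_weights()[0]))  # typically False

# Resetting the global seed before each clone should make the
# re-initialisation repeatable (behaviour may vary by TF version).
tf.random.set_seed(0)
clone_c = tf.keras.models.clone_model(model)
tf.random.set_seed(0)
clone_d = tf.keras.models.clone_model(model)
print(np.allclose(clone_c.get_weights()[0], clone_d.get_weights()[0]))  # expected True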

Joshwlks commented 1 year ago

I am facing this issue and it looks really bad. The top feature by importance in my first run has diff score = 0.438, and in the second run it falls to 36th place out of 39 with diff score = 0.0000216. How can we trust the diff scores if they vary this much from one run to the next, or am I missing something? Alibi version: 0.11.0

vinyasHarish95 commented 1 year ago

Hey Seldon team, thanks for all the work on this great package!

I only get reproducible results when I set np.random.seed(my_seed) within the same code block as the instantiation of the detector (I'm using a Jupyter notebook). While the diff graph looks slightly different depending on whether the backend is set to pytorch or tensorflow, the desired behaviour once the seed has been set is consistent between the two backends.

alibi-detect v0.11.1, tensorflow 2.9.1, pytorch 1.12.1
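
For anyone hitting the same thing, here is a minimal sketch of the workflow described above, seeding numpy (and tensorflow) in the same cell as the detector instantiation. The data shapes and keyword arguments shown are placeholders for illustration; check the SpotTheDiffDrift docs for the exact signature.

import numpy as np
import tensorflow as tf
from alibi_detect.cd import SpotTheDiffDrift

# Placeholder reference/test data, fixed so that runs are comparable.
data_rng = np.random.default_rng(1)
x_ref = data_rng.normal(size=(500, 32)).astype(np.float32)
x_test = data_rng.normal(size=(500, 32)).astype(np.float32)

# Seed in the same cell as the detector instantiation, as described above.
np.random.seed(0)
tf.random.set_seed(0)
cd = SpotTheDiffDrift(x_ref, backend='tensorflow', p_val=.05, n_diffs=1)
preds = cd.predict(x_test)
print(preds['data']['is_drift'], preds['data']['p_val'])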

ojcobb commented 1 year ago

Hi @Joshwlks and @vinyasHarish95,

I think there's some confusion here arising from a slight misinterpretation of SpotTheDiff's output.

The detector's output should not be interpreted as a unique description of the difference between the underlying distributions.

They should be considered 'landmarks' against which the detector compared both reference and test images in order to identify differences between the distributions. Consider the simplest case of a single landmark (n_diffs=1). If the reference instances are significantly closer to the landmark on average than the test instances, then there must be a corresponding difference between the underlying distributions.

In much the same way as classifiers can classify instances in many different ways, the SpotTheDiff detector can identify differences between distributions using different landmarks. Nonetheless, when a detection is made it can still be useful to be shown exactly which landmarks were used. The interpretation is then that the returned landmarks are sufficient for confirming a difference between the distributions, but they do not give a complete or unique characterisation. Note that a full characterisation of the difference between two multivariate distributions would not be interpretable.
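
As a toy illustration of the single-landmark case (for intuition only; this is not the detector's actual, learned test statistic): if one sample is on average clearly more similar to a landmark than the other, the underlying distributions must differ, and many different landmarks can expose the same difference.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D samples: the test distribution is shifted relative to the reference.
x_ref = rng.normal(0.0, 1.0, size=1000)
x_test = rng.normal(0.5, 1.0, size=1000)

def mean_similarity(x, landmark):
    # Average RBF similarity of a sample to a single landmark.
    return np.exp(-(x - landmark) ** 2).mean()

# Two different landmarks both expose the same shift, just with different scores.
for landmark in (0.0, 1.0):
    print(landmark, mean_similarity(x_ref, landmark), mean_similarity(x_test, landmark))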

Hope that helps explain why different runs can produce different outputs. We'll consider whether we can improve the docs to make this clearer.

[Regarding seeding the computations for reproducibility, this is something we're aware of and working on. See #250.]