MDAnalysis / mdaencore

Ensemble overlap comparison software for molecular data.
http://www.mdanalysis.org/mdaencore/
GNU General Public License v2.0
0 stars 0 forks source link

Encore uses weak random numbers #37

Open kain88-de opened 7 years ago

kain88-de commented 7 years ago

In ap.c the normal rand function of the c standard lib is called. In general this is a weak Linear congruential generator that is not thread save. With weak here I mean that it has a short period and the resulting numbers are correlated. Another issue is thread safety. This is a problem when the CAffinityPropagation function is called from separate python processes because we can't independently seed the RNG for each call of the function. I'm not sure how often the addition of noise is used in practice but I would recommend to remove the addition of noise in ap.c and rather do that in affinityprop.pyx with it's own numpy.random.RandomState. To allow using unique random states I would suggest the following code from scikit-learn.

# copy-pasted from scikit-learn utils/validation.py
def check_random_state(seed):
    """Turn seed into a np.random.RandomState instance

    If seed is None (or np.random), return the RandomState singleton used
    by np.random.
    If seed is an int, return a new RandomState instance seeded with seed.
    If seed is already a RandomState instance, return it.
    Otherwise raise ValueError.
    """
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)

The noise addtion in AffinityPropagation.run should then be

def run(...., noise=False, seed=None):
    if noise:
        rng = check_random_state(seed)
        matndarray += matndarray *  rng.uniform(0, 1e-16, size=matndarray.shape)

This will allow to pass a separate seed for each python process ensuring that the random numbers are not correlated. Additionally the numpy uses the Mersenne Twister, a widely accepted RNG for scientific applications.

@mtiberti if it doesn't matter that the added noise might be correlated it would be nice if you can still add a comment in the C code.

mtiberti commented 7 years ago

Thanks for this. Since noise is added to remove degeneracies from the similarity matrix it shouldn't be a problem if it's slightly correlated - a potentially problematic case would be when the same noise is added to elements of the similarity matrix which are identical, which I think is unlikely. However since the change you're proposing is not very time-consuming and still adds to the robustness of the code, I'm going to implement it. The only problem is that I'm about to leave for holidays and I'll be back on the 26th of July and I'll unlikely be able to work on it before then

mtiberti commented 1 year ago

a temporary solution would be the ability to pass a different random seed to each independent process to ensure that they are uncorrelated, as suggested in #40. The quality of RNG is not very critical for this application