Fix numpy seed for SnR random shuffle to reliably reproduce a map (or make the seed configurable)

EdwardBerman commented 1 month ago

np.random.seed(42) should do the trick, though there may be a more modern approach. For example, if you put this in a file,

import numpy as np
np.random.seed(42)
array = np.array([1, 2, 3, 4, 5])
np.random.shuffle(array)
print(array)

and run it, you will get the same result each time. Seeding the SnR shuffle the same way will give you the same SnR map on every run.
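
On the "more modern approach": NumPy's documentation recommends the Generator API (np.random.default_rng) for new code, rather than seeding the global state. A minimal sketch of the same example in that style follows; the helper name shuffled_copy is hypothetical and not part of SMPy.

```python
import numpy as np

def shuffled_copy(values, seed):
    """Return a shuffled copy of `values` using a locally seeded Generator."""
    rng = np.random.default_rng(seed)  # local generator; global state is untouched
    out = np.array(values)             # copy, so the caller's data is not modified
    rng.shuffle(out)                   # in-place shuffle of the copy
    return out

first = shuffled_copy([1, 2, 3, 4, 5], seed=42)
second = shuffled_copy([1, 2, 3, 4, 5], seed=42)
print(first, second)  # the two arrays match because the seed is the same
```

Because the generator is a local object, other code that draws random numbers cannot disturb the sequence.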

GeorgeVassilakis commented 1 week ago

You're right, a more modern approach was required. Instead of providing one seed for n maps (which would produce no variance), I pass a new seed, offset by the loop index i, so that each map gets a random but reproducible shuffle. This can be seen in the new generate_multiple_shear_dfs() function here: https://github.com/GeorgeVassilakis/SMPy/blob/758b640f009f3e7f6555373bdceaa0e798399355/SMPy/utils.py#L227

Thanks for your help @EdwardBerman :)

EdwardBerman commented 5 days ago

Are you positive that you need to make several seeds? I don't think there's anything wrong with that, but I'm fairly certain you can create a reproducible sequence of random shuffles with just one seed.

GeorgeVassilakis commented 4 days ago

I believe so @EdwardBerman. As far as I can tell, if I passed one seed to the _shuffle_ra_dec() function, the list of galaxies would be reshuffled with random.seed(seed=42) for each of the num_shuffles shuffled maps (indexed by i), meaning that there would be zero variance: they'd get shuffled the same way every time.

For `i` shuffled maps in the for loop on line 226: https://github.com/GeorgeVassilakis/SMPy/blob/45edeaccad2a5d0a60d0c1c97ce5a2f0e8b1491f/smpy/utils.py#L226

As it is currently implemented:

First shuffled map out of i maps gets passed a seed of 42, making a shuffled map with seed 42. Second shuffled map out of i maps gets passed a seed of 43, making a shuffled map with seed 43. etc.

If I just passed seed 42:

First shuffled map out of i maps gets passed a seed of 42, making a shuffled map with seed 42. Second shuffled map out of i maps also gets passed a seed of 42, making it identical to the first one. etc.

So, because _shuffle_ra_dec() gets called num_shuffles times, I believe it needs a new seed every time, because each call is independent of the maps created before and after it. This is how I've wrapped my brain around it; let me know if my reasoning makes sense to you. If not, let's discuss further, because I want this to be done properly!
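
The two cases described above can be sketched as follows. This is only an illustration under stated assumptions: shuffle_with_seed is a hypothetical stand-in for a function that reseeds on every call, and it is not the actual _shuffle_ra_dec() implementation.

```python
import numpy as np

def shuffle_with_seed(values, seed):
    # Hypothetical stand-in: a fresh generator is seeded on every call.
    rng = np.random.default_rng(seed)
    return rng.permutation(values)

data = np.arange(50)   # placeholder for the galaxy list
num_shuffles = 3
base_seed = 42

# Same seed on every call: every realization is the same permutation.
same_seed = [shuffle_with_seed(data, base_seed) for _ in range(num_shuffles)]

# Seed offset by the loop index: realizations differ but are reproducible.
offset_seed = [shuffle_with_seed(data, base_seed + i) for i in range(num_shuffles)]
offset_again = [shuffle_with_seed(data, base_seed + i) for i in range(num_shuffles)]

print(all(np.array_equal(same_seed[0], s) for s in same_seed))        # identical maps
print(all(np.array_equal(a, b) for a, b in zip(offset_seed, offset_again)))  # reproducible
```

The zero-variance problem arises only because the seed is reset inside each call; it says nothing yet about what happens when the generator is seeded once outside the loop.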

GeorgeVassilakis commented 4 days ago

@EdwardBerman Sayan is telling me that you're right, so I'll check that out and probably remove the extra code if it's redundant. Stand by!

EdwardBerman commented 4 days ago

Okay! We can discuss in person, but yeah, I agree with Sayan.

Consider this minimal example:

import random

seed = 42
random.seed(seed)

data = [1, 2, 3, 4, 5]

shuffles = []
for _ in range(3):  
    shuffled_data = data[:]
    random.shuffle(shuffled_data)
    shuffles.append(shuffled_data)

for i, shuffle in enumerate(shuffles, 1):
    print(f"Shuffle {i}: {shuffle}")

Every time I run this in some test.py, I get the same sequence of three shuffles, but the three shuffles within a run are not identical to one another. I get

Shuffle 1: [4, 2, 3, 5, 1]
Shuffle 2: [4, 3, 1, 5, 2]
Shuffle 3: [4, 2, 3, 1, 5]

and then one more time as a sanity check

Shuffle 1: [4, 2, 3, 5, 1]
Shuffle 2: [4, 3, 1, 5, 2]
Shuffle 3: [4, 2, 3, 1, 5]

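See how shuffle 1 matches shuffle 1 across both runs, and the same goes for 2 and 3, but shuffles 1, 2, and 3 are not the same as one another. Seeding once before the loop is enough: each call to random.shuffle advances the generator's state, so successive shuffles differ while the whole sequence stays reproducible.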
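
The same single-seed idea can be sketched with a NumPy Generator that is created once and reused across the loop. The function name make_shuffles and its parameters are hypothetical, chosen for illustration rather than taken from SMPy.

```python
import numpy as np

def make_shuffles(values, num_shuffles, seed):
    """Return num_shuffles permutations of `values` drawn from one seeded generator."""
    rng = np.random.default_rng(seed)  # seeded once, outside the loop
    # Each permutation() call advances rng's state, so the results differ.
    return [rng.permutation(values) for _ in range(num_shuffles)]

data = np.arange(50)  # placeholder for the galaxy list
run_a = make_shuffles(data, num_shuffles=3, seed=42)
run_b = make_shuffles(data, num_shuffles=3, seed=42)

# Shuffle k in run_a matches shuffle k in run_b ...
print(all(np.array_equal(a, b) for a, b in zip(run_a, run_b)))
# ... while shuffles within one run are not the same permutation.
print(np.array_equal(run_a[0], run_a[1]))
```

With this layout only one seed has to be stored to reproduce the full set of shuffled maps.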

EdwardBerman commented 4 days ago

Potentially also make the seed an input to the config, so a user can convince themselves that the SnR is not a fluke from one realization of sampling random shuffles.
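
One way this could look, as a rough sketch only: the config keys "seed" and "num_shuffles" and the helper rng_from_config are assumptions made for illustration, not existing SMPy options.

```python
import numpy as np

def rng_from_config(config):
    """Build a Generator from a config dict with an optional 'seed' entry."""
    seed = config.get("seed")  # None makes default_rng draw fresh OS entropy
    return np.random.default_rng(seed)

config = {"num_shuffles": 3, "seed": 42}  # hypothetical config keys
data = np.arange(50)                      # placeholder for the galaxy list

rng = rng_from_config(config)
shuffles = [rng.permutation(data) for _ in range(config["num_shuffles"])]

# Re-running with the same config reproduces the same shuffles.
rng_again = rng_from_config(config)
shuffles_again = [rng_again.permutation(data) for _ in range(config["num_shuffles"])]
print(all(np.array_equal(a, b) for a, b in zip(shuffles, shuffles_again)))
```

Changing the seed value in the config and rerunning then gives an independent set of shuffles to compare the SnR against.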
