Pass a seed to randn1d - Githubissues

0todd0000 / rft1d

One-Dimensional Random Field Theory in Python

https://spm1d.org/rft1d/

GNU General Public License v3.0


Pass a seed to randn1d #6

Closed lele394 closed 8 months ago

lele394 commented 8 months ago

Hello,

I've recently had to use the rft1d library to construct Gaussian random fields. I needed to sweep the smoothing parameter to build a heatmap of my fields for different values. Since I couldn't pass a seed to the generator, I couldn't do it. I did, however, modify the library by duplicating the random file and modifying both generators: I added a seed parameter to the constructor and set that seed at the beginning of each generate_sample().

Modified generator __init__.

def __init__(self, nResponses=1, nodes=101, FWHM=10, pad=False, seed=1234):
    super(Generator1D, self).__init__()

    self.seed = seed

Modified start of generate_sample()

def generate_sample(self):
    np.random.seed(seed=self.seed)

Are you interested in me pushing that change?

0todd0000 commented 8 months ago

Thank you for this post.

Quick reply: Adding a seed argument might be OK provided there is a compelling case against using np.random.seed itself.

Details: The random number generators in rft1d are meant to be controlled using np.random.seed like this:

import numpy as np
import rft1d

np.random.seed(0)
a = rft1d.randn1d(5, 101, 25)
b = rft1d.randn1d(5, 101, 25)  # different from "a"
np.random.seed(0)
c = rft1d.randn1d(5, 101, 25)  # same as "a"
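
The pattern above can be checked with assertions. The following is a minimal sketch that assumes only NumPy: draw_fields is a hypothetical stand-in (not part of rft1d) that draws from NumPy's global RNG state, which is how rft1d.randn1d is described as behaving above. With rft1d installed, the same checks should apply to rft1d.randn1d.

```python
import numpy as np

def draw_fields(n_responses, n_nodes):
    # Hypothetical stand-in for rft1d.randn1d: draws from NumPy's global RNG state.
    return np.random.randn(n_responses, n_nodes)

np.random.seed(0)
a = draw_fields(5, 101)
b = draw_fields(5, 101)   # the RNG state has advanced, so "b" differs from "a"
np.random.seed(0)
c = draw_fields(5, 101)   # reseeding resets the state, so "c" equals "a"

same_after_reseed = np.array_equal(a, c)
differs_without_reseed = not np.array_equal(a, b)
```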

Using an extra seed parameter as you suggest is OK, but I am a bit reluctant to do this because it suggests...

... that rft1d has its own RNG seed control (which it doesn't)
... that this seed is somehow different than np.random.seed (which it isn't)

As a counter-example, imagine this:


np.random.seed(0)
x = np.random.randn(10)

np.random.seed(0)
a = rft1d.randn1d(5, 101, 25, seed=1234)
y = np.random.randn(10)  # different from "x"

That is, the user will need to keep track of BOTH np.random.seed calls AND seed values submitted to rft1d.randn1d, and the new seed parameter could render external calls to np.random.seed meaningless.

I think that this type of use case suggests that RNG seeding should not be handled inside rft1d, but please feel free to provide an example or two of how a seed keyword argument could be useful beyond simple calls to np.random.seed.

lele394 commented 8 months ago

Hello, thanks for your reply. I did try using NumPy's random.seed() without any success: the seed was always new when using randn1d. My use case is really about reproducibility of the results. It may be an issue on my part, and I haven't invested much time looking into it as it was just a prototype. I'll be back on it Thursday and will take a better look at the way I implemented it.

My solution wasn't to add new parameters to randn1d, but to add a separate file alongside random that could be called something like SetRandom. I basically just duplicated the file and modified it to my liking. Another solution could be to add a generate_seeded_sample function with a seed parameter defaulting to None, and to check that parameter when generating samples with the new function. That would solve the problem of keeping track of two seeds, while allowing users to pass the seed they want directly as an argument.

I actually wasn't aware that random generation was controlled using NumPy. I did take a look in the wiki without finding any mention of it. I figured that out by looking through the code. Whichever way you'd rather go, I'd recommend adding a small section describing how to introduce reproducibility when using the library.

Hope that helps, Léo

0todd0000 commented 8 months ago

I'd recommend adding a small section describing how to introduce reproducibility when using the library.

I agree. I will add a reproducibility example to the online documentation. Before I do that I'd like to try to resolve this issue to ensure that rft1d covers intended reproducibility use cases...

I understand your idea regarding generate_seeded_sample, but I don't think this solves the problem given the counter-example above unless generate_seeded_sample gets then sets the RNG state. However, I am struggling to think of an example where this might be necessary.

Can you please provide a code snippet that demonstrates why generate_seeded_sample might be useful beyond use of np.random.seed?

Or do you think it would be sufficient to add a reproducibility example to the documentation?
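
The "gets then sets the RNG state" idea mentioned in this comment could look roughly like the following. This is only a sketch under stated assumptions: seeded_sample and draw_fields are hypothetical names, not rft1d functions, and draw_fields merely stands in for any generator that uses NumPy's global RNG. np.random.get_state and np.random.set_state are the NumPy calls that save and restore that global state.

```python
import numpy as np

def draw_fields(n_responses, n_nodes):
    # Hypothetical stand-in for a generator that uses NumPy's global RNG.
    return np.random.randn(n_responses, n_nodes)

def seeded_sample(n_responses, n_nodes, seed=None):
    # Hypothetical wrapper: seed locally, then restore the caller's RNG state.
    if seed is None:
        return draw_fields(n_responses, n_nodes)
    state = np.random.get_state()      # save the global state
    try:
        np.random.seed(seed)
        return draw_fields(n_responses, n_nodes)
    finally:
        np.random.set_state(state)     # restore it, even if an error occurs

np.random.seed(0)
x = np.random.randn(10)

np.random.seed(0)
a = seeded_sample(5, 101, seed=1234)
y = np.random.randn(10)                # same as "x": the outer seed is still in effect
```

With this design the inner seed no longer overrides the outer np.random.seed call, at the cost of extra state handling on every seeded draw.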

lele394 commented 8 months ago

Sorry, due to the project I'm working on, I can't share any code snippet right now. I can, however, give you an overall view of what I need it for. I'm basically creating training datasets for a neural network. The first reason I needed this feature was to do a parameter sweep to build a "map" of the "zoom" of the impact of the parameter. Sadly, I need to set the seed before every run to get consistent results. Using that map I can then define parameter ranges for my program. I basically use multiple random Gaussian fields the same way you'd use Perlin noise octaves to create terrain height maps, but in 1D.

The second is that I'd like to study the impacts of other parameters of my program on the output of said neural networks while not changing the others (including the RGFs). Due to the size of the data we're talking about, let's just say that saving it to the hard drive is not really feasible, as it will also be shared with other people. I'd like to avoid having to send multiple gigs of data, and would rather ship a script able to generate my dataset.

My solution using generate_seeded_sample is actually irrelevant now; I may not have implemented my seed the right way. As I said, it was nothing more than a prototype I pieced together as a proof of concept. Adding a reproducibility example to the documentation will most likely be sufficient.

By the way, I have spotted some anomalies when plotting the "map" I mentioned above. For complete transparency, I don't really understand all the maths behind randn1d. But I did spot an oddity: there are something like 3 "bands" when sweeping the smoothing parameter, before it goes completely bonkers (see linked image). I'll open a separate issue when I get to it, since it has nothing to do with reproducibility. The plot axes are not to scale and not linear. I can spot 3 bands though (0-150, 150-350, 350-550). Is that expected behavior?

0todd0000 commented 8 months ago

I don't quite understand what the horizontal and vertical axes represent in the map above, so please do open a separate issue with a description of the axes, and preferably also with a colormap.

Back to the random seed issue:

From your description it sounds like the problem can be solved just by calling np.random.seed before each call to rft1d.random functions, something like this:

for i in range(1000):
    np.random.seed(i)
    a = rft1d.randn1d(8, 101, 25)

Does this adequately describe your use case?

Does this give you the seeding control you need?

lele394 commented 8 months ago

Does this adequately describe your use case?

Not quite, see below.

for i in range(1000):
    np.random.seed(1234)
    a = rft1d.randn1d(1, 1000, i)

I sweep the smoothing parameter, not the seed. It's not relevant to our issue though.
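
A sweep like this can be collected into a 2D array for a heat map. The following is a minimal sketch assuming only NumPy: smooth_noise is a hypothetical stand-in for the rft1d.randn1d call above (white noise, circularly smoothed with a Gaussian kernel, then rescaled), not rft1d's actual algorithm, and the FWHM grid is an arbitrary illustration.

```python
import numpy as np

def smooth_noise(n_nodes, fwhm):
    # Hypothetical stand-in for rft1d.randn1d(1, n_nodes, fwhm): white noise
    # from NumPy's global RNG, circularly smoothed with a Gaussian kernel.
    noise = np.random.randn(n_nodes)
    sd = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))   # convert FWHM to standard deviation
    freq = np.fft.rfftfreq(n_nodes)                   # frequencies in cycles per node
    gain = np.exp(-2.0 * (np.pi * freq * sd) ** 2)    # Fourier transform of the Gaussian kernel
    y = np.fft.irfft(np.fft.rfft(noise) * gain, n=n_nodes)
    return y / y.std()                                # rescale to unit sample variance

fwhm_values = np.arange(1, 501, 50)   # illustrative grid: 1, 51, ..., 451
rows = []
for fwhm in fwhm_values:
    np.random.seed(1234)              # same seed before every draw
    rows.append(smooth_noise(1000, fwhm))
heatmap = np.vstack(rows)             # one row per FWHM value
```

Because the seed is reset before each draw, every row starts from the same white noise, so differences between rows reflect only the smoothing parameter.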

Does this give you the seeding control you need?

Yes, that works. The error was on my part, and the issue is solved on my end. I believe once you get a reproducibility example in the documentation, we'll be able to close this thread. Thanks a lot for your answers.

0todd0000 commented 8 months ago

OK, thank you for confirming!

I see what you mean now by sweeping the smoothness parameter. Although not directly related to this issue please note the following:

It becomes numerically difficult to accurately simulate standard-normal Gaussian fields when smoothness gets too high; ratios greater than about 0.5 (where the ratio is fwhm / field_size) may not yield accurate fields. So if you use a field size of 1000, as in your example, you should probably stop at around i=500.
You may want to set pad=False like this:

a = rft1d.randn1d(1, 1000, 300, pad=False)

Setting pad=False like this achieves two things:

It produces circular fields, where the fields can be stacked like this:

a = rft1d.randn1d(1, 1000, 300, pad=False)
b = np.hstack([a,a,a,a])  # repeating random field

It may solve the irregularities in your heat map because there is no padding. rft1d pads the random fields by default because this produces a more random (non-repeating) result. However, the amount of padding is a function of the smoothness parameter, so you may see odd results when sweeping smoothness. In other words:

np.random.seed(123)
a0 = rft1d.randn1d(1, 1000, 50)

np.random.seed(123)
a1 = rft1d.randn1d(1, 1000, 300)  # not directly comparable to "a0" because fields are padded

np.random.seed(123)
b0 = rft1d.randn1d(1, 1000, 50, pad=False)

np.random.seed(123)
b1 = rft1d.randn1d(1, 1000, 300, pad=False)  # "b1" and "b0" are more directly comparable than are "a1" and "a0"
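
The ratio guideline at the start of this comment can be turned into a cap on the sweep range. This is a small sketch in plain Python: max_recommended_fwhm is a hypothetical helper, and the default ratio of 0.5 is the approximate value quoted above, not a hard limit enforced by rft1d.

```python
def max_recommended_fwhm(field_size, max_ratio=0.5):
    # Largest FWHM that keeps fwhm / field_size at or below max_ratio.
    return max_ratio * field_size

field_size = 1000
limit = max_recommended_fwhm(field_size)

# Keep only the sweep values that respect the guideline.
fwhm_values = [f for f in range(50, 1000, 50) if f <= limit]
```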

0todd0000 commented 8 months ago

I have added a reproducibility example here and made this example accessible from the main examples menu. Do you think this is now clearer?

lele394 commented 8 months ago

Yes, what you added is exactly what I was missing. Reproducibility, and the right way to achieve it, is now correctly documented, which should make it clear for future users.

Looks like we're done here, I'll close the issue.