LSSTDESC / rail_attic

Redshift Assessment Infrastructure Layers
MIT License

Addressing the last few remaining places where randoms aren't using default_rng and/or a configurable seed #354

Closed sschmidt23 closed 1 year ago

sschmidt23 commented 1 year ago

Addressing #41, I looked for instances where we were not using `numpy.random.default_rng` and/or were not setting the seed from a config parameter for the stage. With all stages that use a random number generator using `default_rng` and a configurable seed, we can isolate the effects of the RNG to a particular stage and easily change the seed via the config parameter to test the effect of the randoms on stage performance.
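To illustrate the pattern being adopted, here is a minimal, hypothetical stage sketch (the class and parameter names are illustrative, not RAIL's actual API): the seed comes from a config-style parameter, so the stage's randomness is reproducible and can be varied independently of other stages.

```python
import numpy as np

class ExampleStage:
    """Hypothetical stage: seed is a config parameter, like self.config.seed in RAIL."""

    def __init__(self, seed=87):
        self.seed = seed

    def run(self, n):
        # default_rng gives an isolated Generator; the same seed
        # reproduces the same draws on every run of the stage.
        rng = np.random.default_rng(seed=self.seed)
        return rng.uniform(0.0, 3.0, n)

a = ExampleStage(seed=12).run(5)
b = ExampleStage(seed=12).run(5)
assert np.allclose(a, b)  # same seed -> identical draws
```

Changing only the seed in the config then changes only this stage's random draws, which is exactly what makes per-stage sensitivity tests easy.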

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (82da866) 100.00% compared to head (6e4ccc4) 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              main      #354   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           38        38
  Lines         2502      2582    +80
=========================================
+ Hits          2502      2582    +80
```

| Flag | Coverage Δ |
|---|---|
| unittests | `100.00% <100.00%> (ø)` |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| src/rail/creation/degradation/grid_selection.py | `100.00% <100.00%> (ø)` |
| src/rail/estimation/algos/knnpz.py | `100.00% <100.00%> (ø)` |
| src/rail/estimation/algos/randomPZ.py | `100.00% <100.00%> (ø)` |

... and 1 file with indirect coverage changes.


eacharles commented 1 year ago

Maybe it makes sense to use the original seed plus the chunk number, so each chunk gets a different seed, but in a deterministic way.

Or am I missing something?

-e

On Apr 28, 2023, at 3:36 PM, Drew Oldag @.***> wrote:

@drewoldag approved this pull request.

This looks ok to me. Thanks for taking care of it.

In src/rail/estimation/algos/randomPZ.py https://github.com/LSSTDESC/RAIL/pull/354#discussion_r1180858550:

```diff
@@ -35,7 +36,8 @@ def _process_chunk(self, start, end, data, first):
         pdf = []
         # allow for either format for now
         numzs = len(data[self.config.column_name])
-        zmode = np.round(np.random.uniform(0.0, self.config.rand_zmax, numzs), 3)
+        rng = np.random.default_rng(seed=self.config.seed)
```

This is probably fine, but it stood out a little. The goal here is that every time `_process_chunk` is called, it would use the same random values, not new random values for each call to `_process_chunk`, correct?
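The reviewer's point can be demonstrated with a minimal sketch (the function name here is hypothetical, standing in for `_process_chunk`): re-creating the generator with the same seed inside the chunk function means every chunk draws the identical sequence.

```python
import numpy as np

def process_chunk(seed, n):
    # The generator is re-created with the same seed on every call,
    # so every chunk gets exactly the same random draws.
    rng = np.random.default_rng(seed=seed)
    return rng.uniform(0.0, 3.0, n)

first_chunk = process_chunk(87, 4)
second_chunk = process_chunk(87, 4)
assert np.allclose(first_chunk, second_chunk)  # identical draws per chunk
```

Whether that is a bug or a feature depends on whether chunks should be statistically independent, which is the question the rest of the thread resolves.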

In src/rail/estimation/algos/knnpz.py https://github.com/LSSTDESC/RAIL/pull/354#discussion_r1180859269:

```diff
@@ -106,8 +106,8 @@ def run(self):
         trainszs = np.array(training_data[self.config.redshift_column_name])
         colordata = _computecolordata(knndf, self.config.ref_column_name, self.config.column_names)
         nobs = colordata.shape[0]
-        rng = np.random.default_rng
-        perm = rng().permutation(nobs)
+        rng = np.random.default_rng(seed=self.config.seed)
```

No need to update, but this file is using 0 as the default seed - just seems like we could do better 🤷
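For context on the default-seed remark: `seed=0` is a perfectly legal, reproducible seed for `default_rng`; omitting the seed instead draws fresh OS entropy each time. A minimal illustration:

```python
import numpy as np

# seed=0 is valid and fully reproducible...
perm_a = np.random.default_rng(seed=0).permutation(10)
perm_b = np.random.default_rng(seed=0).permutation(10)
assert (perm_a == perm_b).all()

# ...whereas no seed means a fresh, unpredictable stream each time.
rng_unseeded = np.random.default_rng()
```

The concern is purely cosmetic/conventional: a default of 0 can look like "unset" even though it behaves like any other seed.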


sschmidt23 commented 1 year ago

I hadn't even thought of the chunk number, that's a good point, Eric. I think you're right: in our current setup, calling `numpy.random.default_rng(seed=self.config.seed)` inside `_process_chunk` resets the generator to the same seed at the start of each chunk; setting the seed to `seed + chunknum` fixes that. I guess the other option would be to create the rng in the init as `self.rng`, so that it is set up only once during init rather than reset on each call to `_process_chunk`.
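The two options discussed can be sketched as follows (names are hypothetical stand-ins, not RAIL's API): deriving a per-chunk seed deterministically, versus creating one generator up front and letting successive chunks continue the same stream.

```python
import numpy as np

# Option 1: deterministic per-chunk seed (base seed + chunk identifier).
def process_chunk_offset(base_seed, chunk_start, n):
    rng = np.random.default_rng(seed=base_seed + chunk_start)
    return rng.uniform(0.0, 3.0, n)

# Option 2: one generator created once (e.g. in __init__), reused across chunks.
class Stage:
    def __init__(self, seed=87):
        self.rng = np.random.default_rng(seed=seed)

    def process_chunk(self, n):
        # Each call advances the same stream, so chunks differ
        # but a full run is still reproducible from the seed.
        return self.rng.uniform(0.0, 3.0, n)

a = process_chunk_offset(87, 0, 3)
b = process_chunk_offset(87, 100, 3)
assert not np.allclose(a, b)  # different chunks, different draws
```

Option 1 has the advantage that a chunk's draws depend only on the config seed and the chunk identifier, not on the order or number of chunks processed, which matters when chunks run in parallel.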

sschmidt23 commented 1 year ago

Checking through RAIL, the only parallelized estimator that uses an rng in `_process_chunk` is randomPZ; I changed the seed initialization to `seed = self.config.seed + start`. All other uses of a random number generator are in non-parallelized functions and thus do not need this change. However, any future parallelized stages that draw random numbers in a chunked function will have to be initialized similarly (e.g. the open PR on somocluSOM; I'll push a change to that branch now as well).
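A minimal sketch of the adopted fix (function signature is illustrative, modeled on `_process_chunk(self, start, end, ...)`): seeding each chunk's generator with the config seed plus the chunk's start index makes runs reproducible while decorrelating chunks.

```python
import numpy as np

def process_chunk(config_seed, start, end, rand_zmax=3.0):
    # Adopted scheme: per-chunk seed = config seed + chunk start index.
    rng = np.random.default_rng(seed=config_seed + start)
    return np.round(rng.uniform(0.0, rand_zmax, end - start), 3)

# Two full runs over the same chunking are identical chunk-by-chunk...
run1 = [process_chunk(87, s, s + 4) for s in (0, 4, 8)]
run2 = [process_chunk(87, s, s + 4) for s in (0, 4, 8)]
assert all(np.allclose(a, b) for a, b in zip(run1, run2))
```

One caveat worth noting: because the chunk seed depends on `start`, reproducibility holds only for a fixed chunk size; rechunking the data changes the per-chunk seeds.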

I'll look through GPz_v1, FlexZBoost, Delight, and BPZ_Lite now to see if I need to make any changes on those repos.