abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License

Respect random seed, make randomness deterministic #83

Closed philieeas closed 4 months ago

philieeas commented 4 months ago

Hello, first of all this library is great and a joy to work with.

I am using it to generate some random test values, but I am not sure how to make the random outputs deterministic with a seed. I tried setting the random seed using pl.set_random_seed, but the outputs still differ between runs. Basically, I expect a and b to be the same in the following example:

import polars as pl
import polars_ds as pld

print("polars version:", pl.__version__)
print("polars-ds version:", pld.__version__)
pl.set_random_seed(1)
a = pl.select(x=pl.arange(100) / pl.lit(100).alias("x")).select(y=pl.col("x").stats.sample_normal(std=1.0))
pl.set_random_seed(1)
b = pl.select(x=pl.arange(100) / pl.lit(100).alias("x")).select(y=pl.col("x").stats.sample_normal(std=1.0))
print(a)
print(b)
assert a.equals(b)

It gives me this output, the values differ:

polars version: 0.20.8
polars-ds version: 0.3.2
shape: (100, 1)
┌───────────┐
│ y         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.035035  │
│ -0.312872 │
│ 1.363257  │
│ 0.270373  │
│ -1.243942 │
│ …         │
│ 0.077214  │
│ -0.1047   │
│ -0.34536  │
│ 1.370966  │
│ -0.05009  │
└───────────┘
shape: (100, 1)
┌───────────┐
│ y         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.749197  │
│ 0.75136   │
│ 0.66185   │
│ 0.518991  │
│ -1.892289 │
│ …         │
│ -0.253558 │
│ -0.382351 │
│ 1.605878  │
│ 2.428654  │
│ 1.203619  │
└───────────┘
Traceback (most recent call last):
  File "expl.py", line 12, in <module>
    assert a.equals(b)
AssertionError

I am not sure how to fix this. Is there a way to reset the random number generator from polars, or are you using your own? Is this use case actually in scope? I can imagine it being difficult with the parallelization that polars does in the background.
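On the parallelization concern: one general way to keep parallel sampling reproducible is to give every chunk its own generator, seeded from the base seed plus the chunk index, so the result does not depend on which worker runs first. The sketch below uses only the Python standard library. `sample_chunk` and `sample_parallel` are hypothetical names for illustration, and this is not a description of how polars or polars-ds handle randomness internally.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_chunk(base_seed: int, chunk_index: int, size: int) -> list[float]:
    # Hypothetical helper: each chunk owns a generator seeded from
    # (base_seed, chunk_index). String seeds are hashed deterministically.
    rng = random.Random(f"{base_seed}:{chunk_index}")
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def sample_parallel(base_seed: int, n_chunks: int, chunk_size: int) -> list[float]:
    # Hypothetical helper: draw all chunks on a thread pool.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in submission order, regardless of
        # which worker thread finishes first.
        chunks = pool.map(
            lambda i: sample_chunk(base_seed, i, chunk_size),
            range(n_chunks),
        )
        return [value for chunk in chunks for value in chunk]


a = sample_parallel(base_seed=1, n_chunks=4, chunk_size=25)
b = sample_parallel(base_seed=1, n_chunks=4, chunk_size=25)
print(a == b)
```

The trade-off is that the output depends on how the data is split into chunks, so the chunking itself has to be fixed for a given seed.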

abstractqqq commented 4 months ago

I played with the code a little, and it seems that a seeded random generator doesn't break any code (based on very limited checks). I will give it a try after finishing up some matrix profile work.

Disclaimer: I'm not an expert on seeding and random numbers, but let's see how much we can get away with by just switching the RNG to a seeded one.
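To illustrate what switching to a seeded generator means for the caller, here is a minimal sketch in plain Python. The `sample_normal` function below is a hypothetical stand-in written with the standard library, and its `seed` parameter is an assumption for illustration. It is not the polars-ds API and does not reproduce the actual change.

```python
import random
from typing import Optional


def sample_normal(
    n: int,
    mean: float = 0.0,
    std: float = 1.0,
    seed: Optional[int] = None,
) -> list[float]:
    # Hypothetical stand-in, not the polars-ds function.
    # seed=None lets the generator seed itself from system entropy,
    # so every call returns fresh values. An integer seed makes the
    # sequence repeatable.
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


a = sample_normal(100, seed=1)
b = sample_normal(100, seed=1)
assert a == b  # same seed, same values
```

Passing the seed explicitly keeps the generator local to the call, which avoids depending on global state that a separately compiled extension may not see.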

abstractqqq commented 4 months ago

https://github.com/abstractqqq/polars_ds_extension/commit/0a0e456030b30cbd2be48c9f4207140195d42748