std.random.RandomSample and RandomCover are poorly designed

dlangBugzillaToGithub commented 13 years ago

dlang-bugzilla (CyberShadow) reported this on 2011-12-05T04:00:22Z

Transfered from https://issues.dlang.org/show_bug.cgi?id=7067

CC List

alex (alexrp)
andrei (andralex)
b.helyer
bearophile_hugs
code (MartinNowak)
dmitry.olsh (DmitryOlshansky)
gassa
issues.dlang (jmdavis)
jens.k.mueller
joseph.wakeling
monarchdodra

Description

The following tests will always fail:

    int[] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ];
    assert(!equal(randomCover(a, rndGen()), randomCover(a, rndGen())));
    assert(!equal(randomSample(a, 5, rndGen()), randomSample(a, 5, rndGen())));

The reason why these tests will fail is that both functions take the RNG by value. Not only is this unintuitive, this is also a security liability - someone depending on these functions to generate random identifiers can be surprised that two successive calls generate the same ID.

I strongly suggest that RNG types are disallowed from being implicitly copied. Even though pseudo-random number generators shouldn't be used for security purposes, they're still likely to be used in such contexts - especially considering lack of better sources of random data in Phobos.

dlangBugzillaToGithub commented 13 years ago

bearophile_hugs commented on 2011-12-05T04:09:15Z

See also issue 6593

dlangBugzillaToGithub commented 12 years ago

andrei commented on 2011-12-05T21:28:32Z

(In reply to comment #0)
> The following tests will always fail:
> 
>     int[] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ];
>     assert(!equal(randomCover(a, rndGen()), randomCover(a, rndGen())));
>     assert(!equal(randomSample(a, 5, rndGen()), randomSample(a, 5, rndGen())));

Good point.

> The reason why these tests will fail is that both functions take the RNG by
> value.
>
> Not only is this unintuitive, this is also a security liability -
> someone depending on these functions to generate random identifiers can be
> surprised that two successive calls generate the same ID.

I think we can safely eliminate this argument from the discussion.

> I strongly suggest that RNG types are disallowed from being implicitly copied.
> Even though pseudo-random number generators shouldn't be used for security
> purposes, they're still likely to be used in such contexts - especially
> considering lack of better sources of random data in Phobos.

The problem with taking a random generator by reference is that it then needs to be escaped. So people would be quite surprised to see that:

auto gen = rndGen;
return randomSample(a, 5, gen);

has undefined behavior.

One way or another we need to solve this, e.g. by creating a wrapper with reference semantics over generators. Ideas are welcome.

dlangBugzillaToGithub commented 12 years ago

bearophile_hugs commented on 2011-12-05T23:53:48Z

(In reply to comment #2)

> The problem with taking a random generator by reference is that it then needs
> to be escaped. So people would be quite surprised to see that:
> 
> auto gen = rndGen;
> return randomSample(a, 5, gen);
> 
> has undefined behavior.
> 
> One way or another we need to solve this, e.g. by creating a wrapper with
> reference semantics over generators. Ideas are welcome.

Turn random generators into final classes?

dlangBugzillaToGithub commented 12 years ago

andrei commented on 2011-12-06T07:53:28Z

> Turn random generators into final classes?

We have backward compatibility to worry about.

dlangBugzillaToGithub commented 12 years ago

dlang-bugzilla commented on 2011-12-06T08:04:49Z

The disadvantages of breaking backwards compatibility need to be considered on a case-by-case basis. I think that turning RNGs into reference types has the potential to be a relatively low-impact change, while also having a good chance of revealing broken code.

The typical usage of std.random does not involve using the RNG types directly, and rather using the implicit thread-local RNG.

An example of affected code:

Random r;
// ... use r

I think that generally allowing such code is a mistake, because it's not clear that the RNG is not seeded.

auto r = Random(42);

We can implement this for backwards compatibility using static opCall (which shall be scheduled for deprecation).

The biggest problem is intentional usage of value semantics (it would transparently turn into reference semantics). Perhaps there's something we could do with the help of opAssign?

dlangBugzillaToGithub commented 12 years ago

bearophile_hugs commented on 2011-12-06T09:49:20Z

(In reply to comment #5)

> The biggest problem is intentional usage of value semantics (it would
> transparently turn into reference semantics).

I suggest to ignore such cases, they are probably uncommon, and add a warning note to the changelog.

dlangBugzillaToGithub commented 12 years ago

alex commented on 2011-12-07T04:21:37Z

(In reply to comment #4)
> > Turn random generators into final classes?
> 
> We have backward compatibility to worry about.

Sometimes breaking changes can be a good thing. It seems to me like with the current design, it's more likely that people are Doing It Wrong than the opposite. Keeping backwards compatibility also means allowing people to continue doing the same mistakes.

dlangBugzillaToGithub commented 12 years ago

b.helyer commented on 2011-12-08T00:02:29Z

To add an anecdote to this, when I first tried passing Random instances around, I was surprised to find they weren't reference types; it wasn't obvious to me at all.

dlangBugzillaToGithub commented 12 years ago

issues.dlang commented on 2012-06-15T01:02:22Z

We can make the random number generator ranges reference types without breaking code simply by creating inner structs for them which hold their member variables and sit on the heap. The only code which would break because of this would be code which relied on them being value types, which (as discussed) is almost certainly a bug.

dlangBugzillaToGithub commented 12 years ago

jens.k.mueller commented on 2012-06-19T07:30:43Z

(In reply to comment #2)
> (In reply to comment #0)
> > The following tests will always fail:
> > 
> >     int[] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ];
> >     assert(!equal(randomCover(a, rndGen()), randomCover(a, rndGen())));
> >     assert(!equal(randomSample(a, 5, rndGen()), randomSample(a, 5, rndGen())));
> 
> Good point.
> 
> > The reason why these tests will fail is that both functions take the RNG by
> > value.
> >
> > Not only is this unintuitive, this is also a security liability -
> > someone depending on these functions to generate random identifiers can be
> > surprised that two successive calls generate the same ID.
> 
> I think we can safely eliminate this argument from the discussion.
> 
> > I strongly suggest that RNG types are disallowed from being implicitly copied.
> > Even though pseudo-random number generators shouldn't be used for security
> > purposes, they're still likely to be used in such contexts - especially
> > considering lack of better sources of random data in Phobos.
> 
> The problem with taking a random generator by reference is that it then needs
> to be escaped. So people would be quite surprised to see that:
> 
> auto gen = rndGen;
> return randomSample(a, 5, gen);
> 
> has undefined behavior.

Code like this is always incorrect, i.e. the problem is more general. I wonder whether a compiler can always flag those programs as invalid. Is it possible to solve the problem of escaping references to local variables in general?
If so it should probably be done that way. And a RNG should be made non-copyable to force pass by ref.
If it isn't possible (or inefficient or too time consuming, etc.), then std.random should be deprecated and std.random2 should replace it in the long run. I believe this is the best solution (but far from perfect) to handle design problems like this. I wish there was a better way to fix a wrong design decision. But working around (kind of against the language) is not future-proof.

dlangBugzillaToGithub commented 12 years ago

dmitry.olsh commented on 2012-06-19T07:40:01Z

> > auto gen = rndGen;
> > return randomSample(a, 5, gen);
> > 
> > has undefined behavior.
> 
> Code like this is always incorrect, i.e. the problem is more general. I wonder
> whether a compiler can always flag those programs as invalid. Is it possible to
> solve the problem of escaping references to local variables in general?
> If so it should probably be done that way. And a RNG should be made
> non-copyable to force pass by ref.
> If it isn't possible (or inefficient or too time consuming, etc.), then
> std.random should be deprecated and std.random2 should replace it in the long
> run. I believe this is the best solution (but far from perfect) to handle
> design problems like this. I wish there was a better way to fix a wrong design
> decision. But working around (kind of against the language) is not
> future-proof.

I had similar problems with redesign of std.regex. It has few less then ideal tradeoffs because of that but I've come to appreciate backwards compatibility.

In any case big value types like RNG that seldom get modified but often forwarded could use ref-counted COW. In D we are thread-local by default and thus COW is fun and cheap.

dlangBugzillaToGithub commented 12 years ago

bearophile_hugs commented on 2012-09-27T04:46:57Z

(In reply to comment #10)

> then
> std.random should be deprecated and std.random2 should replace it in the long
> run. I believe this is the best solution (but far from perfect) to handle
> design problems like this.

std.random2 seems a good solution.

dlangBugzillaToGithub commented 12 years ago

monarchdodra commented on 2012-09-27T13:51:58Z

(In reply to comment #12)
> (In reply to comment #10)
> 
> > then
> > std.random should be deprecated and std.random2 should replace it in the long
> > run. I believe this is the best solution (but far from perfect) to handle
> > design problems like this.
> 
> std.random2 seems a good solution.

As I've mentioned, I am almost ready with a non-breaking reference semantic fix. To avoid breaking code, the ranges are basically lazilly initialized.

As I've stated in the forums, I do not want to add ANY new functionality in random, which would have to be ditched for a migration to random2 anyways. My two ideas were:
* Explicit opSlice that returns an agressivilly initialized PRNG, for higher performance. They'd also have tighter safety.
* A PRNG.Payload alias for users that *really* want stack allocation.

Out of curiosity, if you had to write random2, what would you want in random2?
* I'd remove all constructors, and require an explicit call to either a seed function, or a free Make!PRNG fuction. I don't think classes is quite the right solution.
** This would address both the explicit initialization issue, as well as the "no-arg" initialization issue
* I'd also make them "only" input ranges. It doesn't make much sense to save a prng (though I'd still have .dup), but they wouldn't be forwardRanges.

Not much else actually.

I was planning on asking community feedback on potential evolutions when my developement was ready to go through (either evolving std.random, or forking to std.random2), but I wouldn't mind a pre-release opinion from users who have more hindsight of the issue ;)

PS: Related, I'd want std.container to require the exact same initialization scheme >:(

PPS: No spellchecker here, sorry.

dlangBugzillaToGithub commented 11 years ago

joseph.wakeling commented on 2013-06-20T16:02:07Z

I just want to put on record some remarks about short- and long-term solutions to this problem.

Long-term as I think we all agree we need RNGs to be reference types.  In this case the general design of ranges like RandomCover and RandomSample will be not entirely different from what RandomCover has now: the constructor will take a RNG as one of its arguments, and (as it's a reference type) this can be copied (by reference!) to an internal variable:

    struct SomeRandomRange(Rng)
    {
        private Rng _gen;
        ...
        this(/*stuff*/, Rng gen)
        {
            _gen = gen;
            ...
        }
        ...
    }

... and then you'd have some helper functions that would handle the case where the user doesn't specify an RNG by passing rndGen to the constructor.

However, while we're stuck with the situation we have, the design of RandomCover is statistically completely unsafe, as you'll inevitably get unwanted correlations due to storing the RNG by value.  As RandomCover's constructor insists on receiving an RNG, the only way to escape this is to pass it a new RNG seeded with unpredictableSeed.

What we can do, though, is to salvage things minimally by patching RandomCover to use the same technique as RandomSample -- that is, if the constructor gets passed an RNG, copy it; but if not -- call rndGen directly.  I'm going to prepare some patches to that effect, which will also try and do something about the .save methods of Random{Cover,Sample} -- both of which are currently broken.

It's putting a sticking plaster on a gaping wound, but at least it should mean that the simplest case available to the user -- doing something like this:

    auto c = randomCover(input);

... will be statistically reliable.

The caveat here is that we have to remember to tweak the constructors back to insisting on getting an RNG once we can guarantee that RNGs will behave as reference types.  (Is there any kind of template constraint that could be used to guarantee reference copying semantics?)

Anyway, I'll get on with writing the patches.

dlangBugzillaToGithub commented 11 years ago

bearophile_hugs commented on 2013-06-20T16:13:04Z

(In reply to comment #14)

> It's putting a sticking plaster on a gaping wound,

Why do you/we care so much for breaking backwards compatibility with something that is so broken? If you let it take the generator by ref I think you will remove from bugs from user code.

dlangBugzillaToGithub commented 11 years ago

joseph.wakeling commented on 2013-06-20T16:27:56Z

(In reply to comment #15)
> Why do you/we care so much for breaking backwards compatibility with something
> that is so broken? If you let it take the generator by ref I think you will
> remove from bugs from user code.

You can pass the generator to the constructor by ref, but you can't safely store it as a reference type, and so far as I can see you DO need to store it internally -- RandomCover is a range object, not a function where you can just use it and return.

_If_ you could safely store a reference to the RNG, then of course what you say would be right.  But you can't, not while they are value types.

My proposal won't break backwards compatibility (which is not in itself a good thing -- this back deserves to be broken:-) and won't fix the bigger picture, but it will improve the status quo without having to wait for the large-scale redesign that random number generation needs.

dlangBugzillaToGithub commented 11 years ago

joseph.wakeling commented on 2013-08-21T07:38:13Z

(In reply to comment #14)
> What we can do, though, is to salvage things minimally by patching RandomCover
> to use the same technique as RandomSample -- that is, if the constructor gets
> passed an RNG, copy it; but if not -- call rndGen directly.  I'm going to
> prepare some patches to that effect, which will also try and do something about
> the .save methods of Random{Cover,Sample} -- both of which are currently
> broken.

Just to note, I've submitted a pull request with updates along these lines:
https://github.com/D-Programming-Language/phobos/pull/1499

dlangBugzillaToGithub commented 11 years ago

github-bugzilla commented on 2013-08-29T00:34:46Z

Commit pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/8d9233cf8b9e4d27bd70dd0fcd171d2f6dc2f2c0
Partial fix for Issues 7067 and 10434 - std.random.RandomCover

The existing RandomCover design is fatally flawed because it
requires a RNG as input, which it then copies internally by
value.  So, unless the user is smart enough to pass something
like e.g. SomeRNG(unpredictableSeed), there will be unintended
correlations in random behaviour.

This partial fix follows the design of RandomSample in allowing
RandomCover to use the thread-global default RNG rndGen.  It
also improves the choice of template parameter and variable
names in line with Issue 10434.

dlangBugzillaToGithub commented 9 years ago

code commented on 2015-02-28T20:32:46Z

We had a talk about this during the 2nd D meetup in Berlin.
One remaining question was how to deal with memory management when using a reference type RNG.

I came up with a different solution and less invasive solution.

We simply disable implicit copying of RNGs (@disable this(this)) and ask people to use explicitly use .save when copying is indeed intended.

We can even deprecate postblit like so.

    deprecated("Please use .save to explicitly copy RNGs.")
    this(this) {}

Internally we'd need to change all copying to use std.algorithm.move.

If reference semantic is wanted people can use either
    new Random(unpredictableSeed)
or
    refCounted(rnd)
    refCounted(Random(unpredictableSeed))
or (unsafe)
    refRange(&rnd)
.

This should also work for combined random generators like.

new Exponential(Random(unpredictableSeed))

refCounted(Exponential(Random(unpredictableSeed)))

auto rng = Exponential(Random(unpredictableSeed));
refRange(&rng);

dlangBugzillaToGithub commented 9 years ago

joseph.wakeling commented on 2015-02-28T21:51:31Z

(In reply to Martin Nowak from comment #19)
> We had a talk about this during the 2nd D meetup in Berlin.
> One remaining question was how to deal with memory management when using a
> reference type RNG.
> 
> I came up with a different solution and less invasive solution.
> 
> We simply disable implicit copying of RNGs (@disable this(this)) and ask
> people to use explicitly use .save when copying is indeed intended.
> 
> We can even deprecate postblit like so.
> 
>     deprecated("Please use .save to explicitly copy RNGs.")
>     this(this) {}
> 
> Internally we'd need to change all copying to use std.algorithm.move.

Interesting thought.  We'd have to take care that phobos-side things don't get immediately "fixed" by people just using .save inside RandomCover and RandomSample constructors.

> If reference semantic is wanted people can use either
>     new Random(unpredictableSeed)
> or
>     refCounted(rnd)
>     refCounted(Random(unpredictableSeed))
> or (unsafe)
>     refRange(&rnd)
> .
> 
> This should also work for combined random generators like.
> 
> new Exponential(Random(unpredictableSeed))
> 
> refCounted(Exponential(Random(unpredictableSeed)))
> 
> auto rng = Exponential(Random(unpredictableSeed));
> refRange(&rng);

I guess what I don't like about this solution is that it requires the user to take responsibility for generating the RNG (or wrapper-of-RNG) as a reference type, when that ought to be the default behaviour.  But OTOH we could take advantage of the fact that RNG structs are templated, and rework their convenience aliases, e.g.:

    alias Mt19937 = RefCounted!(MersenneTwisterEngine!(uint, 32LU, 624LU, 397LU, 31LU, 2567483615u, 11LU, 7LU, 2636928640u, 15LU, 4022730752u, 18LU))

... which would still leave the underlying struct available for use by someone who really wants to use it directly in that form.

dlangBugzillaToGithub commented 9 years ago

code commented on 2015-03-02T22:40:58Z

(In reply to Joseph Rushton Wakeling from comment #20)
> I guess what I don't like about this solution is that it requires the user
> to take responsibility for generating the RNG (or wrapper-of-RNG) as a
> reference type, when that ought to be the default behaviour.  But OTOH we
> could take advantage of the fact that RNG structs are templated, and rework
> their convenience aliases, e.g.:
> 
>     alias Mt19937 = RefCounted!(MersenneTwisterEngine!(uint, 32LU, 624LU,
> 397LU, 31LU, 2567483615u, 11LU, 7LU, 2636928640u, 15LU, 4022730752u, 18LU))
> 
> ... which would still leave the underlying struct available for use by
> someone who really wants to use it directly in that form.

I don't think that sharing rng state among multiple consumers is a good idea.
Of course it looks like you want random numbers and shouldn't care about the order, but often fixed seeds are used to get a reproducible pseudo-random range.
When you share such a RNG as in
    auto rng = Random(fixedSeed);
    assert(!equal(randomCover(a, rng), randomCover(a, rng)));
the sequences depends on the implementation of equal and randomCover.
Would be cleaner to use 2 RNGs with independent state here, so the fact that a  shared state requires to be explicit is a good thing. Also we'd set a bad precedent by making one of std's ranges a ref type. Ref ranges have a lot of subtleties and aren't that well tested with std.algorithm.

dlangBugzillaToGithub commented 9 years ago

jens.k.mueller commented on 2015-03-03T15:24:32Z

We should try out Martin's idea.
I'll do it but I'd like to write some tests first. Joseph you mentioned several "suprizes" with the current design. I'd like to create some tests for those. Can you give some code for the flaws in the design?

I'm also looking at http://en.cppreference.com/w/cpp/numeric/random
But adding more generators (including abstractions for non-deterministic) and distributions can be done after this issue has been addressed.

dlangBugzillaToGithub commented 9 years ago

joseph.wakeling commented on 2015-03-12T21:48:43Z

@Martin @Jens: sorry for radio silence on this.  It's a busy period, and I recently moved to a new apartment where I still don't have home internet.

> I don't think that sharing rng state among multiple consumers is a good idea.
> Of course it looks like you want random numbers and shouldn't care about the
> order, but often fixed seeds are used to get a reproducible pseudo-random
> range.
>
> When you share such a RNG as in
>    auto rng = Random(fixedSeed);
>    assert(!equal(randomCover(a, rng), randomCover(a, rng)));
> the sequences depends on the implementation of equal and randomCover.

Yes, a case like this obviously creates complications.  But I don't think subtleties like this should prevent a user from instantiating one RNG instance and using it with multiple consumers.  The "default" case (where no RNG instance is passed) already employs a common RNG instance between different consumers; I'd rather have consistency of behaviour whether an RNG is explicitly provided or not.

Of course, we can advise associating different RNG instances with different consumers (or making sure that a consumer does its work before associating an RNG with a new consumer), but that's different from compelling the user to follow this pattern.

Note also that one-RNG-per-consumer really doesn't scale to e.g. a simulation where you are generating thousands or millions of random samples.  The risk of the different RNG seeds generating correlated sequences seems greater than the risks associated with using one RNG underlying all sample instances.

> Also we'd set a bad precedent by making one of std's ranges a ref type. Ref ranges have a lot of subtleties and aren't that well tested with std.algorithm.

I agree there are a lot of subtleties, but my feeling is that we need to embrace and explore those in order to identify what the best way forward is for RNGs.

dlangBugzillaToGithub / migration_test