dart-lang / core

This repository is home to core Dart packages.
https://pub.dev/publishers/dart.dev
BSD 3-Clause "New" or "Revised" License
11 stars 2 forks source link

Weighted shuffle #608

Open SteveAlexander opened 4 years ago

SteveAlexander commented 4 years ago

I just implemented a weighted shuffle for a project I'm working on.

A weighted shuffle takes a sequence of items, and numeric weights for each item, and returns a randomly shuffled result sequence. Over a large number of iterations, the ordering of items in the results reflects the weights.

So weightedShuffle(['a', 'b', 'c'], [20, 10, 7]) will return ['a', 'b', 'c'] most of the time, but other orderings sometimes based on the weights given. In this case, over 37,000 runs, we'd expect to see 'a' in the first position about 20,000 times, 'b' in that position 10,000 times, and 'c' there around 7000 times.

https://gist.github.com/SteveAlexander/cb4484a7aedad7f1c33faa7be755bf76

It's a version of weighted random sampling where the size of the sample is always equal to the size of the pool, using the algorithm described here: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

And the Dart code is a port to Dart of jbochi's code at https://softwareengineering.stackexchange.com/a/344274

I didn't find an implementation already available in Dart.

Searching for existing algorithms for this, I came across a few instances of people looking for how to implement this, with often quite bad (incorrect / inefficient) algorithms proposed. e.g.

https://stackoverflow.com/questions/23971365/weighted-randomized-ordering https://softwareengineering.stackexchange.com/questions/233541/how-to-implement-a-weighted-shuffle https://stackoverflow.com/questions/28892020/randomly-shuffle-a-weighted-array

... and more.

So, I think it would be helpful to offer this in a well-known place.

If there's interest in including this functionality in this package, I'll work on a PR.

natebosch commented 4 years ago

This feels a little to specialized to me. I'm curious what @lrhn thinks.

lrhn commented 4 years ago

Being a little specialized is fine for package:collection. The package is there to make things available that some amount of people will eventually need, as opposed to the SDK where we assume that either everybody will eventually need the functionality, or some people will need it a lot. Being too specialized is bad, because then it's probably not going to be reused anyway, even by people with a roughly similar problem.

A weighted shuffle does makes some sense to me. I guess it's equivalent to choosing the first element at random with probability based on the weights, then choosing the next one from the rest, etc. This generalizes to a weighted subset or a single element as well. We'd probably want the selectWeightedSingle as a separate method, since it doesn't have to return an iterable. The full shuffle would just be making the count optional, an defaulting to every element (but that would make it impossible to make the [Random] argument optional independently of that).

The next question is whether there are efficient (enough) algorithms. I can see a quadratic one (every pick does random.nextInt(totalWeight) of the remaining elements, then runs through them linearly to see which one that number points to.) It might be possible to have a helper data-structure, and go to n×log(n), at the cost of much a higher constant factor.

As for API, either passing in two lists (one with elements, one with weights) or a single map from key to weight, can both work. I like the Map one, but the implementation is most likely going to copy the elements and keys into lists internally anyway. It's going to copy the original lists too, so that's not a big difference. The weight can be either integers or nums. Using integers are safer, they just work better with random.

sachaarbonel commented 3 years ago

Can be useful for knowledge/trivia-style quizzes where the weights could be a score correlated to the average accuracy (the less accurate you are on a question the higher the chance you'll get this question first next time)