FS-1135 - Random functions for collections

Lanayx commented 1 year ago

Click “Files changed” → “⋯” → “View file” for the rendered RFC (i.e.: here).

Discussion: in this thread

T-Gro commented 1 year ago

The APIs which can fail on non-null values ( choice,choices,sample with collections that are emty/too small) -> I would favour having a try* variant of them, to avoid developers doing their own wrappers to prevent exceptions.
- A try- alternative is in line with existing collections functions
(personally, this can also wait for an impl PR, but for me it is an important detail): How do you plan to work with "Shared thread-safe Random instance ". Locking of a single global System.Random instance for the duration of entire operation? Or having multiple instances, either per-thread or in a pool?
Since games and simulations were mentioned, it would be good to thing of high perf scenarios:
- Just like it exists with sorting, could the Array.shuffle have an accompanying Array.shuffleInPlace to reduce copies?
- Especially with a more secure random generator, it could be handy to pre-generate the random indexes for shuffle in parallel (to live in Array.Parallel module)
The "with Random parameter" variants of the API : Does it only allow System.Random, or is https://learn.microsoft.com/en-us/dotnet/api/system.security.cryptography.randomnumbergenerator?view=net-7.0 considered? (an abstraction covering both regular System.Random as well as RandonNumberGenerator could be requiring a random generating function as an argument).

Lanayx commented 1 year ago

@T-Gro Hi, thank you for the review 1) Not all of collection functions have try variant, only the ones where it makes sense. For example: max, average, reduce - just throw and that is ok for me. If function implies that collection is not empty, it is ok to throw, and choice,choices,sample imply that. 2) The second option, you can see my implementation here https://github.com/Lanayx/fsharp/blob/master/src/FSharp.Core/Random.fs 3) Agree, inplace shuffling can be useful, I'll add that 4) System.Random is not sealed and it doesn't implement RandomNumberGenerator, so if someone wants use the latter he can inherit System.Random and implement virtual methods (BTW System.Random.Shared does exactly this)

dsyme commented 1 year ago

I'm concerned about naming being confusing when colliding with existing functions, esp. List.choices v. List.choose. Perhaps
- List.randomShuffle
- List.randomShuffleWith
- List.randomChoices
- List.randomChoicesWith
- List.randomSample
- List.randomSampleWith
It feels like the use of these functions is rare enough that adding "random" does help clarity here without too much cost.
Please list naming of equivalent python functions
Can we get any stats on use of equivalent python functions
I'm concerned there's no weighted choice option (as mentioned in the unresolved questions)
I'm concerned there are no functions to choose a single element.
I'm bit concerned by having both "randomChoices" and "randomSample" without the naming aligning, but let's first take a look at the Python naming, also perhaps other languages.

dsyme commented 1 year ago

p.s. great to see progress on this

Lanayx commented 1 year ago

@dsyme thank you for the comments 1) I don't like the random prefix, it makes me feel like those functions shouldn't belong there at all, while for example List.shuffle feels nice like it has always been there. If you are strongly against extending List module - we can consider creating a separate module Random (which will be static class), so the code will look like

[1; 2; 3] |> Random.shuffle
[|1; 2; 3|] |> Random.shuffle

This will allow us to use overloads, but prevent from using partial application. On one hand it is ok, on the other - it is not what the other collection functions look like, so it's not that consistent.

2) Passing Random seems to fit better than function, since I think we should make it easy to pass new Random.Shared around. Still using shuffleWith instead of shuffleRand feels good with me. 3) Idk about python functions usage, but I'm sure that every machine learning program should include random shuffling of the data between several epochs.

4) Python: shuffle, choice, choices, sample (currently all names are taken from there). Java, Scala: shuffle only. Rust: choose, choose_multiple, choose_weighted, shuffle, partial_shuffle, choose_multiple_fill, choose_multiple_weighted. underscorejs: shuffle, sample

5) choice is the function to get single element

6) As for weights and counts parameters - my feeling is that they are more specific to data science, so if somebody needs them - they can contribute later, so the the functions will be for example List.choiceWeights and List.sampleCounts

7) I understand concern about choose and choice wording, but that's a minor issue for me. There are much bigger intersections in F# with .Net base types, like Array or String being a module and a class, so functions are intersected with .net methods. Or having ...Async methods in different libraries returning Task and not Async etc

vzarytovskii commented 1 year ago

This looks fine to me overall.

I don't think we need a separate Random module for these functions, it will be a bit confusing on how it's related to collections.

As for naming - I don't have a strong opinion on that, since I don't have much experience in using those functions, but I share @dsyme's concern about naming - once they're in, there's no way for us changing or removing them in case they're confusing or lack some functionality.

I personally think, that the safest way is to align naming and signatures to what existing functions in .NET and Python have.

Lanayx commented 1 year ago

@dsyme I created a voting in the discussion and it seems that people like the idea of prefix, so I changed all the names in PR

Lanayx commented 1 year ago

Due to the comment I've changed prefixes to submodule, since such design is already implemented in Array.Parallel

dsyme commented 1 year ago

@Lanayx Sorry for the churn, but I don't think the module is the right way to go - could you revert to the prefix please?

Thanks!

abelbraaksma commented 2 months ago

After we've resolved the unresolved question, I'll give this RFC one more quick review and then I think I can merge it for you.

abelbraaksma commented 1 month ago

@dsyme and @vzarytovskii, this proposal is ready. If you can give it your thumbs up, I can merge it (or you can ;) ).

abelbraaksma commented 1 month ago

/cc reminder for tomorrow: @vzarytovskii

dsyme commented 1 month ago

Merged! Thank you everyone for your hard work on this

KevinRansom commented 1 month ago

@dsyme -- please consider the revoking the approval for this suggestion.

FSharp.Core is already too big, and so far all attempts to refactor it to make it smaller have failed. Useful helper functions such as these belong in an external library where they can be selected by a developer who is working on scenarios where this type of functionality is useful.

We currently ship fsharp.core as a nuget package embedded within the dotnet sdk:

The current FSharp.Core.dll is currently 3MB and it looks like when nullability is supported it will bloat by perhaps an additional .5 MB, because we will have to duplicate the metadata stream for nullable aware compilers and legacy compilers. From time to time we get pressure from the SDK team to become an optional install feature because of our size and relatively small user base. I think whenever they hear a customer complaining that VMs for the dotnet SDK have to be larger than for other cloud platforms and languages.

In general we have only added APIs to FSharp.Core when they were widely applicable and supported an idiomatic programming style, or interop with C# assemblies of course some that don't match these criteria have slipped in.

Sure these APIs are small and compact, however, so are the many other APIs we could add, these suggested APIs deserve to live in a support library rather than FSharp.Core, someone should work on it. Because we have many gaps with the Numpy data library and FSharp.Core is not the place to address them.

smoothdeveloper commented 1 month ago

FSharp.Stats?

get pressure from the SDK team to become an optional install feature because of our size and relatively small user base

It is not because F# is "small user base" (and that this line of thinking can lead to eradication of it) that it shouldn't be part of the SDK, unless we must make the CLR a C# only thing OR the SDK would come with no compilers, just the assemblies, msbuild stuff and dotnet tool. This also would save space in the SDK...

In anycase, I concur with your analysis, and also increasing the API surface for the compiler team, however useful, there are several packages that already extend FSharp.Core (https://fsprojects.github.io/FSharp.Collections.ParallelSeq/).

We should keep the fslang-design process though for those extensions though.

Lanayx commented 1 month ago

Because we have many gaps with the Numpy data library and FSharp.Core is not the place to address them.

Random functions are not part of Numpy, they are part of Python standard library, since they are very basic and widely applicable and were really "missing" in F# since inception https://docs.python.org/3/library/random.html

krauthaufen commented 1 month ago

I too think that bloated libraries are hard to maintain and refactor since i have experience with it sadly. Maybe this would be a good starting point for a second nuget package? Could live in this repo and just produce something called FSharp.Collections maybe? I personally think that randomozing collections is somewhat oddly specific for a standard library. Just my thoughts

Lanayx commented 1 month ago

bloated libraries are hard to maintain and refactor

Fully agree. However we should differentiate that from absence of the very basic and generally applicable functionality.

Maybe this would be a good starting point for a second nuget package?

I don't think so

I personally think that randomozing collections is somewhat oddly specific for a standard library.

The presence of random functions in Python standard library is one of the many things that lowered the barrier of ML entry for newcomers and it very positively influenced language popularity, we should follow the success path here rather than avoiding it.

dsyme commented 1 month ago

Ultimately the team looking after the dotnet/fsharp repo do really have a right of veto on this.

@KevinRansom I'd request that we get those concerns raised much, much earlier in the design process. e.g. at the suggestion stage, or minimally at the RFC-discussion stage. This RFC PR was open 12 months.

@Lanayx Given @KevinRansom's concerns could we get a measure of the size delta?

@Lanayx I would favour dropping the With variants if it helps with size concerns. I'm ambivalent about those anyway. @Lanayx is there any technical reason to include them, e.g. is passing rnd.Next to the By variant equivalent?

Lanayx commented 1 month ago

@dsyme As for size question, here is measurements Current main: 3261 KB With random changes: 3293 KB So the change is 32KB, which is 1% size change. As for dropping .With method I tend to disagree, since it's the fastest option, 2.5 times faster than .By option

module TestData =
    let arr = Array.init 1000 id

[<MemoryDiagnoser>]
type Rand() =

    [<Benchmark(Baseline = true)>]
    member _.Bcl () =
        let newArr = Array.copy TestData.arr
        Random.Shared.Shuffle newArr
        newArr

    [<Benchmark>]
    member _.RandomWith () =
        TestData.arr |> Array.randomShuffleWith Random.Shared

    [<Benchmark>]
    member _.RandomBy () =
        TestData.arr |> Array.randomShuffleBy Random.Shared.NextDouble

| Method     | Mean      | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|----------- |----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| Bcl        |  4.229 us | 0.0650 us | 0.0608 us |  1.00 |    0.00 | 0.4730 |   3.93 KB |        1.00 |
| RandomWith |  3.743 us | 0.0354 us | 0.0331 us |  0.89 |    0.01 | 0.4768 |   3.93 KB |        1.00 |
| RandomBy   | 10.185 us | 0.0871 us | 0.0728 us |  2.41 |    0.04 | 0.4730 |   3.95 KB |        1.01 |

If 1% increase is indeed too much and I had to drop anything from API, I'd drop .By version of functions or Seq module functions (since all those functions internally do caching to the array as a first operation)

brianrourkeboll commented 1 month ago

@Lanayx If I had to guess, the HOF version being slower might be due to the randomizer not being inlined or else the branching when checking whether the value is between 0 and 1. The latter might be harder to address, I guess, although I haven't put thought into it.

Lanayx commented 1 month ago

@brianrourkeboll It will be slower anyway, because of checking value range and because of doing extra calculations of converting float 0..1 value to int min..max value

brianrourkeboll commented 1 month ago

@Lanayx Hmm, it seems to be possible to get randomShuffleBy much closer by making a few tweaks:

Benchmark source

| Method | Mean     | Error     | StdDev    | Ratio | RatioSD |
|------- |---------:|----------:|----------:|------:|--------:|
| Bcl    | 3.386 us | 0.0312 us | 0.0276 us |  1.00 |    0.00 |
| Pr     | 6.721 us | 0.0795 us | 0.0705 us |  1.99 |    0.03 |
| Faster | 3.881 us | 0.0290 us | 0.0271 us |  1.15 |    0.01 |

The lambda does seem to be required[^1] if you want full devirtualization, though. That is,

Faster.randomShuffleBy Random.Shared.NextDouble TestData.arr

is slightly slower (~2 μs on my machine) than

Faster.randomShuffleBy (fun () -> Random.Shared.NextDouble ()) TestData.arr

That's probably because the JIT can devirtualize the call to NextDouble in the (inlined) lambda case but not in the other.

So if a user really cares about maximum performance, they can:

Use Array.randomShuffle or the BCL Shuffle.
Use Array.randomShuffleBy with a lambda.
Bind Random.Shared to a local variable and pass in localRandom.NextDouble.[^1]

[^1]: If you store the Random in a local value, you don't need the lambda:

```fsharp
// Just as fast as (fun () -> Random.Shared.NextDouble ())
[<Benchmark>]
member _.Faster () =
    let random = Random.Shared
    Faster.randomShuffleBy random.NextDouble TestData.arr
```

dsyme commented 1 month ago

Interesting thanks! BTW what's the cost of the locking involved in Random.Shared? Just out of curiosity really.

I think from this I still approve the RFC, and my preference would still be to remove the With variants (@Lanayx is there any reason besides perf to include them?)

For me the deciding factor from a design perspective is that a large number of F# teaching scenarios become much simpler to teach if you just have List.randomShuffle and so on directly available, without having to explain System.Random or any .NET functionality (or the awful Span<T> overloads on System.Random.Shared.Shuffle - a new presence in the .NET world that is really problematic for teaching - imagine trying to explain to a beginner student why their Shuffle overload resolution fails!).

That's really one of the main things FSharp.Core is for: to present a coherent, teachable, usable, portable programming model that captures most common in-memory programming scenarios before moving on to advanced data structures or interop with system features or UX or tensors and so on. Doing basic random permutations has been a part of programming models like this ever since the days of Python and before. It is in Python for a reason, and should be in F# for the same reason. The addition of System.Random.Shared is a factor in this - it indicated that .NET has embraced the same principle. Therefore F# should embrace the principle too from its own perspective.

The owners of dotnet/fsharp still have a veto right on this one. Size is a factor affecting many scenarios that might not be obvious.

brianrourkeboll commented 1 month ago

@dsyme

BTW what's the cost of the locking involved in Random.Shared? Just out of curiosity really.

Do you mean System.Random.Shared in the BCL? If so, it uses an implementation with a ThreadStatic private field for thread-safety instead of explicit locking (that technique is also what lets it devirtualize and then inline calls to Next, etc.).

Compare: https://github.com/dotnet/fsharp/pull/17277#discussion_r1631486448 and https://github.com/dotnet/fsharp/pull/17277#discussion_r1631649968.

dsyme commented 1 month ago

Do you mean System.Random.Shared in the BCL? If so, it uses an implementation with a ThreadStatic private field for thread-safety instead of explicit locking (that technique is also what lets it devirtualize and then inline calls to Next, etc.).

Got it, thanks!

Lanayx commented 1 month ago

For me the deciding factor from a design perspective is that a large number of F# teaching scenarios become much simpler to teach if you just have List.randomShuffle and so on directly available, without having to explain System.Random or any .NET functionality

But List.randomShuffle and so on will be directly available, so this probably doesn't apply? Very early beginners will just be able to use simplest version of random methods while studying. But anyway, System.Random is so central to random operations in .NET that even a beginner will have to learn it very soon. Also learners usually just copy documentation or samples or what ChatGPT tells them.

I think from this I still approve the RFC, and my preference would still be to remove the With variants (@Lanayx is there any reason besides perf to include them?)

From my hands-on perspective ideal F# API that is based on C# experience should the following properties: 1) Simpler or easier to use than C# API 2) Same (or better) performance that C# API 3) Same (or better) functionality set

So I look at different options with regard to those properties:

Simple versions (randomShuffle etc) 1. + 2. A bit worse than C# 3. Random characteristics will be worse than with Random.Shared
.By version (randomShuffleBy etc) 1. A bit harder than C# (passing delegate is harder to comprehend than passing object) 2. Worse than C# (without hacks, with hacks still a bit worse) 3. +
.With version (randomShuffleWith etc) 1. + 2. + 3. +

Based on above I would use the .With option in my work project (with Random.Shared) it was available. If it's not available, I'll have to add one more method to my own collection of missing language functions (near isNotNull) in all projects that need it. Also, if it's not available, what will we recommend to use by default in production? Will we tell to use this snippet:

Array.randomShuffleBy (fun () -> Random.Shared.NextDouble ()) TestData.arr

This would raise a thought in my mind - "They had a chance to make it nice after 20 years, but failed it".

Ultimately the .With option would be not needed it Fsharp.Core targeted .NET 8, because the implementation could just be based on Random.Shared (which is by far the main case for .With option). But when (or if) this transition happens is unknown to me, so I'd keep it. Additional benefit is that there are many alternative implementations (like here) that all inherit System.Random and user can easily leverage one of them using .With versions without sacrificing performance.

krauthaufen commented 1 month ago

Just a thought: How exactly is randomShuffleBy any different from sortBy?

Signature-wise it seems to be identical and also the semantics seem to be the same.

Lanayx commented 1 month ago

Just a thought: How exactly is randomShuffleBy any different from sortBy?

Shuffle algorithmic complexity is O(N)

krauthaufen commented 1 month ago

So the generated double is actually interpreted as an index for classic swap-operations like this?

If so, i think it's very odd (what happens when i return NaN or simply 3.0, etc)

Lanayx commented 1 month ago

So the generated double is actually interpreted as an index for classic swap-operations like this?

Right

If so, i think it's very odd (what happens when i return NaN or simply 3.0, etc)

It's covered in the RFC, the ArgumentOutOfRange exception will be thrown BTW, it's a good case with NaN, i've missed it in the implementation, will need to fix this case

fsharp / fslang-design

FS-1135 - Random functions for collections #732