Closed Lanayx closed 1 month ago
The APIs which can fail on non-null values (choice, choices, sample with collections that are empty/too small) -> I would favour having a try* variant of them, to avoid developers writing their own wrappers to prevent exceptions.
(personally, this can also wait for an impl PR, but for me it is an important detail): How do you plan to work with "Shared thread-safe Random instance ". Locking of a single global System.Random instance for the duration of entire operation? Or having multiple instances, either per-thread or in a pool?
Since games and simulations were mentioned, it would be good to think of high perf scenarios:
The "with Random parameter" variants of the API: Does it only allow System.Random, or is https://learn.microsoft.com/en-us/dotnet/api/system.security.cryptography.randomnumbergenerator?view=net-7.0 considered? (An abstraction covering both regular System.Random and RandomNumberGenerator could require a random generating function as an argument.)
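The try* idea from the first point can be sketched as a thin wrapper. This is only an illustration: the name `tryRandomChoice` is hypothetical and not part of the RFC, and the sketch calls `System.Random.Shared` (.NET 6+) directly rather than any proposed FSharp.Core function.

```fsharp
open System

/// Hypothetical helper (not part of the RFC): returns None for an empty
/// array instead of throwing, otherwise a randomly chosen element.
let tryRandomChoice (source: 'T[]) : 'T option =
    if source.Length = 0 then
        None
    else
        Some(source.[Random.Shared.Next(source.Length)])

// Usage: the empty case becomes a value the caller has to handle.
let picked = tryRandomChoice [| "a"; "b"; "c" |]
let nothing = tryRandomChoice ([||]: string[])
```

The trade-off is the one raised in the review: callers get an option to match on instead of an exception, at the cost of one more function per operation.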
@T-Gro Hi, thank you for the review
1) Not all collection functions have a try variant, only the ones where it makes sense. For example, max, average, reduce just throw, and that is OK for me. If a function implies that the collection is not empty, it is OK to throw, and choice, choices, sample imply that.
2) The second option, you can see my implementation here https://github.com/Lanayx/fsharp/blob/master/src/FSharp.Core/Random.fs
3) Agree, in-place shuffling can be useful, I'll add that
4) System.Random is not sealed and it doesn't implement RandomNumberGenerator, so if someone wants to use the latter they can inherit System.Random and implement the virtual methods (BTW System.Random.Shared does exactly this)
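A minimal sketch of the inheritance approach from point 4, assuming .NET 6+ for the static `RandomNumberGenerator.GetBytes`. The type name `CryptoRandom` is hypothetical, and only a few virtual methods are overridden; the remaining members (for example `NextBytes` and `NextInt64`) stay on the base implementation here.

```fsharp
open System
open System.Security.Cryptography

/// Hypothetical adapter (name is an assumption): a System.Random subclass
/// whose values come from RandomNumberGenerator.
type CryptoRandom() =
    inherit Random()

    // Integer in [minValue, maxValue).
    override _.Next(minValue: int, maxValue: int) : int =
        // GetInt32 rejects an empty range, while Random.Next(n, n) returns n.
        if minValue = maxValue then minValue
        else RandomNumberGenerator.GetInt32(minValue, maxValue)

    override this.Next(maxValue: int) : int = this.Next(0, maxValue)

    override this.Next() : int = this.Next(0, Int32.MaxValue)

    // Float in [0.0, 1.0), built from 53 random bits.
    override _.NextDouble() : float =
        let bytes = RandomNumberGenerator.GetBytes(8)
        let bits = BitConverter.ToUInt64(bytes, 0) >>> 11
        float bits / float (1UL <<< 53)

// Usage: anything that accepts a System.Random can take the adapter.
let rng : Random = CryptoRandom()
let die = rng.Next(1, 7)
```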
I'm concerned about naming being confusing when colliding with existing functions, esp. `List.choices` v. `List.choose`. Perhaps
It feels like the use of these functions is rare enough that adding "random" does help clarity here without too much cost.
Please list the naming of equivalent Python functions
Can we get any stats on use of equivalent Python functions?
I'm concerned there's no weighted choice option (as mentioned in the unresolved questions)
I'm concerned there are no functions to choose a single element.
I'm a bit concerned by having both "randomChoices" and "randomSample" without the naming aligning, but let's first take a look at the Python naming, also perhaps other languages.
p.s. great to see progress on this
@dsyme thank you for the comments
1) I don't like the random prefix, it makes me feel like those functions shouldn't belong there at all, while for example `List.shuffle` feels nice like it has always been there. If you are strongly against extending List module - we can consider creating a separate module Random (which will be static class), so the code will look like
[1; 2; 3] |> Random.shuffle
[|1; 2; 3|] |> Random.shuffle
This will allow us to use overloads, but prevent partial application. On one hand it is OK, on the other - it is not what the other collection functions look like, so it's not that consistent.
2) Passing Random seems to fit better than a function, since I think we should make it easy to pass the new Random.Shared around. Still, using `shuffleWith` instead of `shuffleRand` feels good to me.
3) Idk about python functions usage, but I'm sure that every machine learning program should include random shuffling of the data between several epochs.
4) Python: `shuffle`, `choice`, `choices`, `sample` (currently all names are taken from there). Java, Scala: `shuffle` only. Rust: `choose`, `choose_multiple`, `choose_weighted`, `shuffle`, `partial_shuffle`, `choose_multiple_fill`, `choose_multiple_weighted`. underscorejs: `shuffle`, `sample`
5) `choice` is the function to get single element
6) As for weights and counts parameters - my feeling is that they are more specific to data science, so if somebody needs them - they can contribute later, so the functions will be for example `List.choiceWeights` and `List.sampleCounts`
7) I understand the concern about choose and choice wording, but that's a minor issue for me. There are much bigger intersections in F# with .NET base types, like `Array` or `String` being a module and a class, so functions are intersected with .NET methods. Or having ...Async methods in different libraries returning `Task` and not `Async` etc
This looks fine to me overall.
I don't think we need a separate `Random` module for these functions, it will be a bit confusing on how it's related to collections.
As for naming - I don't have a strong opinion on that, since I don't have much experience in using those functions, but I share @dsyme's concern about naming - once they're in, there's no way for us to change or remove them in case they're confusing or lack some functionality.
I personally think, that the safest way is to align naming and signatures to what existing functions in .NET and Python have.
@dsyme I created a poll in the discussion and it seems that people like the idea of a prefix, so I changed all the names in the PR
Due to the comment I've changed prefixes to submodule, since such design is already implemented in Array.Parallel
@Lanayx Sorry for the churn, but I don't think the module is the right way to go - could you revert to the prefix please?
Thanks!
After we've resolved the unresolved question, I'll give this RFC one more quick review and then I think I can merge it for you.
@dsyme and @vzarytovskii, this proposal is ready. If you can give it your thumbs up, I can merge it (or you can ;) ).
/cc reminder for tomorrow: @vzarytovskii
Merged! Thank you everyone for your hard work on this
@dsyme -- please consider revoking the approval for this suggestion.
FSharp.Core is already too big, and so far all attempts to refactor it to make it smaller have failed. Useful helper functions such as these belong in an external library where they can be selected by a developer who is working on scenarios where this type of functionality is useful.
We currently ship fsharp.core as a nuget package embedded within the dotnet sdk:
In general we have only added APIs to FSharp.Core when they were widely applicable and supported an idiomatic programming style, or interop with C# assemblies. Of course, some that don't match these criteria have slipped in.
Sure, these APIs are small and compact; however, so are the many other APIs we could add. These suggested APIs deserve to live in a support library rather than FSharp.Core, and someone should work on it. Because we have many gaps with the Numpy data library and FSharp.Core is not the place to address them.
FSharp.Stats?
get pressure from the SDK team to become an optional install feature because of our size and relatively small user base
That F# has a "small user base" is not a reason for it to stop being part of the SDK (and this line of thinking can lead to its eradication), unless we want to make the CLR a C#-only thing OR ship the SDK with no compilers at all, just the assemblies, MSBuild stuff and the dotnet tool. That would also save space in the SDK...
In any case, I concur with your analysis, including the point about increasing the API surface for the compiler team. However useful these functions are, there are several packages that already extend FSharp.Core (https://fsprojects.github.io/FSharp.Collections.ParallelSeq/).
We should keep the fslang-design process for those extensions, though.
Because we have many gaps with the Numpy data library and FSharp.Core is not the place to address them.
Random functions are not part of Numpy, they are part of Python standard library, since they are very basic and widely applicable and were really "missing" in F# since inception https://docs.python.org/3/library/random.html
I too think that bloated libraries are hard to maintain and refactor, since I have experience with it, sadly. Maybe this would be a good starting point for a second nuget package? Could live in this repo and just produce something called FSharp.Collections maybe? I personally think that randomizing collections is somewhat oddly specific for a standard library. Just my thoughts
bloated libraries are hard to maintain and refactor
Fully agree. However we should differentiate that from absence of the very basic and generally applicable functionality.
Maybe this would be a good starting point for a second nuget package?
I don't think so
I personally think that randomizing collections is somewhat oddly specific for a standard library.
The presence of random functions in Python standard library is one of the many things that lowered the barrier of ML entry for newcomers and it very positively influenced language popularity, we should follow the success path here rather than avoiding it.
Ultimately the team looking after the `dotnet/fsharp` repo do really have a right of veto on this.
@KevinRansom I'd request that we get those concerns raised much, much earlier in the design process. e.g. at the suggestion stage, or minimally at the RFC-discussion stage. This RFC PR was open 12 months.
@Lanayx Given @KevinRansom's concerns could we get a measure of the size delta?
@Lanayx I would favour dropping the `With` variants if it helps with size concerns. I'm ambivalent about those anyway. @Lanayx is there any technical reason to include them, e.g. is passing `rnd.Next` to the `By` variant equivalent?
@dsyme As for the size question, here are the measurements
Current main: 3261 KB
With random changes: 3293 KB
So the change is 32 KB, which is a 1% size change. As for dropping the `.With` method, I tend to disagree, since it's the fastest option, 2.5 times faster than the `.By` option
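```fsharp
open System
open BenchmarkDotNet.Attributes

module TestData =
    let arr = Array.init 1000 id

[<MemoryDiagnoser>]
type Rand() =
    // Baseline: BCL in-place shuffle on a copy of the input.
    [<Benchmark(Baseline = true)>]
    member _.Bcl () =
        let newArr = Array.copy TestData.arr
        Random.Shared.Shuffle newArr
        newArr

    // Variant taking a System.Random instance.
    [<Benchmark>]
    member _.RandomWith () =
        TestData.arr |> Array.randomShuffleWith Random.Shared

    // Variant taking a unit -> float randomizer function.
    [<Benchmark>]
    member _.RandomBy () =
        TestData.arr |> Array.randomShuffleBy Random.Shared.NextDouble
```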
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|----------- |----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| Bcl | 4.229 us | 0.0650 us | 0.0608 us | 1.00 | 0.00 | 0.4730 | 3.93 KB | 1.00 |
| RandomWith | 3.743 us | 0.0354 us | 0.0331 us | 0.89 | 0.01 | 0.4768 | 3.93 KB | 1.00 |
| RandomBy | 10.185 us | 0.0871 us | 0.0728 us | 2.41 | 0.04 | 0.4730 | 3.95 KB | 1.01 |
If a 1% increase is indeed too much and I had to drop anything from the API, I'd drop the `.By` version of the functions or the `Seq` module functions (since all those functions internally do caching to the array as a first operation)
@Lanayx If I had to guess, the HOF version being slower might be due to the `randomizer` not being inlined or else the branching when checking whether the value is between 0 and 1. The latter might be harder to address, I guess, although I haven't put thought into it.
@brianrourkeboll It will be slower anyway, because of checking value range and because of doing extra calculations of converting float 0..1 value to int min..max value
@Lanayx Hmm, it seems to be possible to get `randomShuffleBy` much closer by making a few tweaks:
| Method | Mean | Error | StdDev | Ratio | RatioSD |
|------- |---------:|----------:|----------:|------:|--------:|
| Bcl | 3.386 us | 0.0312 us | 0.0276 us | 1.00 | 0.00 |
| Pr | 6.721 us | 0.0795 us | 0.0705 us | 1.99 | 0.03 |
| Faster | 3.881 us | 0.0290 us | 0.0271 us | 1.15 | 0.01 |
The lambda does seem to be required[^1] if you want full devirtualization, though. That is,
Faster.randomShuffleBy Random.Shared.NextDouble TestData.arr
is slightly slower (~2 μs on my machine) than
Faster.randomShuffleBy (fun () -> Random.Shared.NextDouble ()) TestData.arr
That's probably because the JIT can devirtualize the call to `NextDouble` in the (inlined) lambda case but not in the other.
So if a user really cares about maximum performance, they can:
- Use `Array.randomShuffle` or the BCL `Shuffle`.
- Use `Array.randomShuffleBy` with a lambda.
- Assign `Random.Shared` to a local variable and pass in `localRandom.NextDouble`.[^1]

[^1]: If you store the `Random` in a local value, you don't need the lambda:
```fsharp
// Just as fast as (fun () -> Random.Shared.NextDouble ())
[<Benchmark>]
member _.Faster () =
let random = Random.Shared
Faster.randomShuffleBy random.NextDouble TestData.arr
```
Interesting thanks! BTW what's the cost of the locking involved in Random.Shared? Just out of curiosity really.
I think from this I still approve the RFC, and my preference would still be to remove the With variants (@Lanayx is there any reason besides perf to include them?)
For me the deciding factor from a design perspective is that a large number of F# teaching scenarios become much simpler to teach if you just have `List.randomShuffle` and so on directly available, without having to explain System.Random or any .NET functionality (or the awful `Span<T>` overloads on `System.Random.Shared.Shuffle` - a new presence in the .NET world that is really problematic for teaching - imagine trying to explain to a beginner student why their Shuffle overload resolution fails!).
That's really one of the main things FSharp.Core is for: to present a coherent, teachable, usable, portable programming model that captures most common in-memory programming scenarios before moving on to advanced data structures or interop with system features or UX or tensors and so on. Doing basic random permutations has been a part of programming models like this ever since the days of Python and before. It is in Python for a reason, and should be in F# for the same reason. The addition of System.Random.Shared is a factor in this - it indicated that .NET has embraced the same principle. Therefore F# should embrace the principle too from its own perspective.
The owners of dotnet/fsharp still have a veto right on this one. Size is a factor affecting many scenarios that might not be obvious.
@dsyme
BTW what's the cost of the locking involved in Random.Shared? Just out of curiosity really.
Do you mean `System.Random.Shared` in the BCL? If so, it uses an implementation with a `ThreadStatic` private field for thread-safety instead of explicit locking (that technique is also what lets it devirtualize and then inline calls to `Next`, etc.).
Compare: https://github.com/dotnet/fsharp/pull/17277#discussion_r1631486448 and https://github.com/dotnet/fsharp/pull/17277#discussion_r1631649968.
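The thread-static technique described above can be sketched roughly as follows. This is an illustration of the pattern only, not the BCL source; the type name `PerThreadRandom` and its members are hypothetical.

```fsharp
open System
open System.Threading

/// Hypothetical illustration of the pattern (not the BCL implementation):
/// each thread lazily creates and then reuses its own Random, so no lock is needed.
type PerThreadRandom private () =
    [<ThreadStatic; DefaultValue>]
    static val mutable private instance: Random

    static member Current : Random =
        // The field starts out null on every thread that has not touched it yet.
        if isNull PerThreadRandom.instance then
            PerThreadRandom.instance <- Random()
        PerThreadRandom.instance

// Usage: the same thread always sees the same instance, another thread sees its own.
let onThisThread = PerThreadRandom.Current
let fromWorker = ref (null: Random)
let worker = Thread(ThreadStart(fun () -> fromWorker.Value <- PerThreadRandom.Current))
worker.Start()
worker.Join()
```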
Do you mean System.Random.Shared in the BCL? If so, it uses an implementation with a ThreadStatic private field for thread-safety instead of explicit locking (that technique is also what lets it devirtualize and then inline calls to Next, etc.).
Got it, thanks!
For me the deciding factor from a design perspective is that a large number of F# teaching scenarios become much simpler to teach if you just have List.randomShuffle and so on directly available, without having to explain System.Random or any .NET functionality
But `List.randomShuffle` and so on will be directly available, so this probably doesn't apply? Very early beginners will just be able to use simplest version of random methods while studying. But anyway, `System.Random` is so central to random operations in .NET that even a beginner will have to learn it very soon. Also learners usually just copy documentation or samples or what ChatGPT tells them.
I think from this I still approve the RFC, and my preference would still be to remove the With variants (@Lanayx is there any reason besides perf to include them?)
From my hands-on perspective, an ideal F# API that is based on C# experience should have the following properties: 1) Simpler or easier to use than the C# API 2) Same (or better) performance than the C# API 3) Same (or better) functionality set
So I look at different options with regard to those properties:
Based on the above I would use the .With option in my work project (with Random.Shared) if it were available. If it's not available, I'll have to add one more method to my own collection of missing language functions (near `isNotNull`) in all projects that need it. Also, if it's not available, what will we recommend to use by default in production? Will we tell people to use this snippet:
Array.randomShuffleBy (fun () -> Random.Shared.NextDouble ()) TestData.arr
This would raise a thought in my mind - "They had a chance to make it nice after 20 years, but failed it".
Ultimately the .With option would not be needed if FSharp.Core targeted .NET 8, because the implementation could just be based on `Random.Shared` (which is by far the main case for the .With option). But when (or if) this transition happens is unknown to me, so I'd keep it. An additional benefit is that there are many alternative implementations (like here) that all inherit `System.Random`, and users can easily leverage one of them using the .With versions without sacrificing performance.
Just a thought: How exactly is randomShuffleBy any different from sortBy?
Signature-wise it seems to be identical and also the semantics seem to be the same.
Just a thought: How exactly is randomShuffleBy any different from sortBy?
Shuffle algorithmic complexity is O(N)
So the generated double is actually interpreted as an index for classic swap-operations like this?
If so, i think it's very odd (what happens when i return NaN or simply 3.0, etc)
So the generated double is actually interpreted as an index for classic swap-operations like this?
Right
If so, i think it's very odd (what happens when i return NaN or simply 3.0, etc)
It's covered in the RFC, the `ArgumentOutOfRange` exception will be thrown
BTW, it's a good case with NaN, i've missed it in the implementation, will need to fix this case
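The exchange above can be made concrete with a minimal sketch. It assumes a Fisher-Yates shuffle and uses hypothetical names (`nextIndex`, `shuffleBy`, `rejects`); it is not the PR's implementation. Each randomizer call yields a float that is scaled to a swap index, which is why the pass is O(N) rather than a sort, and writing the range check in negated form also rejects NaN.

```fsharp
open System

/// Hypothetical helper: scales a float from [0.0, 1.0) to an index in [0, maxExclusive).
let nextIndex (randomizer: unit -> float) (maxExclusive: int) : int =
    let value = randomizer ()
    // Negated range test: every comparison with NaN is false,
    // so NaN fails the inner test and is rejected as well.
    if not (value >= 0.0 && value < 1.0) then
        raise (ArgumentOutOfRangeException("randomizer", box value, "Expected a value in [0.0, 1.0)."))
    int (value * float maxExclusive)

/// Fisher-Yates shuffle on a copy: one randomizer call and one swap per position.
let shuffleBy (randomizer: unit -> float) (source: 'T[]) : 'T[] =
    let result = Array.copy source
    for i = result.Length - 1 downto 1 do
        let j = nextIndex randomizer (i + 1)
        let tmp = result.[i]
        result.[i] <- result.[j]
        result.[j] <- tmp
    result

// Usage: a well-behaved randomizer permutes the input.
let shuffled = shuffleBy (fun () -> Random.Shared.NextDouble()) [| 1 .. 10 |]

// Out-of-range values such as 3.0 or NaN raise instead of producing a bad index.
let rejects (value: float) : bool =
    try
        shuffleBy (fun () -> value) [| 1; 2; 3 |] |> ignore
        false
    with
    | :? ArgumentOutOfRangeException -> true
```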
Click “Files changed” → “⋯” → “View file” for the rendered RFC (i.e.: here).
Discussion: in this thread