fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/

[Feature Request] NaN safety, we probably need something more than doc strings. #311

Open smoothdeveloper opened 7 months ago

smoothdeveloper commented 7 months ago

In the context of machine learning, many optimization algorithms rightfully preclude the presence of NaN values.

A function's documentation may or may not mention whether it can return NaN, and how it processes NaN as input.

Alas, this is not described systematically, and people will just try functions left and right when they are doing exploratory feature engineering.

The first focus would be to make sure the library offers some batteries included for those who don't want to find out "too late" in the pipeline (pipelines take a long time to set up, adjust, run, troubleshoot, etc.).
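To make "finding out early" concrete, here is a minimal sketch of a fail-fast guard at the entry of a pipeline. `tryFindNaN` and `requireNoNaN` are hypothetical names for illustration; they are not part of FSharp.Stats.

```fsharp
open System

/// Hypothetical helper: index of the first NaN in the data, if any.
let tryFindNaN (data: seq<float>) : int option =
    data |> Seq.tryFindIndex (fun x -> Double.IsNaN x)

/// Hypothetical guard: throw at the pipeline entry instead of letting
/// a NaN flow into a long-running optimization step.
let requireNoNaN (name: string) (data: float[]) : float[] =
    match tryFindNaN data with
    | Some i -> invalidArg name (sprintf "NaN found at index %d" i)
    | None -> data

// Passes through unchanged because no entry is NaN.
let features = requireNoNaN "features" [| 1.0; 2.0; 3.0 |]
```

Reporting the index keeps troubleshooting short, since the offending row is known before any expensive step has run.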

Without going too far in making everything perfect and maximally sophisticated everywhere, there is a plan that could bring some safety and long-term maintainability:

One can dream :)

In the meantime:

related: #280

smoothdeveloper commented 7 months ago

<remarks>Returns NaN if data is empty or if any entry is NaN.</remarks>

I think we can ensure consistency based on the presence of this remark, which seems to be in place (but it is not really discoverable in code, nor in the documentation pages).

We can also define an F# analyzer that looks for functions like sqrt, which can produce NaN.

If someone who groks maths (not me) could list here the F# and BCL functions used in this library that can produce NaN, it would help with the implementation of such an analyzer.
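One way to make such a remark checkable rather than only documented is a small contract test. This is a sketch under assumptions: `meanOrNaN` is an illustrative stand-in, not the FSharp.Stats implementation, and `honorsNaNContract` is a hypothetical helper name.

```fsharp
open System

/// Illustrative mean, NOT the library's code.
/// Returns NaN if data is empty or if any entry is NaN.
let meanOrNaN (data: seq<float>) : float =
    let mutable sum = 0.0
    let mutable count = 0
    for x in data do
        sum <- sum + x
        count <- count + 1
    // Empty input gives 0.0 / 0.0, which is NaN.
    // A NaN entry makes the running sum NaN, so the result is NaN too.
    sum / float count

/// Hypothetical check that a function honors the documented remark.
let honorsNaNContract (f: seq<float> -> float) : bool =
    Double.IsNaN (f Seq.empty)
    && Double.IsNaN (f (Seq.ofList [ 1.0; nan; 3.0 ]))
```

Running a check like this over every function that carries the remark would catch cases where the doc string and the behavior drift apart.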
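As a starting point for that list, here are a few well-known IEEE 754 cases reachable from F# core operators and `System.Math`. This is not an audit of which calls FSharp.Stats actually makes, and `nanCandidates` is just an illustrative name.

```fsharp
open System

// Expressions that evaluate to NaN under IEEE 754 semantics.
let nanCandidates : (string * float) list =
    [ ("sqrt (-1.0)", sqrt (-1.0))
      ("log (-1.0)", log (-1.0))
      ("acos 2.0", acos 2.0)
      ("asin 2.0", asin 2.0)
      ("0.0 / 0.0", 0.0 / 0.0)
      ("infinity - infinity", infinity - infinity)
      ("0.0 * infinity", 0.0 * infinity)
      ("Math.Pow(-8.0, 1.0 / 3.0)", Math.Pow(-8.0, 1.0 / 3.0)) ]

// Print each expression with whether it really is NaN.
for (expr, value) in nanCandidates do
    printfn "%-28s NaN: %b" expr (Double.IsNaN value)
```

Note that `log 0.0` is negative infinity rather than NaN, so an analyzer would probably want to treat infinities as a separate category.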

smoothdeveloper commented 7 months ago

One issue with the open FSharp.Stats.NumericallySafe approach is that you can only switch in your code using #if preprocessor directives; otherwise, you need to pass references to functions and rebind them in your own module based on some context.

There are scenarios where I'd want this to be done without recompiling and without being forced to rebind each function of interest.
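To illustrate both options, here is a minimal sketch. The local `NumericallySafe` module and the `NAN_SAFE` compilation symbol are assumptions made up for this example; nothing here is confirmed to exist in FSharp.Stats.

```fsharp
/// Hypothetical module standing in for a "safe" namespace.
module NumericallySafe =
    /// Shadows the core `sqrt` and throws instead of returning NaN.
    let sqrt (x: float) : float =
        let r = Operators.sqrt x
        if System.Double.IsNaN r then
            invalidArg "x" (sprintf "sqrt %g produced NaN" x)
        r

// Option 1: switch at compile time. NAN_SAFE is an assumed symbol.
#if NAN_SAFE
open NumericallySafe
#endif

// Resolves to NumericallySafe.sqrt only when compiled with NAN_SAFE defined.
let root (x: float) : float = sqrt x

// Option 2: pass the implementation explicitly and rebind it per context.
let rootWith (sqrtImpl: float -> float) (x: float) : float = sqrtImpl x
```

With option 1 the choice is fixed per build, and with option 2 every call site has to thread the function through.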
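One possible shape for that is a process-wide policy consulted at run time. This is only a sketch: `NaNPolicy`, `NaNGuard`, and the `FSHARP_STATS_NAN_POLICY` environment variable are all hypothetical names, not existing FSharp.Stats features.

```fsharp
open System

/// Hypothetical policy type.
type NaNPolicy =
    | Propagate   // return NaN as-is
    | Throw       // raise as soon as a NaN result appears

module NaNGuard =
    /// Process-wide setting, initialised from an assumed environment variable.
    let mutable policy =
        match Environment.GetEnvironmentVariable "FSHARP_STATS_NAN_POLICY" with
        | "throw" -> Throw
        | _ -> Propagate

    /// Applies the current policy to an already computed result.
    let check (name: string) (result: float) : float =
        match policy with
        | Throw when Double.IsNaN result -> failwithf "%s produced NaN" name
        | _ -> result

// A library function would route its result through the guard once.
let safeLog (x: float) : float = NaNGuard.check "log" (log x)
```

The trade-off is a branch on every guarded call and global mutable state, which may or may not be acceptable for hot numerical loops.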