AlexDaniel opened 5 years ago
@moritz any thoughts on this?
Just as a side note about stats on non-scalar data -- if you need more than one statistic, there's often a large performance advantage to calculating some or all of them at once in a single pass through the data. Certainly for known-immutable data it would be easy to cache the results for some statistics while calculating others, but in the general case it would be useful to have some way to request calculation of several stats (particularly commonly used ones) at once, without having to hand-roll one's own calculations -- the latter being frankly an easy way for non-experts to fall prey to all sorts of numerical stability issues.
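As an illustration of that single-pass point (in Python rather than Raku, with made-up names): Welford's online update lets you collect count, min, max, mean, and standard deviation in one traversal, without the stability problems of the naive sum-of-squares formula.

```python
import math

def one_pass_stats(xs):
    """Collect several statistics in a single pass over the data,
    using Welford's online algorithm for mean/variance."""
    n = 0
    mn = math.inf
    mx = -math.inf
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the running mean
    for x in xs:
        n += 1
        mn = min(mn, x)
        mx = max(mx, x)
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # numerically stable update
    stddev = math.sqrt(m2 / n) if n else float('nan')
    return {'n': n, 'min': mn, 'max': mx, 'mean': mean, 'stddev': stddev}
```

For example, `one_pass_stats([2, 4, 4, 4, 5, 5, 7, 9])` yields a mean of 5.0 and a (population) standard deviation of 2.0 while touching each element exactly once.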
IMHO these belong into a statistics module.
The naming is not obvious (don't tell me you want a function called `ndtr` by default in the setting in Perl 6, please; and I don't know if `average` or `avg` or `mean` is the best), as are the performance issues that @japhb mentioned.
Has anybody written such a module? This is a perfect use case of something that can be prototyped and ironed out outside the core language.
If there's a really well-working module, we might consider inclusion in core (though I still think it's out of scope for Perl 6).
though I still think it's out of scope for Perl 6
Well, Evan Miller makes a point that these should be part of the standard library. Then there's also:
☞ Math just is. Don’t make people declare it.
It also makes me wonder why something like `acosech` is in core but a commonly needed `mean` is not.
I agree, however, that the first implementation of all that can be done in a module.
don't tell me you want a function called ndtr by default in the setting in Perl 6, please
Of course not. From the article:
The Cephes folks seem to be stingy when it comes to doling out letters in function names, so the C committee may want to add a few characters to the above for clarity
☞ Math just is. Don’t make people declare it.
Yet none of us are trying to turn Perl 6 into a fully-featured Computer Algebra System.
(Side note, people have, in fact, proposed that in the past, but @TimToady has stopped them).
We have to draw a boundary somewhere. For me, the boundary excludes the beta- and gamma-related functions.
We can argue about `mean`, if you want, but then please be more precise about its semantics (what will it return for the empty list, for example?). And why "mean" as the name (when there is a geometric mean as well as the "normal" arithmetic mean), why not "average"?

mean/average and standard deviation suffer from the performance penalty of multi-pass calculation, which is why I think a regular function interface might not be the best, and why somebody should first come up with a working design in the form of a module.
FWIW, I think a `clamp` method should take a `Range` (or 2 values) as parameter. This would allow it to be used on e.g. a List, a Supply, etc.:

```
42.clamp(^10);              # 9
(10,20,30).clamp( 25..35 ); # (25,25,30)
```

etc. etc.
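The proposed semantics are easy to sketch (illustrative Python with hypothetical names; the Raku version would take a `Range` rather than two endpoints):

```python
def clamp(value, lo, hi):
    """Constrain a single value to the closed interval [lo, hi]."""
    return max(lo, min(value, hi))

def clamp_list(values, lo, hi):
    """Elementwise clamp, analogous to the (10,20,30).clamp(25..35) example."""
    return [clamp(v, lo, hi) for v in values]
```

With these definitions, `clamp(42, 0, 9)` gives `9` (the upper bound of the exclusive range `^10`) and `clamp_list([10, 20, 30], 25, 35)` gives `[25, 25, 30]`, matching the examples above.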
There's a wide variety of suggested additions here. I'm in principle not opposed to adding things to `CORE.setting`, but there should be a good argument for those we do add, as well as a lack of strong counter-arguments against adding them.

A general counter-argument is that everyone pays for the things we put into `CORE.setting`: its compiled form is over 14 MB by now, which everyone has to download, store, have mapped into memory, and so forth. While there will be technical measures we can take to make it more compact, and to further reduce the impact the setting size has on startup time, additions there will never be free. (Some argue "it makes the language bigger and so more to learn", but for things in `CORE.setting` I don't really buy that argument; you don't have to know all of a language's standard library in order to use the language. Or at least, I sure hope not, or I should stop programming. :-))
One consideration that has not yet been mentioned here is whether there is a significant performance benefit to be had from providing the operation as a built-in. If, for example, some platforms provide for doing the operation at CPU level, or there exists a means to implement it more efficiently than would be possible through the composition of other operations, then there's a case for having it in `CORE.setting` so we can JIT it into something good. I've no idea if this is the case for any of those suggested here; research is needed.
It's also worth considering how widely used something would be. For example, there's probably a quite strong case for `average`, which for most people means `sub average(@xs) { @xs.sum / @xs.elems }`, even if there are many other kinds of average. I suspect that's been defined by quite a few folks by now (and it's so short/simple to write, it's not really worth a module dependency).
A few assorted notes on various of the proposals:

- `prod` - if we were to add it - would want to be `product`.
- `clip` seems a more evocative name to me than `clamp`. Also, I think `(10,20,30).clamp( 25..35 );` should just be done using a `map` over the list, applying it to each element. It's arguably a useful enough thing to have in `CORE.setting`, but it's not an obvious list operation.

As for a way forward, I'd suggest proposals for `CORE.setting`:

- `clip` (or `clamp`, or `bound`, or whatever we end up calling it)
- `average` (with the semantics that the typical punter expects); I know that if you're doing other statistical things then it's more efficient to do it in a single pass, but my feeling is that - for better or worse - simple averaging is overwhelmingly the most commonly done thing. Even if we did later decide a means to calculate a bunch of statistical things at once belonged in `CORE.setting`, that'd still not take much from the value of a convenient `average` built-in.

I'm less sure `product` pulls its weight, especially since `@x.product` is more to type than `[*](@x)`, and unlike `sum`, I don't see any obvious optimization opportunities (we ended up with `.sum` because, if done on a `Range`, you can calculate the answer without iterating the `Range`). However, I'd entertain arguments for why it should be included.

To clarify the situation in this ticket: there was no proposal yet; the original post is simply stating that some functions may be missing. If somebody wants to make a proposal, see @jnthn's comment.
jnthn: Aside from the pure performance implications of using builtins, there's also a matter of numerical accuracy and stability; some of these functions may need to be calculated using the processor's extended precision (e.g. 80, 96, or 128 bits) in order to be accurate to one ULP (Unit in the Last Place) of their 64-bit output across their domain. Which is to say that some of them we'd just want to implement as VM ops or NativeCall to a math lib anyway, because we can't efficiently fake that extra precision in NQP space.
This annoyed me enough to create https://github.com/MattOates/Stats/blob/master/lib/Stats.pm6
The Evan Miller article is really great. The point of the functions he defines is that they are a core set of operations on which most of the rest of scientific programming is actually built at a higher level.
If this were in CORE.setting, though, I think it makes more sense to really limit what gets exported or provided to the common cases like mean/median/mode/stddev. A core module with `use Maths` for the more science/analysis end of the spectrum feels sensible. Outside of Rakudo Star, do we have properly core-level modules?
Really it would be great if these were optimised implementations from a maths library. Otherwise it's hard to see how this might not stunt or slow down development in the ecosystem in this space.
Outside of Rakudo Star do we have properly core level modules?
Yes, Telemetry comes to mind.
This annoyed me enough …
@MattOates, by any chance can you come up with a detailed proposal (discussing what should be available to everyone, what needs to be in a potential maths module and what is left for the ecosystem), and later an implementation? I think nobody objects that at least some things need to be added, we just need a knowledgeable person with enough tuits to think this through.
A Google search for the mean function in Raku led me to this page. I'm just starting to move to Raku (from Python previously) because I'm interested in language processing, not maths or statistical analysis, but I'm surprised to learn that some of the most basic statistical functions are missing. It's not too much trouble to compute the mean myself and I don't want to do advanced maths, but there does seem to be something wrong if the mean and standard deviation aren't part of the core language. I don't need to install a module for square root (do I?!)
You don't need to install a module for square root:
say sqrt(1764); # 42
It appears unlikely that any of this functionality will make it to core, unless it has been written in userland as a module first.
Having said that, I would be willing to put in some time to create such a statistics module. So please tell me what functions you need and what the exact algorithm should be (preferably by writing tests for them).
Thanks lizmat, that's kind of you, and it's good to know that at least Raku has a square root function. At the moment I only need a small selection of basic functions for summary statistics, to describe the middle (mean, mode, median), spread (standard deviation, range, percentiles), and shape (any measure of skewness) of a distribution of values for a single variable. I'd be happy to describe the algorithms if that would be helpful, although I'm no authority and I think they should follow those commonly given in introductory texts.
Part of the discussion about these operators is deciding on their exact semantics. I am willing to write the code; I'm not willing to start researching what people want :-) That's why I'm asking you to please be specific.
I wasn't only motivated by my own needs to post here; I also wanted to add my voice, for whatever that's worth, to confirm the OP's observation that there's a gaping void where statistics functions should be. If Raku is to be promoted as a general-purpose programming language then it matters. I'm just a beginner in Raku but I think I can write the simple stats functions I need myself, as I need them. It's just a bit inconvenient and bewildering that it's necessary.
The basic functions I suggested were very specific. This is how they're calculated:
- mean: the sum of the values divided by the number of values.
- mode: the value with the greatest frequency.
- median: the middle value when all values are arranged from highest to lowest (when there's an even number of values, the median is the mean of the middle 2 values).
- standard deviation: the square root of the variance.

To find the variance:

1. find the mean of the values;
2. subtract the mean from each value;
3. square each of those differences;
4. add up the squared differences;
5. divide by the number of values.

- range: the difference between the highest and lowest values.
- percentiles: p percent of the values lie below the p-th percentile, e.g. 10% of values are below the 10th percentile.
- skewness: a measure of whether the distribution is lop-sided. There are different ways to calculate this; one way (the alternative Pearson mode skewness) is: skew = 3 * (mean - median) / standard deviation
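A direct transcription of these definitions (illustrative Python, not a proposed Raku API; the percentile uses a simple nearest-rank rule, which is one choice among several) might look like:

```python
import math
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def mode(xs):
    # value with the greatest frequency
    return Counter(xs).most_common(1)[0][0]

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def variance(xs):
    # population variance: mean of squared deviations from the mean
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def stddev(xs):
    return math.sqrt(variance(xs))

def value_range(xs):
    return max(xs) - min(xs)

def percentile(xs, p):
    # nearest-rank style: smallest value with at least p% of data at or below it
    s = sorted(xs)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def pearson_mode_skewness(xs):
    # the "alternative Pearson mode skewness" formula from the comment above
    return 3 * (mean(xs) - median(xs)) / stddev(xs)
```

These follow the introductory-text definitions exactly; a real module would also need to pin down the edge cases the thread raises (empty lists, ties for the mode, and the various percentile interpolation schemes).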
Note in passing:
Variance step 2 and step 4 above hide opportunities for precision loss (because of cancellation of matching mantissa bits during the subtract, or adding values with different exponents in the sum). I believe there are more precise/stable algorithms for variance (I'm certain there are for sum), though I'd have to go looking through my numerical analysis texts to find the best ones for us.
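To make that cancellation concrete, here is a small Python demonstration (names are illustrative): the "textbook" one-pass formula var = mean(x²) − mean(x)² can lose essentially all precision when the mean dwarfs the spread, while shifting the data by a value near the mean (plus `math.fsum` for compensated summation) keeps the result accurate.

```python
import math

def naive_variance(xs):
    """Textbook one-pass formula: mean of squares minus square of mean.
    The two terms nearly cancel when the mean is large, so the subtraction
    can destroy most or all significant digits."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def shifted_variance(xs):
    """Shift the data by an arbitrary nearby value before squaring, so the
    intermediate sums stay small; math.fsum compensates rounding in the sums."""
    n = len(xs)
    k = xs[0]  # any value close to the mean works as the shift
    s  = math.fsum(x - k for x in xs)
    s2 = math.fsum((x - k) ** 2 for x in xs)
    return (s2 - s * s / n) / n

# Small spread sitting on top of a huge mean; the true variance is 22.5.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
```

Here `shifted_variance(data)` recovers 22.5, while `naive_variance(data)` returns a wildly wrong value, because the squares near 10¹⁸ have already rounded away the information in their last digits before the subtraction happens.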
Some examples of things that are missing:

- `clamp` or `clip` (https://stackoverflow.com/questions/55250700/is-there-a-clamp-method-sub-for-ranges-num-etc-in-perl6)
- special functions such as:

```
double incbet(double a, double b, double x);  # Regularized incomplete beta function
double incbi(double a, double b, double y);   # Inverse of incomplete beta integral
double igam(double a, double x);              # Regularized incomplete gamma integral
double igamc(double a, double x);             # Complemented incomplete gamma integral
double igami(double a, double p);             # Inverse of complemented incomplete gamma integral
double ndtr(double x);                        # Normal distribution function
double ndtri(double y);                       # Inverse of Normal distribution function
double jv(double v, double x);                # Bessel function of non-integer order
```

- `prod`. It's easy to do it yourself, but if we have `sum` then why not have `prod` too (for example, numpy has both)
- `mean`
- `median`
- `mode`?
- `peak-to-peak` (range) – (numpy example)
- `standard-deviation`
- `histogram`