Going scientific

Night-walker commented 10 years ago

Let's see what else I have in my stash... aha, a statistics module!

# Returns arithmetic mean of \a data.
# E[X] = Σx / N
mean(invar data: array<@T<int|float|double>>) => double

# Returns variance of \a data (measure of spread) of the given \a kind. 
# Uses \a mean if it is given
# Sample:       σ²[X] = Σ(x - E[X])² / (N - 1)
# Population:   σ²[X] = Σ(x - E[X])² / N
variance(invar data: array<@T<int|float|double>>, 
               kind: enum<sample,population> = $sample) => double
variance(invar data: array<@T<int|float|double>>, mean: double, 
               kind: enum<sample,population> = $sample) => double
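The two variance formulas can be sketched in Python as follows. This is only an illustration of the arithmetic, not the module's code: the function name mirrors the signature above, but the string-valued `kind` argument and the error handling are assumptions made for the example.

```python
from typing import Optional, Sequence


def variance(data: Sequence[float], mean: Optional[float] = None,
             kind: str = "sample") -> float:
    """Variance of data; reuses a precomputed mean when one is passed."""
    if kind not in ("sample", "population"):
        raise ValueError("kind must be 'sample' or 'population'")
    n = len(data)
    # The sample variance divides by N - 1, so it needs at least two values.
    if n == 0 or (kind == "sample" and n < 2):
        raise ValueError("not enough data")
    if mean is None:
        mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    divisor = n - 1 if kind == "sample" else n
    return squared_deviations / divisor
```

Passing a mean that was computed earlier saves one pass over the data, which is presumably why the second overload exists.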

# Returns median (middle value) of \a data while partially sorting it. 
# If \a data size is even, the mean of two middle values is returned
median(data: array<@T<int|float|double>>) => double

# Returns percentile \a percentage (the value below which the given percentage of
# sample values fall) of \a data while partially sorting it. \a percentage must be 
# in range (0; 100) 
percentile(data: array<@T<int|float|double>>, percentage: double) => double

# Returns mode (most common value) of \a data
mode(invar data: array<@T<int|float|double>>) => @T

# Returns minimum and maximum value in \a data
range(invar data: array<@T<int|float|double>>) => tuple<min: @T, max: @T>

# Returns distribution of values in \a data in the form \c value => \c frequency,
# where \c value is a single unique value and \c frequency is the number
# of its appearances in \a data
distribution(invar data: array<@T<int|float|double>>) => map<@T,int>

# Returns values of \a data grouped into ranges of width \a interval starting
# from \a start. The result is in the form \c index => \c frequency corresponding
# to the ranges present in the sample. \c index identifies the range: it is equal
# to the integer number of intervals \a interval between \a start and the beginning
# of the particular range; the exact range boundaries are 
# [\a start + \c index * \a interval; 
#  \a start + \c index * \a interval + \a interval).
# \c frequency is the number of values which fall in the range. Values less
# than \a start are not included in the resulting statistics
distribution(invar data: array<@T<int|float|double>>, interval: double, 
                  start = 0.0) => map<int,int>
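A minimal Python sketch of the grouping described above. It assumes that the range with a given index begins at start + index * interval, which follows from the index being the number of whole intervals between start and the beginning of the range. The names `binned_distribution` and `bin_bounds` are hypothetical and not part of the module.

```python
import math
from typing import Dict, Sequence, Tuple


def binned_distribution(data: Sequence[float], interval: float,
                        start: float = 0.0) -> Dict[int, int]:
    """Count values per range of width interval, beginning at start."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    counts: Dict[int, int] = {}
    for x in data:
        if x < start:
            continue  # values below start are left out of the statistics
        index = math.floor((x - start) / interval)
        counts[index] = counts.get(index, 0) + 1
    return counts


def bin_bounds(index: int, interval: float,
               start: float = 0.0) -> Tuple[float, float]:
    """Half-open range [lower, upper) covered by the given index."""
    lower = start + index * interval
    return lower, lower + interval
```

Only ranges that actually contain values appear in the result, so sparse data yields a small map rather than a long run of zero counts.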

# Returns correlation \a coefficient between \a data1 and \a data2. 
# Pearson coefficient measures linear dependence, Spearman's rank 
# coefficient measures monotonic dependence. If \a mean1 and \a mean2 
# are given, they are used for calculating Pearson coefficient.
# \note \a data1 and \a data2 must be of equal size
# Pearson:          r[X,Y] = E[(X - E[X])(Y - E[Y])] / (σ[X]σ[Y])
# Spearman's rank:  ρ[X,Y] = r(Xrank, Yrank)
correlation(invar data1: array<@T<int|float|double>>, invar data2: array<@T>,
                  coefficient: enum<pearson,spearman>) => double
correlation(invar data1: array<@T<int|float|double>>,invar data2: array<@T>,
                  coefficient: enum<pearson,spearman>, mean1: double, 
                  mean2: double) => double
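The two coefficients can be sketched in Python like this. It is an illustration under stated assumptions rather than the module's implementation: the function names are hypothetical, and giving tied values the average of their ranks is a common convention that the description above does not confirm.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson coefficient: covariance divided by both standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal size of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/N factors of the covariance and of the deviations cancel out,
    # so plain sums are enough here.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (dev_x * dev_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rank coefficient: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For y = x², which is monotonic but not linear on positive x, the Spearman coefficient is 1 while the Pearson coefficient stays below 1, which is the difference the description points at.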

# Returns skewness (measure of asymmetry) of \a data. Uses \a mean 
# if it is given.
# γ1[X] = E[((x - E[X]) / σ)^3]
skewness(invar data: array<@T<int|float|double>>) => double
skewness(invar data: array<@T<int|float|double>>, mean: double) => double

# Returns kurtosis (measure of "peakedness"). Uses \a mean if it is given
# γ2[X] = E[((x - E[X]) / σ)^4] - 3
kurtosis(invar data: array<@T<int|float|double>>) => double
kurtosis(invar data: array<@T<int|float|double>>, mean: double) => double
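Skewness and kurtosis are both standardized moments, so a single helper covers them in this Python sketch. It assumes σ is the population standard deviation (dividing by N), which the formulas above do not state explicitly; `standardized_moment` is a hypothetical helper name.

```python
import math
from typing import Optional, Sequence


def standardized_moment(data: Sequence[float], order: int,
                        mean: Optional[float] = None) -> float:
    """E[((x - E[X]) / sigma)^order], with sigma taken over the population."""
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    if mean is None:
        mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    if var == 0:
        raise ValueError("moment is undefined for constant data")
    sigma = math.sqrt(var)
    return sum(((x - mean) / sigma) ** order for x in data) / n


def skewness(data: Sequence[float], mean: Optional[float] = None) -> float:
    """Third standardized moment (measure of asymmetry)."""
    return standardized_moment(data, 3, mean)


def kurtosis(data: Sequence[float], mean: Optional[float] = None) -> float:
    """Fourth standardized moment minus 3 (excess kurtosis)."""
    return standardized_moment(data, 4, mean) - 3.0
```

Subtracting 3 makes the kurtosis of a normal distribution zero, matching the "- 3" in the formula above.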

Here's the code; the only thing left to implement is special handling of array slices.

dumblob commented 10 years ago

Have you considered adding support for the complex type? E.g. for the mean value, two possibilities come into consideration, as described in the discussion in the mathworks 30189 thread.

Night-walker commented 10 years ago

I don't think statistical concepts are applicable to complex numbers, even if it is possible to define the mean for this case. Complex numbers are a purely mathematical concept as far as I know, while statistics operates on samples of real-world data.

dumblob commented 10 years ago

I have the same perception about the domains of statistics and "pure" mathematics in this case, but I think that if we have built-in support for complex numbers, it should be supported wherever one can compute with numbers (keep in mind that this module contains statistical functions, but that doesn't mean that e.g. a variance value is not used anywhere else with complex numbers).

dumblob commented 10 years ago

Btw a date author line is missing under the license in the source code :)

Night-walker commented 10 years ago

> I have the same perception about the domains of statistics and "pure" mathematics in this case, but I think that if we have built-in support for complex numbers, it should be supported wherever one can compute with numbers (keep in mind that this module contains statistical functions, but that doesn't mean that e.g. a variance value is not used anywhere else with complex numbers).

There was a simple rule I followed while writing this module: if GSL doesn't support something, it's most likely not very useful or too specific. Supporting something just because it can be supported is essentially graphomania :)

Also, don't mix statistics with built-in arithmetic support for complex numbers. Arithmetic operations are purely abstract; mathematics itself does not operate with notions like practical meaning and usefulness. But statistics is an applied discipline which uses mathematics to analyze objects and processes which really exist, and yields practical results. It simply does not operate in the domain of complex numbers, and that's that.

> Btw a date author line is missing under the license in the source code :)

Certainly, as I am no prophet and cannot foresee when it will be committed :)

dumblob commented 10 years ago

> If GSL doesn't support something, it's most likely not very useful or too specific.

This logic makes sense. We can always implement support for complex numbers in the future if there is some demand for it.

> Certainly, as I am no prophet and cannot foresee when it will be committed :)

That sounds like a misunderstanding, because I thought the date should mark the "first release", i.e. usually some internal downstream release, not the time of commit (which is stored in the VCS anyway).

Night-walker commented 10 years ago

> That sounds like a misunderstanding, because I thought the date should mark the "first release", i.e. usually some internal downstream release, not the time of commit (which is stored in the VCS anyway).

Well, there is no "first release" either, and I don't know when the module will be deemed ready for such "release".

dumblob commented 10 years ago

> Well, there is no "first release" either, and I don't know when the module will be deemed ready for such "release".

In such cases, I try to use the first date when the piece of code switched from "playground for testing Dao features" mode to "I'll not delete this code as it'll be useful and worth developing", to document the "birth".

Night-walker commented 10 years ago

Done with slice handling.

daokoder commented 10 years ago

Sorry for the late response, but this is very cool! I have been looking forward to such modules.

Now let's consider where to put this module. I would create a modules/statistics folder and put statistics-related modules there. Each statistics module would normally consist of a dao_xxx.c file (possibly with an additional dao_xxx.h; a subfolder can be created if the module is too sophisticated for single files) and become a single compilation unit. For example, there could be modules/statistics/dao_common.c, modules/statistics/dao_tests.c, etc.; they would be compiled into modules/statistics/libdao_common.so and modules/statistics/libdao_tests.so, and could then be loaded with load statistics.common and load statistics.tests. There should also be something like modules/statistics/statistics.dao, which will load all the submodules.

I am not sure what we should name this module: should it be core, common, basics, etc.?

Night-walker commented 10 years ago

As you can see, I put a bit of this and a bit of that in the module. Mean, variance and median are basics, while higher moments and correlation are probably not. Distribution methods are of my own devising, they are not related to any statistical characteristics at all.

I think common describes it best.

dumblob commented 10 years ago

Yep, common sounds best.

daokoder commented 10 years ago

Let's add this module to the repo.

Night-walker commented 10 years ago

Done.

Night-walker commented 10 years ago

About os.fs. You must be doing it wrong :) You must be using an outdated MinGW version, as there is no X86_WIN64 macro defined for me, while all those things you altered are definitely present and work on my MinGW 4.8.2. Using an old MinGW is simply dangerous, as I've seen ridiculous bugs like broken standard library functions. Catering for every MinGW version with its specific limitations and bugs is a maintainer's nightmare.

daokoder commented 10 years ago

X86_WIN64 is defined by makefile.dao. My MinGW installation is outdated, but I am not sure if the latest MinGW works on Windows XP, which is supposedly obsolete. Maybe I am wrong; I will try it some other time. For now, let's make it work on such "outdated" platforms.

daokoder commented 10 years ago

Please look more carefully: my changes are not just for MinGW, but also for 32-bit platforms. Though the 32-bit Windows platform may fade eventually, it should be supported now.

Night-walker commented 10 years ago

> Please look more carefully: my changes are not just for MinGW, but also for 32-bit platforms. Though the 32-bit Windows platform may fade eventually, it should be supported now.

I don't see anything preventing those *64 structs and functions from being available on a 32-bit system, and MSDN doesn't mention anything like that either. They work on my 32-bit MinGW. If they aren't available for you, it is probably again because of an old MinGW version.

daokoder commented 10 years ago

These 64-bit structs and functions are for 64-bit time and files, so they have nothing to do with 32/64-bit machines. I will try the latest MinGW on Windows XP. I hope it will work; otherwise, we do need an option to use the 32-bit structs and functions, as XP is still widely used.

Night-walker commented 10 years ago

I happened to use MinGW on Windows XP myself for a significant period of time and can assure you that it works just fine. Those functions and structs are almost certainly available too.

daokoder commented 10 years ago

Then, this is settled.

Another minor issue: we discussed putting these statistics methods in a submodule named common. It would be better that way, but since we will not have other statistics functionality any time soon, it is not a problem now. Maybe this issue can be closed.

Night-walker commented 10 years ago

Module structure can be altered at any time; until some more statistics-related stuff is added (which may not be necessary at all for the standard module repository), statistics is the most obvious choice. For now, the work is done.

daokoder / dao

Going scientific #183