ddof parameter in FData.cov

pcuestas commented 1 year ago

The current behavior of the FData.cov method is to normalize the covariance by $N-1$ (default normalization applied by np.cov). This limits the functionality of the covariance function. In order to provide a more flexible covariance method, a parameter should be added so that this normalization coefficient can be specified by the user.

Describe the solution you'd like Add a parameter ddof to FData.cov (by default ddof=1) such that the normalization function applied by FData.cov is N-ddof.

ddof: int, optional “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

vnmabus commented 1 year ago

Note that this parameter should be also added to var.

pcuestas commented 1 year ago

I understand that ddof should also be 1 by default in var. This is going to change the default behavior of var, as it now normalizes by N.

vnmabus commented 1 year ago

That is right. I do not think that changing it will cause problems, as it is not a very used function.

pcuestas commented 1 year ago

Some tests are failing because of the modification in var.

================================== short test summary info ===================================
FAILED skfda/ml/clustering/_kmeans.py::skfda.ml.clustering._kmeans.FuzzyCMeans
FAILED skfda/tests/test_neighbors.py::TestNeighbors::test_score_functional_response - AssertionError: 
FAILED skfda/tests/test_scoring.py::TestScoreFunctionsGrid::test_all - AssertionError: 
FAILED skfda/tests/test_scoring.py::TestScoreFunctionGridBasis::test_all - AssertionError: 0.9859361757845293 != 0.992968013919044 within 2 places (0.00703183813451...
FAILED skfda/tests/test_scoring.py::TestScoreZeroDenominator::test_zero_r2 - AssertionError:

I am going to write ddof=0 in the calls to var that appear in the code:

This solves the issues with the tests.

vnmabus commented 1 year ago

Do not "just" write ddof=0. It would be best to analyze the intended denominator in each place.

pcuestas commented 1 year ago

Sorry, I thought those functions were already intentionally using $1/N$ as the variance normalization.

First case

A function _var is defined: https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/misc/scoring.py#L70-L82 and it is only used in two places.

Here:

https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/misc/scoring.py#L245-L250 where the normalization coefficient is not relevant, due to the division in line 250.

And here: https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/misc/scoring.py#L1011-L1020 where the stats.var function is called iff sample_weight=None, in which case the variable ss_res is also normalized by N by mean, so I think ddof should be 0 in this first case.

Second case (https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/ml/clustering/_kmeans.py#L150)

Here the variance is used only to calculate the tolerance used by the K-Means algorithm to check for convergence:

https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/ml/clustering/_kmeans.py#L149-L153

I believe that ddof=0 can be acceptable here but I do not have any arguments against ddof=1 other than the fact that the tests were built taking into account the previous definition of tolerance that used the variance with ddof=0.

vnmabus commented 1 year ago

I think that your analysis is correct.

GAA-UAM / scikit-fda

ddof parameter in FData.cov #555

First case

Second case (https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/ml/clustering/_kmeans.py#L150)