Note that this parameter should also be added to `var`.
I understand that `ddof` should also be 1 by default in `var`. This is going to change the default behavior of `var`, as it currently normalizes by $N$.
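For illustration, here is a minimal NumPy sketch (not the actual skfda `stats.var` code) of what the proposed `ddof` parameter would control: the sum of squared deviations is divided by $N - \text{ddof}$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size

# ddof=0 normalizes by N, ddof=1 by N - 1.
biased = np.sum((x - x.mean()) ** 2) / n
unbiased = np.sum((x - x.mean()) ** 2) / (n - 1)

assert np.isclose(biased, np.var(x, ddof=0))
assert np.isclose(unbiased, np.var(x, ddof=1))
```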
That is right. I do not think that changing it will cause problems, as it is not a widely used function.
Some tests are failing because of the modification in `var`.
```
================================== short test summary info ===================================
FAILED skfda/ml/clustering/_kmeans.py::skfda.ml.clustering._kmeans.FuzzyCMeans
FAILED skfda/tests/test_neighbors.py::TestNeighbors::test_score_functional_response - AssertionError:
FAILED skfda/tests/test_scoring.py::TestScoreFunctionsGrid::test_all - AssertionError:
FAILED skfda/tests/test_scoring.py::TestScoreFunctionGridBasis::test_all - AssertionError: 0.9859361757845293 != 0.992968013919044 within 2 places (0.00703183813451...
FAILED skfda/tests/test_scoring.py::TestScoreZeroDenominator::test_zero_r2 - AssertionError:
```
I am going to write `ddof=0` in the calls to `var` that appear in the code. This solves the issues with the tests.
Do not "just" write ddof=0
. It would be best to analyze the intended denominator in each place.
Sorry, I thought those functions were already intentionally using $1/N$ as the variance normalization.
A function `_var` is defined here: https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/misc/scoring.py#L70-L82, and it is only used in two places.
Here: https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/misc/scoring.py#L245-L250, where the normalization coefficient is not relevant, due to the division in line 250.
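To illustrate why the coefficient cancels there, here is a hedged sketch using an explained-variance-style ratio (the actual expression in scoring.py may differ): when the same `ddof` is used in both numerator and denominator, the $1/(N - \text{ddof})$ factor cancels.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# The 1 / (N - ddof) factor appears in both variances and cancels in the
# ratio, so the score does not depend on the choice of ddof.
score_ddof0 = 1 - np.var(y_true - y_pred, ddof=0) / np.var(y_true, ddof=0)
score_ddof1 = 1 - np.var(y_true - y_pred, ddof=1) / np.var(y_true, ddof=1)

assert np.isclose(score_ddof0, score_ddof1)
```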
And here: https://github.com/GAA-UAM/scikit-fda/blob/415796bd08fedff373961aa3c287902364d70233/skfda/misc/scoring.py#L1011-L1020, where the `stats.var` function is called iff `sample_weight=None`, in which case the variable `ss_res` is also normalized by $N$ (via `mean`), so I think `ddof` should be 0 in this first case.
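A rough sketch of that consistency argument (the actual code in scoring.py is structured differently): if `ss_res` is a plain mean, i.e. divided by $N$, then the denominator should also be normalized by $N$, which corresponds to `ddof=0`.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Both terms are divided by N, so the factors cancel and the usual R^2
# based on raw sums of squares is recovered.
ss_res = np.mean((y_true - y_pred) ** 2)   # divides by N
ss_tot = np.var(y_true, ddof=0)            # also divides by N
r2 = 1 - ss_res / ss_tot

r2_raw = 1 - (
    np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
)
assert np.isclose(r2, r2_raw)
```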
In `_kmeans.py`, the variance is used only to calculate the tolerance used by the K-Means algorithm to check for convergence.
I believe that `ddof=0` is acceptable here, but I do not have any argument against `ddof=1` other than the fact that the tests were built taking into account the previous definition of the tolerance, which used the variance with `ddof=0`.
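A rough sketch of how a data-dependent tolerance of this kind behaves (the details in `_kmeans.py` may differ): switching between `ddof=0` and `ddof=1` only rescales the tolerance by $N/(N-1)$, so either choice works, but it changes the exact values the existing tests expect.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
tol = 1e-4
n = X.shape[0]

# Tie the absolute convergence tolerance to the spread of the data.
tol_ddof0 = tol * np.mean(np.var(X, axis=0, ddof=0))
tol_ddof1 = tol * np.mean(np.var(X, axis=0, ddof=1))

# Using ddof=1 just rescales the tolerance by N / (N - 1).
assert np.isclose(tol_ddof1, tol_ddof0 * n / (n - 1))
```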
I think that your analysis is correct.
The current behavior of the `FData.cov` method is to normalize the covariance by $N-1$ (the default normalization applied by `np.cov`). This limits the functionality of the covariance function. In order to provide a more flexible covariance method, a parameter should be added so that this normalization coefficient can be specified by the user.

**Describe the solution you'd like**

Add a parameter `ddof` to `FData.cov` (by default `ddof=1`) such that the normalization coefficient applied by `FData.cov` is $N - \text{ddof}$.
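For context, a small example of the $N - \text{ddof}$ normalization that `np.cov` already exposes, which is the behavior the proposed `FData.cov` parameter would mirror (the `FData.cov` signature itself is only the proposal, not existing API):

```python
import numpy as np

# np.cov's default layout: rows are variables, columns are observations.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 3.0, 2.0, 4.0]])
n_obs = X.shape[1]

cov_unbiased = np.cov(X)        # default: divides by N - 1
cov_biased = np.cov(X, ddof=0)  # divides by N

# The two estimates differ only by the factor (N - 1) / N.
assert np.allclose(cov_biased, cov_unbiased * (n_obs - 1) / n_obs)
```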