GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Computed value of l2 distance differs between FData and np.array #441

Closed: aberges-grd closed this issue 2 years ago

aberges-grd commented 2 years ago

Describe the bug

Assume we have two arrays describing two smooth functions. When I create an FDataGrid object with range(100) as grid points, the computed Euclidean (L2) distance differs depending on whether I call l2_distance on the FDataGrid objects or on the NumPy arrays directly.

To Reproduce

Code to reproduce the behavior:

import skfda
from skfda.misc.metrics import l2_distance
from scipy.spatial.distance import euclidean
import numpy as np

# some data.
f1 = np.array([ 0.25333785,  0.21372725,  0.17809237,  0.14629781,  0.11820955,
        0.09369503,  0.07262312,  0.0548641 ,  0.04028969,  0.02877304,
        0.02018872,  0.01441273,  0.01132251,  0.01079692,  0.01271625,
        0.01696221,  0.02341795,  0.03196805,  0.04249851,  0.05489676,
        0.06905166,  0.0848535 ,  0.102194  ,  0.12096631,  0.141065  ,
        0.16238607,  0.18482695,  0.20828651,  0.23266504,  0.25786425,
        0.28378729,  0.31033873,  0.33742457,  0.36495226,  0.39283064,
        0.42097001,  0.44928209,  0.47768001,  0.50607837,  0.53439315,
        0.56254179,  0.59044315,  0.61801753,  0.64518663,  0.6718736 ,
        0.69800302,  0.72350089,  0.74829464,  0.77231314,  0.79548666,
        0.81774694,  0.83902711,  0.85926174,  0.87838685,  0.89633987,
        0.91305964,  0.92848647,  0.94256206,  0.95522957,  0.96643357,
        0.97612007,  0.98423648,  0.99073168,  0.99555595,  0.99866101,
        1.        ,  0.9995275 ,  0.99719952,  0.99297347,  0.98680823,
        0.97866408,  0.96850274,  0.95628735,  0.94198249,  0.92555416,
        0.9069698 ,  0.88619825,  0.86320981,  0.8379762 ,  0.81047057,
        0.78066747,  0.74854293,  0.71407437,  0.67724064,  0.63802205,
        0.5964003 ,  0.55235854,  0.50588134,  0.45695471,  0.40556608,
        0.3517043 ,  0.29535967,  0.2365239 ,  0.17519014,  0.11135295,
        0.04500835, -0.02384623, -0.09521194, -0.16908849, -0.24547416])
f2 = np.array([ 0.08696421,  0.16772222,  0.24298569,  0.31300073,  0.37800802,
        0.43824287,  0.49393516,  0.54530939,  0.59258466,  0.63597463,
        0.67568762,  0.71192649,  0.74488874,  0.77476644,  0.80174628,
        0.82600952,  0.84773206,  0.86708436,  0.88423149,  0.89933313,
        0.91254355,  0.92401161,  0.93388077,  0.94228912,  0.94936929,
        0.95524856,  0.96004879,  0.96388643,  0.96687254,  0.96911276,
        0.97070736,  0.97175118,  0.97233368,  0.97253889,  0.97244546,
        0.97212664,  0.97165026,  0.97107878,  0.97046921,  0.96987321,
        0.969337  ,  0.96890142,  0.9686019 ,  0.96846846,  0.96852574,
        0.96879296,  0.96928395,  0.97000712,  0.9709655 ,  0.9721567 ,
        0.97357295,  0.97520106,  0.97702243,  0.97901309,  0.98114365,
        0.9833793 ,  0.98567987,  0.98799974,  0.99028793,  0.99248804,
        0.99453827,  0.99637141,  0.99791486,  0.99909061,  0.99981526,
        1.        ,  0.99955062,  0.9983675 ,  0.99634563,  0.99337459,
        0.98933857,  0.98411634,  0.97758129,  0.96960139,  0.96003922,
        0.94875195,  0.93559135,  0.92040379,  0.90303024,  0.88330626,
        0.86106203,  0.8361223 ,  0.80830643,  0.77742839,  0.74329673,
        0.70571461,  0.66447977,  0.61938459,  0.57021599,  0.51675555,
        0.45877939,  0.39605827,  0.32835753,  0.25543711,  0.17705155,
        0.09294999,  0.00287617, -0.09343157, -0.19624031, -0.30582252])
# make fdatagrid
data = skfda.FDataGrid(
    data_matrix=[f1,f2],
    grid_points=range(100)
)
# distance computations
l2_distance(data[1], data[0])  # 4.25420004
l2_distance(f1, f2)            # 4.25631899
euclidean(f1, f2)              # 4.25631899

Expected behavior

For this example the domain of the functions is the range 0..99, so l2_distance should be equivalent to the Euclidean distance. I'd therefore expect l2_distance on the FDataGrid objects to give the same value as when it is called on the NumPy arrays.

Version information

vnmabus commented 2 years ago

The small difference can be explained by the quadrature: for FDataGrid the integral is computed with Simpson's rule, and its weights are not uniform:

import scipy.integrate
np.sqrt(np.sum((f1 - f2)**2))  # 4.256318990810707

# Simpson quadrature weights for 100 unit-spaced points:
# [0.41666667, 1.08333333, 1, 1, ..., 1, 1, 1.08333333, 0.41666667]
weights = scipy.integrate.simpson(np.eye(100))

np.sqrt(np.sum((f1 - f2)**2 * weights))  # 4.2542000336339605

This difference becomes smaller as the number of grid points grows, because the only four points with weights different from 1 contribute a negligible share of a large sum.
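
A rough sketch of that shrinking effect, in case it helps. It assumes the weight pattern quoted above (5/12 and 13/12 at the two points nearest each edge, 1 everywhere else) carries over to other grid sizes, and it hard-codes those values rather than recomputing them. The two test functions and the helper names `edge_corrected_weights` and `relative_gap` are made up for illustration; they are not part of scikit-fda or SciPy.

```python
import numpy as np


def edge_corrected_weights(n):
    """Weights as quoted above: 1 everywhere except the two points nearest each edge."""
    w = np.ones(n)
    w[[0, -1]] = 5 / 12   # 0.41666667
    w[[1, -2]] = 13 / 12  # 1.08333333
    return w


def relative_gap(n):
    """Relative gap between the plain and the weighted distance on n grid points."""
    t = np.linspace(0.0, 1.0, n)
    # Squared difference of two arbitrary smooth functions (illustrative choice).
    diff_sq = (np.sin(2 * np.pi * t) - np.cos(2 * np.pi * t)) ** 2
    plain = np.sqrt(np.sum(diff_sq))
    weighted = np.sqrt(np.sum(diff_sq * edge_corrected_weights(n)))
    return abs(plain - weighted) / plain


gaps = {n: relative_gap(n) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(f"n={n:5d}  relative gap={gap:.2e}")
```

The gap should drop as n increases, since only the four edge terms of the sum are ever reweighted.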

aberges-grd commented 2 years ago

Ah, I see. I was reporting this because in my case, the small differences were enough to change the result of an agglomerative clustering algorithm (where the skfda distances give a better clustering). So I was very confused.

As an edit (for completeness' sake): scipy's euclidean accepts a w parameter that applies exactly this weighting.

...
euclidean(f1, f2, weights) # 4.254200037595696
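
For anyone checking this later, here is a small self-contained sketch of what the w argument does, using made-up four-point arrays and weights (not the data from this issue): the weighted Euclidean distance is the square root of the weighted sum of squared differences.

```python
import numpy as np
from scipy.spatial.distance import euclidean

# Illustrative values only; not the f1/f2 arrays from the report.
u = np.array([0.0, 1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, 0.0, 5.0])
w = np.array([5 / 12, 13 / 12, 13 / 12, 5 / 12])

# Weighted distance written out by hand.
manual = np.sqrt(np.sum(w * (u - v) ** 2))

# The same quantity via scipy's weight argument.
via_scipy = euclidean(u, v, w=w)

print(manual, via_scipy)
```

Both values should agree up to floating-point rounding.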