Alternative implementation, to avoid any more "magic numbers" and provide the full range of possible p values:
We can simply blend together two post-sigmoid values: the default p calculation and the new smooth parameter, which is now defined such that a smooth value of 0.5 corresponds to the default p. This also ensures that the full range of possible p values is still achievable (the previous suggestion implicitly clipped these with the "magic number" 10).
import numpy as np
import scipy.sparse as sp


def _compute_smooth(a, b, scalefactor=None):
    """
    The calculation of the smoothing spline requires the solution of a
    linear system whose coefficient matrix has the form p*A + (1-p)*B, with
    the matrices A and B depending on the data sites x. The default value
    of p makes p*trace(A) equal (1 - p)*trace(B).

    If scalefactor coerces to float, the default value of p is adjusted
    by an additional smoothing factor in the range 0..1.
    """
    def trace(m: sp.dia_matrix):
        return m.diagonal().sum()

    p_initial = 1. / (1. + trace(a) / (6. * trace(b)))

    try:
        sf = float(scalefactor)
        if sf <= 0.:
            return 0.
        if sf >= 1.:
            return 1.
        # Blend the default p and the user factor in log-odds (sigmoid) space
        scale_initial = np.log(p_initial / (1. - p_initial))
        scale_modifier = np.log(sf / (1. - sf))
        p_modified = 1. / (1. + np.exp(-(scale_initial + scale_modifier)))
        return p_modified
    except (TypeError, ValueError):
        return p_initial
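For illustration, here is how the function above behaves with small stand-in diagonal matrices (these are not the real spline matrices, just a quick sanity check of the blending):

import numpy as np
import scipy.sparse as sp

# Small stand-in diagonal matrices, purely to exercise _compute_smooth above
a = sp.dia_matrix((np.full((1, 3), 2.), [0]), shape=(3, 3))
b = sp.dia_matrix((np.full((1, 3), 5.), [0]), shape=(3, 3))

print(_compute_smooth(a, b))        # default p from the trace ratio
print(_compute_smooth(a, b, 0.5))   # identical to the default p
print(_compute_smooth(a, b, 0.9))   # default p pushed towards 1
print(_compute_smooth(a, b, 0.1))   # default p pushed towards 0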
@shusheer thanks for contributing.
Making the smoothing parameter independent of the X scale is a good enhancement!
However, maybe we could use an additional parameter for this functionality instead of a string p parameter and exception handling?
Yes, I feel "icky" about stuffing effectively two different kinds of "smoothing" parameter into the same parameter value depending on whether it is a string or not. It's hacky.
However we don't want another "smoothing" parameter, as that'll cause confusion.
How about this: we have a bool that sets "trace-normalized smoothing", which defaults to False. If True, then the smoothing parameter is used in the manner of my second suggestion - i.e. we effectively calculate p from the trace ratio as one sigmoid value, and then apply the smooth parameter as a second sigmoid to that.
The issue will then be how we pass this through the general csaps object, because it appears that this only really applies to CubicSmoothingSpline.
However we don't want another "smoothing" parameter, as that'll cause confusion.
Of course, I agree. We can use a bool parameter for this, just as you say. We need to keep backward compatibility for existing users.
Ok, so I can see exactly how to implement this for CubicSmoothingSpline, but as I'm not using any of the other types for my application, I haven't thought deeply about e.g. the NDGrid case. From looking at the code, I think you're just repeatedly calling CubicSmoothingSpline with a vector of possible smooth values, so setting a global "normalizedsmooth" bool should just work. I also don't see anywhere else in the code that auto-smoothing parameters are calculated. If you can confirm that I only need to implement this in CubicSmoothingSpline, I'll update my code accordingly and push.
I also don't see anywhere else in the code that auto-smoothing parameters are calculated. If you can confirm that I only need to implement this in CubicSmoothingSpline, I'll update my code accordingly and push.
You are right. We need to implement this in CubicSmoothingSpline and add the parameter to NdGridCubicSmoothingSpline (it is just passed through to the CubicSmoothingSpline call inside).
Also it would be nice to add tests. :)
In building tests for this, and looking more deeply at what trace(a) and trace(b) are for the ratio that sets autosmoothing p, I think I've found a problem with the way autosmoothing works. Therefore if we are to provide some change to the way smoothing is specified, I think we might not want to use the present autosmoothing choice as the basis.
Looking at _compute_smooth and _make_spline:
trace(a) is equivalent to np.sum(2 * (dx[1:] + dx[:-1])), which is almost equivalent to 4 * np.ptp(x).
In other words, trace(a) represents the span of the x-values.
trace(b) is almost equivalent to np.sum(np.square(1./np.diff(x))/w[:-1]).
In other words, trace(b) represents how closely spaced the data are, to the second power.
Therefore trace(b) scales approximately with 1/(trace(a)^2), so the ratio of trace(a) to trace(b) scales with span^3 as we change the span of the x values.
This scaling with span^3 gives the same smoothing effect as we change the span of the data, so in this respect the autosmoothing parameter is working.
However, trace(b) is massively weighted towards the smallest gaps between points. For uniformly spaced points, this doesn't really matter, but for non-uniformly spaced points, this is a big problem, as trace(b) becomes a function of the smallest x-distance between points.
Therefore the actual amount of smoothing you get for autosmoothing crucially depends on the magnitude of smallest gaps in your data.
So I think that the whole mechanism of autosmoothing might be fatally flawed for datasets exhibiting any significant level of clumping in the data - which is a very common use-case for a smoothing function.
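Before the fuller example below, a quick numerical sanity check of these two claims, using the approximations just described (a rough sketch with unit weights, not the exact csaps internals):

import numpy as np

def approx_trace_ratio(x):
    # Approximations from above: trace(a) ~ 4 * span, trace(b) ~ sum(1/dx^2)
    dx = np.diff(x)
    trace_a = np.sum(2. * (dx[1:] + dx[:-1]))
    trace_b = np.sum(np.square(1. / dx))
    return trace_a / trace_b

# 1) For uniformly spaced points the ratio grows roughly with span^3
for span in (1., 10., 100.):
    print(span, approx_trace_ratio(np.linspace(0., span, 25)))

# 2) A single tiny gap collapses the ratio, because trace(b) is dominated by it
x = np.linspace(0., 10., 25)
x_clumped = x.copy()
x_clumped[12] = x[13] - 1e-6
print(approx_trace_ratio(x), approx_trace_ratio(x_clumped))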
Below is example code showing the problem, where we have a single point that is very close to the next point and utterly dominates the choice of autosmoothing parameter. It also shows that the "correct" smoothing parameter, chosen on the basis of uniformly spaced points, correctly smooths the data (i.e. some closely spaced points in the dataset don't interfere with the smoothing itself, only with the choice of autosmoothing parameter).
Clearly, the smoothing parameter needs to scale effectively with np.ptp(x)^3 in some sense, as this is what the autosmoothing parameter does, and this yields the same level of smoothing as we change x scale. However, the autosmoothing parameter should not be dominated by the closest points - if we want to somehow set smoothing on the basis of the x-data and the weights, and therefore on the basis of clumpiness, then a more useful metric might be something proportional to some power of the average spacing between points.
So, question: do we want to invent an entirely new autosmoothing algorithm? Following the previous logic, the user would specify some kind of "normalizedsmoothing" bool, and then the relative smoothness they request would be independent of span of x-values, and hopefully in some sense independent of the weights and of the clumping of the data. This is what the current autosmoothing choice attempts to do, but it fails for clumpy data.
import numpy as np
import matplotlib.pyplot as plt
from csaps import csaps

np.random.seed(1234)

x = np.linspace(-5., 5., 25)

# Replace the x-value of the middle datapoint with almost that of the next datapoint.
# These are at the peak of a curve, so the y-values should be very similar (within noise).
x_close = x.copy()
x_close[x.size//2] = x[1+x.size//2] - (x[1+x.size//2] - x[x.size//2])/10000

y = np.exp(-(x/2.5)**2) + (np.random.rand(25) - 0.2) * 0.3

# Higher-resolution spacing to show what happens
xs = np.linspace(x[0], x[-1], 150)

# Normal autosmoothing on the uniformly spaced data - provides the y data and the smoothing factor used
ys, sm = csaps(x, y, xs)

# Normal autosmoothing on the nonuniformly spaced data - provides the y data and the smoothing factor used
ys_close, sm_close = csaps(x_close, y, xs)

# Defined smoothing on the nonuniformly spaced data, using the smoothing parameter from the uniform spacing
ys_close_corrected = csaps(x_close, y, xs, smooth=sm)

print("Uniform smoothing parameter:", sm, "\nNonuniform smoothing parameter:", sm_close)
plt.plot(x, y, 'o', xs, ys, '-', xs, ys_close, '-', xs, ys_close_corrected, '-')
plt.show()
Good investigation!
In real cases, data such as x_close can be pre-processed and close values can be filtered out. Also, in some cases multivariate smoothing can be used. But anyway, it would be nice to have a robust auto-smoothing algorithm in the package! Moreover, the package may contain several algorithms which fit different data best (although it would be great to have something that depends minimally on the X values and is independent of the X scale).
Based on the insight above, we wish to set smoothing parameter p:
p = 1. / (1. + K)
For this we want a function K that takes account of x values and weights at each x value, scales with np.ptp(x)^3 and doesn't change too dramatically as we change clumpiness of the data, but increases the strength of smoothing (i.e. sets p closer to 0) when we have lots of missing data.
Regarding data clumping, there are three extreme cases:
This seems to work:
K = np.ptp(x)*np.sum(np.square(np.diff(x)*w[:-1]))/42
p = 1. / (1. + K)
The value of K scales with the third power of the x span, as is a desirable property of the current autosmoother. The clumping of data is somewhat taken into account, in that tiny gaps are completely ignored by the second power of the np.diff(x), but large gaps in the data lead to higher values of K, roughly in proportion to the number of "missing" points if they were supposed to be uniformly distributed, resulting in smaller values of p and therefore stronger smoothing. The weights are taken into account similarly to the current autosmoother.
The magic number 42 was discovered by trial and error, and was close enough to the hitchhikers' guide number that it simply had to be used ;)
I don't have a good feeling for the range of different applications that users might apply this code to, so I don't really know if this is a good method of choosing K (and therefore p), but it works for the few use-cases I tried.
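As a quick check (illustrative only, assuming unit weights), this K barely moves when one point is made nearly coincident with its neighbour, unlike the trace(b)-dominated ratio discussed earlier:

import numpy as np

def proposed_k(x, w=None):
    # The proposed K from above, with unit weights by default
    w = np.ones_like(x) if w is None else w
    return np.ptp(x) * np.sum(np.square(np.diff(x) * w[:-1])) / 42

x = np.linspace(-5., 5., 25)
x_close = x.copy()
x_close[x.size // 2] = x[1 + x.size // 2] - 1e-4   # nearly coincident pair

print(proposed_k(x), proposed_k(x_close))   # changes only slightly
print(1. / (1. + proposed_k(x)))            # the resulting p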
Assuming we have some function like the above that tries to estimate a certain level of smoothing for a given set of x data and weights, then I think we use the earlier discussed approach of allowing the user to tune the level of smoothing with an additional user-specified "smooth" parameter:
if smooth is None:
    return p

# Guard the edges so the log-odds below stay finite
if smooth <= 0. or p <= 0.:
    return 0.
if smooth >= 1. or p >= 1.:
    return 1.

scale_initial = np.log(p / (1. - p))
scale_modifier = np.log(smooth / (1. - smooth))
p = 1. / (1. + np.exp(-(scale_initial + scale_modifier)))
return p
Ok, I've introduced a new "normalizedsmooth" argument package-wide, which causes CubicSmoothingSpline to change the way smoothing is calculated when set to True. When False, the code paths are the same as previously.
When normalizedsmooth=True, a smoothing value K is calculated, based on the xdata and weights. The user-provided 'smooth' value, following the existing semantics of the [0,1] range, is taken as smooth=0.5 if smooth is None. p is then calculated as p = s/(s+(1-s)*K). This is mathematically the same as the long-winded exps and logs above, where we blend together two sigmoid variables, such that 0 means p=0, 1 means p=1, and 0.5 means we effectively just use K to calculate p.
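A small numerical check of that equivalence (K and s here are just placeholder values):

import numpy as np

def blend_sigmoids(k, s):
    # The "two sigmoid" form discussed earlier, with default p0 = 1/(1+K)
    p0 = 1. / (1. + k)
    logit = np.log(p0 / (1. - p0)) + np.log(s / (1. - s))
    return 1. / (1. + np.exp(-logit))

def closed_form(k, s):
    return s / (s + (1. - s) * k)

for k in (0.01, 1., 100.):
    for s in (0.1, 0.5, 0.9):
        assert np.isclose(blend_sigmoids(k, s), closed_form(k, s))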
As previously discussed, K scales with the cube of the span of the data, so that the results of smoothing are independent of the xdata scale and location. In addition, K scales with 1/cube of the number of data points, such that if you keep adding data points at the same rate, the smoothing factor does not change. This is the same behavior as the current autosmoother, but you couldn't modify the strength of the previous autosmoother, so this allows user-supplied smoothing values to be independent of scale and location of xdata, as well as being independent of the number of datapoints provided that the same spacing between points is maintained.
The above discussion is not quite correct - the number of data points used for the 1/cube relationship is modified to take account of data clumping and weighting. The "effective number of points" is calculated, based on the uniformity of spacing between points, such that very close points are effectively counted as a single effective point (we would naturally want infinitesimally close points to be averaged, rather than interpolated between). Therefore N points that are uniformly spaced, except for one infinitesimally close pair, are considered to be N-1 effective points. The corollary to this is that a very large gap between two points implies that all the other points are closely clumped together, so the effective number of points is more like two than N. A mixture of N otherwise uniformly spaced points containing M large gaps results in an effective number of points of M+1 rather than N.

The same logic is applied to weights - a singular very large weight means there is only a single effective point, two large weights mean two effective points, and so on. It is not assumed that there is any correlation between point weights and point spacing.

The 1/cube relationship for number of points is therefore a blend of the actual number of points and the effective number of points based on point spacing and point weighting. The chosen mixture, count^2 * effective_points^0.5 * effective_weights^0.5, has the property of being equal to count^3 for uniformly spaced and weighted points, whilst for nonuniformly spaced or weighted points this blend appears to give sensible levels of smoothing by inspection, but might benefit from review over a range of use-cases.
There is also one "magic number" in the formula, a constant of 80, which sets the strength of smoothing at the smooth=0.5 level. This number was found by minimising the absolute value of the z-score of the Wald–Wolfowitz runs test for the test data given in the project docs, because that example should pass the IID requirement for the runs test to be valid:
x = np.linspace(-5., 5., 25)
y = np.exp(-(x/2.5)**2) + (np.random.rand(25) - 0.2) * 0.3
The strength of smoothing required for any given test problem is related to the signal:noise ratio of the data, so this is necessarily a somewhat arbitrary setting. We could try to calculate an appropriate setting by taking account of the y-values in some manner, however I decided to keep the same inputs to the autosmoother as the current implementation (i.e. just xvalues and weights), as at least this new autosmoother allows the user to adjust strength of smoothing and allow the autosmoother to correct for xvalue and weight distributions.
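Putting the pieces above together, here is a sketch of how such a calculation could look. The function name and the exact formulas for the "effective" point and weight counts are my own illustration of the description above, not necessarily identical to the merged code:

import numpy as np

def normalized_smooth_p(x, w=None, smooth=None):
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)

    span = np.ptp(x)
    # "Effective" number of points from the spacing: ~N for uniform spacing,
    # ~N-1 when one pair is nearly coincident, ~2 when one huge gap dominates
    eff_x = 1. + span ** 2 / np.sum(np.diff(x) ** 2)
    # "Effective" number of points from the weights (one huge weight -> ~1)
    eff_w = np.sum(w) ** 2 / np.sum(w ** 2)

    # K scales with span^3 and with 1/(count^2 * eff_x^0.5 * eff_w^0.5),
    # which reduces to 1/count^3 for uniformly spaced, uniformly weighted data
    k = 80. * span ** 3 / (x.size ** 2 * np.sqrt(eff_x) * np.sqrt(eff_w))

    s = 0.5 if smooth is None else smooth
    return s / (s + (1. - s) * k)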
I have also added a test-case for the case of changing the scale of the xdata, as this should yield smoothing that gives exactly the same y-values at different x-scales, whereas the other modifications to xdata and weights lead to approximately equal rather than exactly equal results.
Apologies, somehow my text editor had modified line endings - but only in those two files, not the others. Odd.
@shusheer,
Thank you! I will make a new release soon.
Hi, I am currently implementing a smoothing spline, to be used for obtaining data on the second derivative (curvature) of a spline fit. I use a spline of order 7 and penalize the 4th derivative (instead of a cubic spline with a penalty on the 2nd derivative as in the csaps package). I want to make the smoothing parameter independent of the x data range, so I have used your normalization approach described above and adjusted it to work for 7th-order splines in the following way, which seems to work:
k = 80 * (span ** 7) * (x.size ** -7)
It is used in an academic context, so I would like to cite this method of normalizing the smoothing parameter. Do you have a preferred way of citing this? Or do you know of alternative references that make use of this normalization approach?
@shusheer,
I think the question above should be addressed to you.
I'm flattered that you think it worthy of citing!
I really just went through the analysis process described in the (rather extensive) text above. But as it's now part of the csaps package, and I make no claim over IP rights for this, perhaps just cite the package and, for the URL, link directly to this PR (https://github.com/espdev/csaps/pull/47)?
@shusheer
I think we need to add a CONTRIBUTORS file and add you to the contributors list. I'm sorry, I just forgot to do it.
@phbroberg You can see an example of how to cite here: https://github.com/espdev/csaps/issues/40
Sure, I'm happy with that. My full name for contributors file is "Shamus Husheer".
Looks like there's a bot that can help make this painless: https://allcontributors.org/docs/en/overview - I've never used it, so entirely up to @espdev if/how to go about this.
@shusheer
I have already added the contributors file manually: https://github.com/espdev/csaps/blob/master/CONTRIBUTORS.txt
There aren't enough contributors to the project to need a bot. :)
Hi again, Thank you for your replies. I will cite the package and link to this PR :-)
The smoothing parameter for CubicSmoothingSpline depends nontrivially on the scale of xdata. The same value of smooth yields totally different results if you simply multiply all your xdata values by 10, for example.
CubicSmoothingSpline._compute_smooth has this comment, also in the docs:
"The default value of p makes ptrace(A) equal (1 - p)trace(B)."
However the implementation is this:
return 1. / (1. + trace(a) / (6. * trace(b)))
Therefore the default value of p makes p*trace(A) equal (1 - p)*trace(B)/6.
Note the factor of 6.
The factor of 6 is then undone immediately afterwards in CubicSmoothingSpline._make_spline
I think the magic number 6 comes from the scaling of the 2nd-derivative coefficients, but it doesn't really matter why it is 6 for this discussion.
It transpires that the perceived quality of smoothing of the data varies in a non-trivial manner depending on the scale of x values, for a given setting of the smoothing parameter p. This means it is hard to generalize what level of smoothing is appropriate for a particular set of data.
However, when p is always calculated by the default method, the perceived quality of smoothing is independent of the scale of x values.
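A minimal illustration of this, using the example data from the project docs: the auto-chosen smooth value changes dramatically as the x scale changes, so no single fixed smooth value can behave the same at every scale.

import numpy as np
from csaps import csaps

np.random.seed(1234)
x = np.linspace(-5., 5., 25)
y = np.exp(-(x/2.5)**2) + (np.random.rand(25) - 0.2) * 0.3

for scale in (1., 10., 100.):
    xs = np.linspace(x[0] * scale, x[-1] * scale, 150)
    _, sm = csaps(x * scale, y, xs)   # auto-chosen p adapts to the x scale
    print(scale, sm)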
This gives a hint as to how to improve the setting of smoothness:
Define smoothness as:
Where s is a value from 0 to 1, and gives the same default smoothing when s=0.5.
The magic number 10 is picked so that you get a decent range of possible smoothness values over the s=[0,1] interval, and half that is subtracted with -5 so that the midpoint of the s=[0,1] interval equals the current magic number of 6.
This could be implemented either by adding a further initialization parameter for this "smoothness of the default smoothing", which would create parameter confusion, or by simply replacing the meaning of the "smooth" parameter with this "s" parameter, which would likely break existing code where people have carefully picked the correct smoothing value for their application.
Instead, we test the type of the smooth parameter, and bury this new s parameter in a different type, as a string that gets converted to a float. This avoids breaking existing code, which must pass either None or a value that coerces to float, and is easily tested for in CubicSmoothingSpline._make_spline by catching an exception in the calculation of pp.