cjekel / similarity_measures

Quantify the difference between two arbitrary curves in space
MIT License
246 stars 41 forks source link

frechet_dist input size is bounded by maximum recursion depth #6

Closed ArbelAmir closed 4 years ago

ArbelAmir commented 5 years ago

Consider the followings:

max_len = 1000
a = [[1,2,3] for i in range(max_len)]
b = [[1,6,3] for i in range(max_len)]
frechet_dist(a,b)

While running this code on a 32GB RAM machine it raises a stack-overflow error. I would suggest to switch the recursion based computations to iterative based computations using Queue's.

Is anyone currently working on optimizing the memory usage of frechet_dist ?

Thank you for your work, Arbel Amir

cjekel commented 5 years ago

Hi Arbel Amir,

Thank you for the code and comment. I think that people would be both interested in both optimizing memory useage and run time of Fréchet distance calculations. Any algorithms to do so would end up being a fairly significant contribution and probable a popular paper.

I don't get the RecursionError with max_len=1000 on my system with Python 3.6. Part of the problem with this is the default limit in Python is small for these calculations.

If my memory recalls, the discrete Fréchet distance is O(nm), where n is the length of a, and m is the length of b. So you can adjust the default recursion limit to get around this https://docs.python.org/3/library/sys.html#sys.setrecursionlimit , but then the calculation will take sometime.

What were you thinking on trying?

Thanks, CJ

cjekel commented 5 years ago

Is it easy to rewrite the algorithms iteratively?

SenZHANG-GitHub commented 5 years ago

The iteration code that solves this problem

def dp_frechet_dist(P, Q, pnorm=2):
    """ dynamic programming for discrete frechet distance """
    n = len(P)
    m = len(Q)
    ca = np.ones((n, m))
    ca = np.multiply(ca, -1)
    ca[0, 0] = minkowski_distance(P[0], Q[0], p=pnorm)
    for i in range(1, n):
        ca[i, 0] = max(ca[i-1, 0], minkowski_distance(P[i], Q[0], p=pnorm))
    for j in range(1, m):
        ca[0, j] = max(ca[0, j-1], minkowski_distance(P[0], Q[j], p=pnorm))
    for i in range(1, n):
        for j in range(1,m):
            ca[i, j] = max(min(ca[i-1, j], ca[i, j-1], ca[i-1, j-1]), minkowski_distance(P[i], Q[j], p=pnorm))
    return ca[n-1, m-1]

Discrete Frechet distance is essentially a DP problem that should best be coded with iteration. The original code uses ( if ca[i, j] > -1: return ca[i, j] ) in the recursive function to avoid the computation of finished nodes, which is a somewhat ok solution but leads to the stack-overflow problem for recursive functions. And in practice iteration is a little bit faster since we reduce the burden of funciton calling : )

cjekel commented 5 years ago

Thanks @SenZHANG-GitHub for the simple function call that follows the current style of the library!

Okay so the iterative code and the recursive code are both O(nm).

SenZHANG-GitHub commented 5 years ago

You are welcome : ) Yes they are both O(nm) w.r.t. the number of distance calculation and max/min functions.

cjekel commented 4 years ago

Quick study compared the performance of the recursion code to the dynamic programming (DP) code. Results showed that DP was about 1.2 times faster for five dimensional data with 1000 points on each curve (on my machine). Here's the notebook I used to come up with these numbers: https://github.com/cjekel/similarity_measures/blob/master/frechet_distance_recursion_vs_dp.ipynb

The DP algorithm also avoids Python's maximum recursion depth issues, so I think it makes sense to switch the Frechet implementation.

@SenZHANG-GitHub do you want to send a PR with this DP code replaces the recursive code (and you remove the _c function)? You can change the following docstring part to say thanks to yourself :)

"""
    Python has a default limit to the amount of recursive calls a single
    function can make. If you have a large dataset, you may need to increase
    this limit. Check out the following resources.

    https://docs.python.org/3/library/sys.html#sys.setrecursionlimit
    https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-in-python-and-how-to-increase-it

    Thanks to MaxBareiss
    https://gist.github.com/MaxBareiss/ba2f9441d9455b56fbc9

    This sets a global variable named pnorm, where pnorm = p.
"""

I think this issue will bring some nice changes, so thanks @ArbelAmir and @SenZHANG-GitHub !

SenZHANG-GitHub commented 4 years ago

Hi Charles, I'm indeed glad that it helps and yes I removed _c since it is used for the recursive function calling. Just a "thanks to" in the docstring part would be great. Thanks!

cjekel commented 4 years ago

https://github.com/cjekel/similarity_measures/commit/c83c58d75725745373d52d3d487d3f5c5c32d68b now uses the iterative DP code for the Frechet distance thanks to @SenZHANG-GitHub

Going to push to PyPi shortly.