cjekel / similarity_measures

Quantify the difference between two arbitrary curves in space
MIT License

Incorrect measurement of area between intersecting 2D curves #20

Open rafalszulejko opened 2 years ago

rafalszulejko commented 2 years ago

It appears that either my understanding of the area between two curves, or its calculation in the library, is incorrect (the referenced paper is paywalled). In the following example I have two plots, where the grey line is the original data and each plot shows a different blue spline. Visually, you can clearly see that the area between the two lines in the left plot is several times larger than the area in the right plot, but the calculation with similaritymeasures.area_between_two_curves shows only about a 2x difference.

[two plots: grey original airfoil data overlaid with two different blue splines]

Here is the GitHub gist where I present the calculation (the GDrive zip with the airfoils is public, so the whole thing can be run in Colab or elsewhere if you modify the path in the 2nd cell): https://gist.github.com/rafalszulejko/2c9ff645b448d60d857975a8f7965045#file-wing-optimization2-ipynb

cjekel commented 2 years ago

The algorithm is not behind a paywall, and here is a copy of the paper if you want to read it.

You are not using enough discrete data points to effectively calculate the area. The method makes no assumptions about the curvature between discrete data points, other than that consecutive data points can be connected with straight lines. This is important, because straight lines allow us to use quadrilaterals.

Let me run that script and plot the quadrilaterals so you can get a sense of what is happening. Give me an hour or so.
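The sensitivity to sampling density is easy to see even without the library's quadrilateral algorithm. Below is a minimal toy sketch (plain NumPy, not similaritymeasures itself, and not the paper's method) that computes the area between y = sin(x) and the x axis with a simple shoelace formula; the exact area is 2, and coarse sampling misses a noticeable fraction of it:

```python
import numpy as np

def polygon_area(curve1, curve2):
    # Shoelace area of the closed polygon formed by walking along curve1
    # and back along curve2 (both are (n, 2) arrays sharing endpoints).
    pts = np.vstack([curve1, curve2[::-1]])
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def sampled_area(n):
    # Area between y = sin(x) and y = 0 on [0, pi]; the exact value is 2.
    x = np.linspace(0, np.pi, n)
    upper = np.column_stack([x, np.sin(x)])
    lower = np.column_stack([x, np.zeros(n)])
    return polygon_area(upper, lower)

print(sampled_area(5))    # coarse sampling underestimates the area
print(sampled_area(200))  # fine sampling approaches the exact value of 2
```

The same effect drives the airfoil comparison: with few points, the straight-line segments cut corners off the true region between the curves.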

rafalszulejko commented 2 years ago

Thank you for the quick reply. To clarify: the big blue points are just the nodes through which the splines are interpolated. The spline is then evaluated to produce the discrete data points. The actual blue line (out in the notebook) has 101 data points, and the grey one (af_pts) has 97 - is that still not enough?

I really appreciate you looking into the script, thank you!

cjekel commented 2 years ago

When I ran your code, I noticed there were only 15 data points... Are you indeed using 100? Let me still plot the quadrilaterals to illustrate what's happening.

Edit: Ah, I see now that you did have more than 15 points in the comparison... It might be a sampling rate issue.

rafalszulejko commented 2 years ago

I believe you are still referring to the XX and YY variables, which in the second plot indeed have a length of 15. As implied in my previous comment, that variable, while important to my project, is unrelated to the area calculation. The actual data points are stored in the af_pts and out arrays. I am sorry, I should have stated that clearly in the first place; the plot should have reflected the problem rather than something I wanted to illustrate for my project.

I removed that information from the plot. Now the blue points are the evaluated points from the spline (the data is unchanged). I updated the gist: https://gist.github.com/rafalszulejko/2c9ff645b448d60d857975a8f7965045#file-wing-optimization2-ipynb


cjekel commented 2 years ago

The quadrilaterals actually look good... it seems like you are getting a good area value.

[plot of the quadrilaterals between the two curves]

cjekel commented 2 years ago

So here is a convergence study, increasing the number of data points. You can see the case on the right sees a more drastic change than the airfoil on the left.

I would consider using at least 200 data points.

Instead of doing an area comparison directly, you might want to compare the area difference relative to the true cross-sectional area of the airfoil. This should help normalize the area comparisons.
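That normalization could be sketched as below. Everything here is illustrative: the ellipse stands in for a real closed airfoil outline, and the area value is just an example number, not a computed result:

```python
import numpy as np

def polygon_area(pts):
    # Shoelace formula for a closed polygon given as an (n, 2) array.
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Illustrative closed "airfoil-like" outline: a thin ellipse, chord 1.
t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
outline = np.column_stack([0.5 + 0.5 * np.cos(t), 0.06 * np.sin(t)])
reference_area = polygon_area(outline)  # close to pi * 0.5 * 0.06

# Suppose area_between came from
# similaritymeasures.area_between_two_curves(af_pts, out):
area_between = 0.001548  # example value
relative_error = area_between / reference_area
print(relative_error)
```

Dividing by the reference cross-sectional area makes the two airfoil comparisons dimensionless and directly comparable.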

My bisection algorithm for adding more data points is very slow. If you can make both curves have the exact same number of data points, then the calculation should be quicker (at least I think so, from memory).

Again the quadrilaterals all look good.

202 data points

  unew = np.arange(0, 1.01, 0.005)  # why is 1.01 better than u[-1]???

1010 data points

  unew = np.arange(0, 1.01, 0.001)  # why is 1.01 better than u[-1]???

2022 data points

  unew = np.arange(0, 1.01, 0.0005)  # why is 1.01 better than u[-1]???

xxxx data points (still running)...

  unew = np.arange(0, 1.01, 0.0001)  # why is 1.01 better than u[-1]???

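Equalizing the number of points on both curves, as suggested above, can be done by resampling each curve at evenly spaced fractions of its arc length. A minimal sketch (the function name is mine, not part of the library):

```python
import numpy as np

def resample_curve(curve, n):
    # curve: (m, 2) array of ordered points; returns (n, 2) points
    # evenly spaced along the polyline's arc length.
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s /= s[-1]  # normalize arc length to [0, 1]
    sn = np.linspace(0.0, 1.0, n)
    return np.column_stack([np.interp(sn, s, curve[:, 0]),
                            np.interp(sn, s, curve[:, 1])])

# Example: bring the 97-point and 101-point curves to a common count, e.g.
# curve_a = resample_curve(af_pts, 200)
# curve_b = resample_curve(out_pts, 200)
```

Linear resampling introduces its own small error between the original points, so it is best applied after the curves are already densely sampled.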
cjekel commented 2 years ago

And another tip.

def interpolateAirfoilWithNodes(af_ptsx, af_ptsy, nodes):
  LE_index = np.argmin(af_ptsx)  # leading edge index (unused in this snippet)
  xx, yy = getXYnodes(af_ptsx, af_ptsy, nodes)
  tck, u = interpolate.splprep([xx, yy], s=0)
  unew = np.arange(0, 1.01, 0.01)  # why is 1.01 better than u[-1]???
  out = interpolate.splev(unew, tck)
  return out, u

Instead of picking unew evenly spaced, you may want to sample unew exactly at the af_ptsx locations, to ensure the exact same sampling rate with respect to the x axis.
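One way to sketch that: map the target x stations onto the spline parameter using the (x, u) pairs that splprep returns at the nodes. This assumes x increases monotonically along the curve, so for a full airfoil you would split the upper and lower surfaces first; the curve below is a stand-in for one surface:

```python
import numpy as np
from scipy import interpolate

# Illustrative monotone-in-x curve standing in for one airfoil surface.
xx = np.linspace(0.0, 1.0, 20)
yy = 0.1 * np.sin(2 * np.pi * xx)

tck, u = interpolate.splprep([xx, yy], s=0)

# Target x stations (af_ptsx in the notebook would go here).
x_targets = np.array([0.25, 0.33, 0.57, 0.8])
# Approximate parameter values via linear interpolation of u over x.
u_targets = np.interp(x_targets, xx, u)
out = interpolate.splev(u_targets, tck)
print(out[0])  # x coordinates of the evaluated points, close to x_targets
```

The parameter u is a normalized chord length, not x itself, so the interpolated u_targets are approximate; with densely placed nodes the resulting x error is small.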

rafalszulejko commented 2 years ago

Interesting - these new results more closely resemble what I have just obtained with MATLAB polyshapes. For the original numbers of data points (97 and 101), I obtained 0.006484 for the left plot and 0.001548 for the right. I think I disabled every automatic correction, but I am not 100% sure this should be the 'ground truth'.

rafalszulejko commented 2 years ago

Okay, here is the gist with the calculation of the area in MATLAB; I just exported the data points from the notebook and put them into the script. https://gist.github.com/rafalszulejko/6c961edb0cf4f1b1fc544678f90ca65d

Are you able to run it? And even if we can take its value as the ground truth, what then? I am not very familiar with how things are done in the open source community. I am also not sure if I could fix this myself (and post a PR).

cjekel commented 2 years ago

I'm sorry for the delayed response. I've been completely swamped with preparations, and I'm currently at a conference. This airfoil work is something I am interested in.

I have not had a chance to run the matlab routine.

I will just say, my method is always going to be an approximate method. One thing you can do to improve the approximation would be to tweak the number of data points as I suggested: either use more data points, or adjust them so that the sampling rate is better.

That said, I do have some new ideas on how to get a better, more 'exact' area.

Where are you on this? I expect I'll have more free time towards the end of the month.

rafalszulejko commented 2 years ago

No worries - the project I'm working on is at a stage where I can comfortably work on a different part of it. The workaround helps a lot for a quick estimation (and I really appreciate the time you have already put into this problem), though for a step of 0.0005 we still see a 5.7% error for the plot on the right, assuming MATLAB's result is correct. Also, I am not sure if, and how, the shape of the calculated area influences the error.

If you'd like to, and have time to, continue investigating this, I'll be happy to provide more "problematic" data along with the results obtained with MATLAB.