OpenPIV / openpiv-python

OpenPIV is an open source Particle Image Velocimetry analysis software written in Python and Cython
http://www.openpiv.net
GNU General Public License v3.0

windef.multipass_img_deform() suddenly stops the execution of the code without throwing any error. #330


nepomnyi commented 1 month ago

Description

I am running a windef-based code on a batch of 1636 images. The code suddenly stops at a seemingly random image pair and doesn't show any error message.

Details

This has been a persistent issue. Here are my findings, briefly:

1) StackExchange suggests it can be a memory leak. Memory leaks are often attributed to regular expressions, which I use in my own code and which are also used in windef functions. I "turned off" the PIV-processing part of my code and ran it with all my regular expressions in place - no problems. I concluded it wasn't my regular expressions that led to the memory leak.

2) I decided the next most likely troublemaker was multiprocessing, so I tried running windef without it. I copied the func() function nested inside piv() in the windef module and assembled a custom windef algorithm that runs in a plain for loop without multiprocessing. I got the same problem - the code crashed at the same PIV image pair.

3) Then I noticed that I was supplying PIV images with masked bubbles to my custom windef algorithm. I went ahead and supplied the original PIV images instead (with the bubbles not masked). Success! I concluded that, most likely, the correlation-finding algorithm of windef.multipass_img_deform() doesn't like "empty" spaces in PIV images and that, somehow, those empty spaces result in a memory leak. By "empty" spaces I of course mean the masked regions of the PIV images.

4) So, I supplied the non-masked (original) PIV images to my custom windef algorithm, and my code could now process the problematic PIV image pair (pair number 246). But then it stopped again, at PIV image pair number 1063, without any error. I subtract the background from my PIV images before supplying them to my algorithm. The background varies over time (I used an LED backlight that created time-varying lighting), so I collected background images and picked the brightest one to subtract. When I looked at the background-subtracted images, I noticed that pair 1063 was very "rarefied" after subtraction, i.e., there were a lot of "empty" spaces. So, instead of subtracting the brightest background, I subtracted the average background. Success! PIV image pair number 1063 got processed.

5) But then PIV image pair number 554, from a subsequent batch of 1636 images, crashed the code. Again, I noticed too many "empty" spaces after subtracting the average background for this particular pair. So I took the average background, multiplied it by a coefficient that brings its average intensity to the average intensity of the original PIV image it is subtracted from, and then subtracted it (see the sketch after this list). Success! The code has been running for more than 9 hours without a crash so far.

6) Based on that, I concluded that when windef.multipass_img_deform() can't find enough PIV particles within an interrogation window, a memory leak occurs. That's where I stopped.
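To show what I mean by bringing the background to the image's intensity level, here is a small sketch of the idea. It is illustrative only - the function name and the clipping choice are not taken from memoryLeak.py, and I assume grayscale frames loaded as numpy arrays:

```python
import numpy as np

def subtract_scaled_background(frame, background):
    """Scale the average background so its mean intensity matches the frame's
    mean intensity, then subtract it (clipping negative pixels to zero)."""
    frame = frame.astype(np.float64)
    background = background.astype(np.float64)
    coeff = frame.mean() / background.mean()   # brings the background to the frame's mean level
    return np.clip(frame - coeff * background, 0, None)
```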

How to reproduce

Here's the link to my Google Drive where I am sharing my code and images. Please note that the code and images are not yet published; I plan to publish them soon. So please use them solely for the purpose of finding the bug in windef and then delete them from your computers.

Following the link, you'll find a folder with the following structure:

- "backgrounds" contains 100 backgrounds for frames A and B.
- "images" contains all the PIV pairs that crashed my code.
- "PIVres" contains the folder "temp", which in turn contains:
  - "Backgrounds": its subfolder "PIV" contains the brightest backgrounds for frames A and B, its subfolder "shadow" the dimmest backgrounds, and its subfolder "average" the average backgrounds.
  - "windefBubbleMasks" contains the bubble masks.
  - "windefImages" contains the PIV images with subtracted backgrounds.
  - "windefResults" contains the folder "rawvLtxtwindef" with the OpenPIV txt files produced right after the windef step, as well as the same txt files with the bubble masks applied.

The code itself is in the root and is called "memoryLeak.py".

Copy the folder onto your computer. Navigate into the folder. Run the code from within the folder. You'll see no issues with the code.

Now, go to lines 597 and 598 of the code where I subtract the backgrounds. There, replace the argument newBgrAave with the argument bgrAave on line 597 and the argument newBgrBave with the argument bgrBave on line 598. You will see that PIV image pair number 554 crashes the code without an error. The arguments newBgrAave and newBgrBave are the average backgrounds multiplied by a coefficient to make their average intensity equal to the average intensity of the raw frames A and B. The arguments bgrAave and bgrBave are just the average backgrounds.

Then do the same, but use the argument bgrAPIV for newBgrAave on line 597 and the argument bgrBPIV for newBgrBave on line 598. Here the arguments bgrAPIV and bgrBPIV are the brightest backgrounds for frames A and B. You'll see how PIV image pair number 1063 crashes the code.

Then go ahead and comment out lines 625-630 and uncomment lines 615-620. On lines 625-630 I supply the original PIV images (of course with subtracted backgrounds) to my custom windef algorithm, whereas on lines 615-620 I supply the images where the bubbles are masked. You will see how PIV image pairs 246, 335, 336 crash the code.

You can play with all that and see with your own eyes how wrong background subtraction can easily create a lot of "empty" spaces (those images are in the folder "windefImages").

I do think those "empty" spaces lead to memory leaks not in the OpenPIV package itself, but in the numpy fast Fourier transform functions, which cannot find correlation peaks where there are no particles in the interrogation windows.
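To illustrate the suspicion (a standalone check, not OpenPIV code): the FFT-based cross-correlation of two "empty" interrogation windows gives a flat plane, so there is no meaningful displacement peak to pick:

```python
import numpy as np

win_a = np.zeros((16, 16))   # "empty" interrogation window (masked region / no particles)
win_b = np.zeros((16, 16))

# circular cross-correlation via FFT, conceptually what the PIV correlation step does
corr = np.fft.irfft2(np.fft.rfft2(win_a) * np.conj(np.fft.rfft2(win_b)), s=win_a.shape)

print(corr.max(), corr.min())                       # both ~0: the correlation plane is flat
print(np.unravel_index(corr.argmax(), corr.shape))  # the argmax of a flat plane is arbitrary
```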

alexlib commented 1 month ago

Thanks. It seems rather clear where to look for the problem. I am glad to see you found the main issue and can run your code for 9 hrs. The question is: what do you expect the software to do when it recognizes a leak or a problematic pair - drop it and continue? Raise a warning? Stop and wait for a solution?

nepomnyi commented 1 month ago

Yes, I agree it is clear now where to look for the problem. But it was mere chance that I figured it out.

So, I think the first thing to do is to update the windef documentation, suggesting that users either supply original PIV images without any masked regions or, if a background is subtracted, make sure the average background intensity equals the average image intensity before subtraction. This is a quick patch.

The second thing to do is to understand the mechanism behind the memory leak, and I am not sure how to do that. StackExchange has suggestions, but they are very technical. It would be great to have a real programmer for this task.

Third, like you say, we need to decide how the software should proceed. I think we need a proper solution here. For instance, if windef can't get a vector for the current interrogation window, it should assign NaN to it instead of leaking memory. We should also give the user a warning that a memory leak may have occurred and that the pair of PIV images needs to be refined (along with suggestions on how to refine it). If we can't do all that, then dropping the image pair is the next best solution, I think.

To summarize, the big question right now is where exactly the memory leak occurs and how to intercept it.

alexlib commented 1 month ago

It would be great if we could see which line or lines are responsible. I guess it is the multigrid loop, but this is just my guess.

It will take some time to look into it.

nepomnyi commented 1 month ago

You see, when I run it, I can see it failing at the last pass of the multipass (which operates on a 16 px interrogation window with 0% overlap). There were cases when even the first pass of the multipass crashed (32 px interrogation window with 50% overlap). But first_pass itself always works (64 px interrogation window with 50% overlap).

When I look at the windef algorithm, I can see it uses extended_search_area() under the hood. The only piece in extended_search_area() that is concerned with the size of the interrogation window is numpy's fft function. That's why I think the issue is on numpy's side.
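For reference, the pass layout I describe above corresponds roughly to the following windef settings. The attribute and class names are those of recent openpiv-python releases (older versions use windef.Settings()); this is a sketch, not an excerpt from my script:

```python
from openpiv import windef

settings = windef.PIVSettings()   # windef.Settings() in older releases

# first pass: 64 px windows at 50 % overlap,
# multipass refinements: 32 px at 50 % and 16 px at 0 %
settings.num_iterations = 3
settings.windowsizes = (64, 32, 16)
settings.overlap = (32, 16, 0)

# the correlation step itself is FFT-based
settings.correlation_method = 'circular'
```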

alexlib commented 1 month ago

Please create two or three different files with the names as you propose. I'm not sure I understand what you want me to replace with what.

> Now, go to lines 597 and 598 of the code where I subtract the backgrounds. There, replace the argument newBgrAave with the argument bgrAave on line 597 and the argument newBgrBave with the argument bgrBave on line 598. You will see that PIV image pair number 554 crashes the code without an error. The arguments newBgrAave and newBgrBave are the average backgrounds multiplied by a coefficient to make their average intensity equal to the average intensity of the raw frames A and B. The arguments bgrAave and bgrBave are just the average backgrounds.
>
> Then do the same, but use the argument bgrAPIV for newBgrAave on line 597 and the argument bgrBPIV for newBgrBave on line 598. Here the arguments bgrAPIV and bgrBPIV are the brightest backgrounds for frames A and B. You'll see how PIV image pair number 1063 crashes the code.
>
> Then go ahead and comment out lines 625-630 and uncomment lines 615-620. On lines 625-630 I supply the original PIV images (of course with subtracted backgrounds) to my custom windef algorithm, whereas on lines 615-620 I supply the images where the bubbles are masked. You will see how PIV image pairs 246, 335, 336 crash the code.
>
> You can play with all that and see with your own eyes how wrong background subtraction can easily create a lot of "empty" spaces (those images are in the folder "windefImages").
>
> I do think those "empty" spaces lead to memory leaks not in the OpenPIV package itself, but in the numpy fast Fourier transform functions, which cannot find correlation peaks where there are no particles in the interrogation windows.

nepomnyi commented 1 month ago

Done. When the Google drive updates itself, you'll see 3 new .py files:

For some reason, I couldn't reproduce the situation where PIV pairs 246, 335, 336 crash the code. This is so surprising! Maybe I forgot the exact configuration of the program when it was crashed by pairs 246, 335, 336. Still, I left those PIV pairs in the folder so that you can compare how good PIV pairs are processed to what happens when a PIV pair crashes the code.

Also note that the function PIVanalysisWindefIndivid(), lines 352-467, is where my custom windef algorithm is located. Essentially, I just copied the function func(), which is nested inside the function piv() in the windef module of the OpenPIV package.

I put the settings for windef in a separate function PIVwindefSettings(), lines 196-350.

You can check the images that, after background subtraction, are supplied to the windef algorithm in the folder PIVres -> temp -> windefImages.

nepomnyi commented 1 month ago

This thing keeps stopping without any error... much more seldom now, though. I think I have traced the issue to the use of scipy.interpolate.RectBivariateSpline() on lines 690 and 693 of the windef module, so I am not blaming numpy's fft anymore. I found an open issue on scipy's GitHub page about RectBivariateSpline. That issue is not about a memory leak, but still. I asked whether they had resolved it and they said no. I am not sure how to troubleshoot further.

alexlib commented 1 month ago

On my machine, the problem is in np.all(flags) after validation. This is explainable: when the image is completely black, or, in your case, when the background is identical to the image, all flags are ON and there's nothing to propagate to the next multi-pass level. I am still thinking about how to deal with it. It's a degenerate case, i.e. not a normal one. I think it's sufficient to stop the execution. Actually, this test raises an error, but the multiprocessing is probably covering it up and the run just stops.
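Something along these lines could catch it (just a sketch of the idea, not the actual windef code):

```python
import warnings
import numpy as np

def all_vectors_invalid(flags, pair_name):
    """Return True (and warn) if validation flagged every vector, i.e. there is
    nothing valid left to deform/interpolate for the next multipass level."""
    if np.all(flags):
        warnings.warn(f"{pair_name}: all vectors invalidated after validation "
                      "(degenerate pair, e.g. an 'empty' background-subtracted image); "
                      "skipping this pair.")
        return True
    return False
```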

nepomnyi commented 1 month ago

Have you checked that they were all actually ON? If so, the fix is easy: an if statement that catches this at the current step of the multipass and, if True, keeps the result of the previous step, gives the user a warning, and moves on to the next image pair without stopping the execution. Although in my recent cases I am pretty sure the culprit is RectBivariateSpline().

alexlib commented 1 month ago

> Have you checked that they were all actually ON? If so, the fix is easy: an if statement that catches this at the current step of the multipass and, if True, keeps the result of the previous step, gives the user a warning, and moves on to the next image pair without stopping the execution. Although in my recent cases I am pretty sure the culprit is RectBivariateSpline().

I use a debugger; it stops on that specific line. I guess there's also the possibility that two different errors occur in different situations. I didn't try all the options - first I need to solve this one.

ErichZimmer commented 1 month ago

Hi all, sorry for not jumping in earlier.

RectBivariateSpline seems to be the cause of this issue, as others have stated, and it may stem from how RectBivariateSpline operates. For the splines to properly form and evaluate, there needs to be a minimum of a 3x3 regular grid of finite values for low interpolation orders (3 and lower) and a 5x5 regular grid of finite values for higher interpolation orders (4 and up). Another user posted a similar issue some time ago, caused by not enough interrogation windows in the grid, which emitted an error when evaluating the spline (#228). I wish I could help further, but I will be quite busy until February, which may delay further responses and maintenance.
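The grid-size requirement is easy to check in isolation (a standalone scipy example, not OpenPIV code): with the default cubic splines, RectBivariateSpline rejects a grid that is too small, which is what happens when too few valid interrogation windows remain:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

x5 = y5 = np.arange(5.0)
RectBivariateSpline(x5, y5, np.random.rand(5, 5))      # fine: enough points for the default kx = ky = 3

x3 = y3 = np.arange(3.0)
try:
    RectBivariateSpline(x3, y3, np.random.rand(3, 3))  # too few points for a cubic spline
except Exception as err:
    print("RectBivariateSpline rejected the small grid:", err)
```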

alexlib commented 1 month ago

@nepomnyi, please re-check the scripts you've sent on your machine. I think there is no "special bug" or "memory leak"; the script fails because somewhat different conditions in the background-subtracted images lead to a region with a lot of NaNs. This region should then be covered by replace_outliers, which does inpainting of the NaNs, but in some cases it fails to cover everything. Then we indeed get NaNs inside BivariateSpline, everything becomes NaN, and we're doomed.

I suggest checking whether this change in settings solves the problem:

windefSettings.max_filter_iteration = 5 # was 3

It does solve the problem for me, and all the scripts run well.
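For context, max_filter_iteration is one of the windef settings controlling the NaN/outlier inpainting between passes. Roughly (attribute names as in recent openpiv-python releases, windef.Settings() in older ones; the kernel size value here is just an example):

```python
from openpiv import windef

settings = windef.PIVSettings()        # windef.Settings() in older releases

settings.filter_method = 'localmean'   # how invalid vectors are inpainted
settings.max_filter_iteration = 5      # was 3; more passes let the inpainting
                                       # reach the centre of a large NaN region
settings.filter_kernel_size = 2        # neighbourhood used by the local mean
```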

There is also one problem that I think is not related to openpiv but rather to the way pairs of files are chosen. On my Linux machine I had to replace the line islice(pairwise(... because it gave me the files in unsorted order, so the pairs were not the a,b frames. I use a very simple approach:

from natsort import natsorted

# rawPIV is the pathlib.Path of the folder with the raw images
image_list = rawPIV.glob('*.jpg')
image_list = natsorted(image_list, key=str)   # natural sort so frame a comes right before frame b

# print(image_list)

# Create non-overlapping (a, b) pairs of files from the sorted image_list
pairs = [(image_list[i], image_list[i + 1]) for i in range(0, len(image_list) - 1, 2)]
print(pairs)

# Or convert the list of paths to a list of strings, (naturally) sort it, then convert back to paths:
# image_list = [Path(p) for p in natsorted([str(p) for p in image_list])]

# for pair in islice(pairwise(rawPIV.iterdir()), None, None, 2):
for pair in pairs:
    print(str(pair[0]), str(pair[1]))

alexlib commented 1 month ago

If this indeed solves the problem, we still need to think about how to protect our users from failing in this case - I think a warning would be appropriate, but we need to find out how and where to set it correctly.

nepomnyi commented 1 month ago

Thank you, Dr. Liberzon. I will increase the number of iterations to 5 and report how it goes. So far, I have "solved" the problem by increasing the threshold of the median filter from 1.1 to 1.3. Since I did that, I haven't experienced sudden execution interruptions. I think this goes in the same vein: loosening the median threshold significantly reduces the likelihood of getting all NaNs.
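For completeness, the validation knob I changed sits next to the inpainting one in the windef settings (assuming recent attribute names; the values mirror what I describe above):

```python
from openpiv import windef

settings = windef.PIVSettings()   # windef.Settings() in older releases

settings.median_threshold = 1.3   # was 1.1; a looser median test flags fewer vectors,
                                  # so an all-invalid field after validation is less likely
settings.median_size = 1          # size parameter of the local median test (1 -> 3x3 neighbourhood)
```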

nepomnyi commented 1 month ago

@alexlib, does it matter if I check your suggestion windefSettings.max_filter_iteration = 5 # was 3 on a different failing image pair? You see, after I introduced the changes I mentioned above (the loosened median-filter threshold), my code had been running on two different computers simultaneously for about 2 weeks without any interruptions. And just now, the code stopped on one of the computers without an error message again. For that failing pair, I tried your suggestion windefSettings.max_filter_iteration = 5 # was 3 and it worked! So I can confirm: it is a possible solution. If you do, in fact, want me to check it on the image pairs I sent you, I can do that (but later).