Closed adamltyson closed 1 year ago
More details - https://forum.image.sc/t/ram-overload-crashing-with-brokenpipeerror-errno-32-broken-pipe/74959?
The new rc (0.3.1rc1) in combination with setting --n-free-cpus
to a low number helps the issue, but doesn't fix it. cellfinder-core
is still using more memory than necessary.
The issue occurs on Ubuntu 22.04, but not on macOS (with M1). Windows also seems to be OK. This may be due to the multiprocessing start method: spawn is, I think, the default on Windows and macOS, but Linux defaults to fork. Would changing the start method on Linux work OK?
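For reference, the platform default start method can be checked, and a context can pin the method explicitly regardless of platform. A minimal sketch (not cellfinder code):

```python
import multiprocessing

# Platform default: "fork" on Linux, "spawn" on Windows and macOS (Python 3.8+)
print(multiprocessing.get_start_method())

# A context pins the start method explicitly, independent of the platform default
ctx = multiprocessing.get_context("spawn")
print(ctx.get_start_method())  # -> "spawn"
```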
https://github.com/pythonprofilers/memory_profiler might be helpful for debugging this
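If memory_profiler isn't to hand, the standard library's tracemalloc gives a quick first picture of allocations. A minimal sketch, where `load_planes` is a hypothetical stand-in for the plane-loading step:

```python
import tracemalloc

def load_planes(n_planes, plane_bytes=512 * 512 * 2):
    # Hypothetical stand-in: load n_planes 16-bit 512x512 image planes into memory
    return [bytearray(plane_bytes) for _ in range(n_planes)]

tracemalloc.start()
planes = load_planes(10)  # ~5.2 MB total
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```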
My lab recently started working with whole-brain lightsheet data, and we observed the same issue on three separate Windows machines with clean cellfinder installs. I did some digging around and I think I have figured out what is causing the increased memory use.
tl;dr cellfinder-core 0.3.1 is loading every plane into memory in the background.
In detect.main#131, we create
If we tell cellfinder to run detection on only a small number of planes (e.g., 50), we can observe all planes being loaded into memory in the background. Eventually the producer threads finish all the tasks in the queue and go idle (they persist until the consumer task is done, because the consumer task is initiated in the context of the pool). From there, memory gradually declines as results are consumed from the queue.
Similarly, if we use only a single producer thread, memory use remains constant, as the consumer immediately eats every plane as soon as the producer finishes processing it. Interestingly, the estimated time to completion with a single producer is substantially lower (6 vs. 11 hours). We noticed that with many producers (~22), they spend most of their time idle. This is likely due to the parent thread being unable to do scheduling while busy with the consumer task. With only a single producer there may be less multiprocessing overhead, so the parent thread can spend more time running the consumer task.
On the subject of runtime: is the consumer task single-threaded because it consumes _ball_zsize planes at a time? Would it be possible to multithread the 3D filter by giving each child process _ball_zsize planes, processing them, and then doing the 3D filter in the child process?
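For illustration, handing each child an overlapping chunk of planes might look like the sketch below. `overlapping_chunks` is a hypothetical helper (assumes chunk > overlap); neighbouring chunks must share planes so the 3D filter has context across chunk boundaries:

```python
def overlapping_chunks(n_planes, chunk, overlap):
    # Yield (start, stop) index pairs covering all planes, with `overlap`
    # planes shared between neighbours for 3D-filter boundary context
    start = 0
    while start < n_planes:
        stop = min(start + chunk, n_planes)
        yield start, stop
        if stop == n_planes:
            break
        start = stop - overlap

print(list(overlapping_chunks(10, 4, 1)))  # [(0, 4), (3, 7), (6, 10)]
```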
The behavior we are seeing is actually more similar to: https://github.com/brainglobe/cellfinder-core/issues/52
Thanks for the clear report! I haven't checked carefully myself, but the 2D filter running faster than the 3D filter and filling up memory seems like a very likely cause. I think the solution there is to limit the maximum number of planes loaded into memory, by forcing the 2D jobs to wait for the 3D filter to consume the planes.
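The back-pressure idea can be sketched with a bounded queue: `queue.Queue(maxsize=...)` makes `put()` block once the buffer is full, so the 2D side can never run more than a fixed number of planes ahead of the 3D side. A toy sketch with threads and string stand-ins, not the actual cellfinder code:

```python
import queue
import threading

MAX_PLANES = 8  # hard cap on planes buffered in memory at once

def producer(plane_ids, buffer):
    for pid in plane_ids:
        plane = f"filtered-plane-{pid}"  # stand-in for the 2D filter
        buffer.put(plane)                # blocks while the buffer is full

def consumer(buffer, n_planes, results):
    for _ in range(n_planes):
        results.append(buffer.get())     # stand-in for the 3D filter

buffer = queue.Queue(maxsize=MAX_PLANES)  # bounded: producer waits when full
results = []
t_prod = threading.Thread(target=producer, args=(range(100), buffer))
t_cons = threading.Thread(target=consumer, args=(buffer, 100, results))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(len(results))  # 100
```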
I had a quick play around yesterday afternoon, and got in a bit of a tangle. I think it's going to require a few of us in the Brainglobe team to sit down and work out the best way to solve this, so there probably won't be a quick fix (sorry 😢 ).
re. the consumer task, I think it might be possible to multithread it, but potentially at the cost of increasing memory consumption. We can have a think about that too though.
I started prototyping a potential solution today, based on my understanding of how the current code operates (please correct any misunderstandings). The big change is to cache a constant number of planes (max_planes) and pass _ball_zsize planes to the 3D filter. We use the same worker pool for both tasks, so we need to cache _n_processes * ball_zsize planes. The current solution does duplicate some effort, since each caching step needs to reload some of the planes already processed to avoid boundary issues. This is probably fixable, but I wanted to focus on getting something working first.
This prototype addresses the memory use issue by ensuring a constant number of planes occupy memory at any one time (configurable through how many processes we assign to the task). It also addresses the issue of the parent thread being unable to perform scheduling for children, by offloading the 3D filter to child tasks. Lastly, it should provide a dramatic speed improvement by allowing multiple threads to perform the 3D filtering step in parallel.
I'm only showing the changes to detect.main because the only other change is to volume_filter.py, converting it from reading from a queue to reading from a list. This code seems to correctly spawn threads and process planes, and memory use seems reasonable. I haven't been able to verify the output because, for some reason, Python isn't able to pickle something while trying to run results.get(). Posting here to get feedback on the solution and to see if anyone has insight into the pickling problem.
```python
mp_ctx = multiprocessing.get_context("spawn")
n_processes = 4  # this will depend on the memory capacity of the system
max_planes = n_processes * ball_z_size

with mp_ctx.Pool(n_processes) as worker_pool:
    planes = [None] * max_planes
    results = [None] * max_planes
    to_return = []
    counter = 0
    while counter < len(signal_array):
        counter = max(0, counter - ball_z_size)  # sets counter back ball_z_size planes
        # where len(signal_array) % max_planes != 0, avoid accessing invalid indices
        end = min(max_planes, len(signal_array) - counter)
        for i in range(0, end):
            planes[i] = worker_pool.apply_async(
                mp_tile_processor.get_tile_mask, args=(signal_array[counter],)
            )
            counter += 1
        planes = [plane.get() for plane in planes]  # blocks until all results are ready
        for i in range(0, max_planes - ball_z_size):
            results[i] = worker_pool.apply_async(
                mp_3d_filter.process,
                # note the trailing comma: args must be a tuple, so the plane
                # slice is passed as a single argument
                args=(planes[i : i + ball_z_size],),
                callback=callback,
            )
        for result in results:
            # blocks until all results are ready; will need to combine piecewise
            to_return.append(result.get())
```
EDIT: include pickling error
File "C:\Users\
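On the pickling error mentioned above: with a spawn context, the target function and every argument handed to apply_async must be picklable, and a common culprit is an object holding something unpicklable (a lock, an open file, a queue) in an attribute. A quick, hypothetical diagnostic:

```python
import pickle
import threading

def check_picklable(obj, name):
    # Try to pickle each candidate before handing it to the pool
    try:
        pickle.dumps(obj)
        return True
    except Exception as exc:
        print(f"{name} is not picklable: {exc}")
        return False

check_picklable(42, "int")                   # True
check_picklable(threading.Lock(), "a lock")  # False: cannot pickle '_thread.lock'
```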
Thanks for the reply and the code, and sorry for the slow reply from our end. You might have seen elsewhere in the repo that we decided to embark on a re-write of the cell detection code from Cython to pure Python, accelerated using Numba. One of the motivations for this is that we're hoping to port the multiprocessing implementation to dask, which will allow much more flexible control of how the multiprocessing is done (including limiting memory use) while pushing the maintenance burden of the multiprocessing code upstream to dask.
We've just finished the Cython-to-Python re-write, so I'd hazard a guess at ~two months for us to finish converting the multiprocessing to dask and putting out a new release.
@n-garc I think I (somewhat spontaneously!) came up with a simple solution to the issue at https://github.com/brainglobe/cellfinder-core/pull/139 - would you be able to test the code in that PR out and see if it fixes your memory issues?
I will try it out and see if that fixes it.
@dstansby I apologize for the delay, but I finally got around to testing #139. It does appear to solve the memory issue, but it also seems to have introduced a new bug. Unless I implemented your changes incorrectly, only one thread is active at a time using the updated code. Below you can see the 20+ threads cellfinder spawns, and only 1 child thread (+ the parent thread) is using any CPU. The active thread hops around each time a new plane is loaded, but there is never more than one thread active. As I noted previously, there isn't much point processing planes faster than the parent thread can consume them (which is currently single-threaded), so this behavior might be ok. A simpler way to achieve it would be to just set n_processes=1.
Thanks for the testing, I managed to reproduce the same issue locally. To fix that I've updated https://github.com/brainglobe/cellfinder-core/pull/139 so:

- `n_procs + ball_z_size` planes are unlocked for 2D filtering
- `n_procs` planes ahead of where the 3D filter is at are unlocked for 2D filtering

This should mean that:

- `n_procs + ball_z_size` planes are loaded into memory
- `n_procs` is utilised to do the 2D filtering

I can't reproduce your original case, which I think was 3. above (2D filtering faster than 3D filtering), so could you give the current code on https://github.com/brainglobe/cellfinder-core/pull/139 another go to see if it fixes your original issue?
@dstansby This does appear to fix memory use without breaking threading. Once the n_procs + ball_z_size planes are loaded, only about 1-2 threads are needed to keep the queue full. I don't have the documentation handy, but is n_processes configurable for the detection step? It could further help with memory management to not need to keep many threads active (I have a 24 core machine), since there is up to a 2.2GB overhead for each thread (on my machine).
@n-garc `n_processes` isn't directly configurable, but `n_free_cpus` is, so the number of CPU cores used is based on the number your system has, minus a configurable number to spare. cellfinder should also make sure the RAM of the machine isn't exceeded (using imlib), but this relies on the rest of the detection working properly.
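Roughly, the relationship between the two settings looks like this (a sketch of the idea, not imlib's actual code):

```python
import multiprocessing

def n_processes_from_free_cpus(n_free_cpus):
    # Use every core except a configurable number left free for the system
    total = multiprocessing.cpu_count()
    return max(1, total - n_free_cpus)

# e.g. on a 24-core machine, n_free_cpus=2 would give 22 worker processes
print(n_processes_from_free_cpus(2))
```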
There have been a few reports of cellfinder requiring far too much RAM (100GB+). This seems to be new and should be investigated.