brainglobe / cellfinder-core

Standalone cellfinder cell detection algorithm
https://brainglobe.info/documentation/cellfinder/index.html
BSD 3-Clause "New" or "Revised" License

Investigate memory usage #73

Closed adamltyson closed 1 year ago

adamltyson commented 1 year ago

There have been a few reports of cellfinder requiring far too much RAM (100GB+). This seems to be new and should be investigated.

adamltyson commented 1 year ago

More details - https://forum.image.sc/t/ram-overload-crashing-with-brokenpipeerror-errno-32-broken-pipe/74959?

The new rc (0.3.1rc1) in combination with setting --n-free-cpus to a low number helps the issue, but doesn't fix it. cellfinder-core is still using more memory than necessary.

adamltyson commented 1 year ago

The issue occurs on Ubuntu 22.04, but not on macOS (with M1).

adamltyson commented 1 year ago

Windows also seems to be OK. This may be due to the multiprocessing start method: spawn is (I think) the default on Windows and macOS, but Linux defaults to fork. Would changing the start method on Linux work OK?
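For reference, a minimal, hypothetical sketch (not cellfinder's actual code) of forcing the spawn start method on Linux so it matches the Windows/macOS default:

    import multiprocessing

    def square(x):
        return x * x

    if __name__ == "__main__":
        # explicitly request "spawn" rather than relying on the platform default
        # (spawn on Windows/macOS, fork on Linux)
        ctx = multiprocessing.get_context("spawn")
        with ctx.Pool(processes=4) as pool:
            print(pool.map(square, range(8)))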

dstansby commented 1 year ago

https://github.com/pythonprofilers/memory_profiler might be helpful for debugging this
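For example, memory_profiler's memory_usage helper can sample a whole run over time (run_detection below is a hypothetical stand-in for the cellfinder call being profiled):

    from memory_profiler import memory_usage

    def run_detection():
        # stand-in for the cellfinder detection call being profiled
        return sum(range(10_000_000))

    if __name__ == "__main__":
        # sample this process (and its children) every 0.5 s while it runs
        samples = memory_usage((run_detection, (), {}), interval=0.5,
                               include_children=True)
        print(f"peak memory: {max(samples):.1f} MiB")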

n-garc commented 1 year ago

My lab recently started working with whole-brain lightsheet data, and we observed the same issue on three separate Windows machines with clean cellfinder installs. I did some digging around and I think I have figured out what is causing the increased memory use.

tl;dr cellfinder-core 0.3.1 is loading every plane into memory in the background.

In detect.main#131, we create producer threads. Using apply_async, which returns an async result object immediately, we build a queue containing n_planes pending tasks. The for loop exits, and we start our consumer task (mp_3d_filter.process), which runs in the parent thread. In the background, our pool of producers begins chewing through the tasks in the queue. Each time a producer processes a plane, a dask array is converted to a NumPy array, which requires loading that plane into memory. The producer threads can produce planes faster than the consumer can consume them, so NumPy arrays build up in memory.

[Figure: memuse_many_planes (memory climbing as planes accumulate)]
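A stripped-down sketch of that pattern (simplified, not the actual detect.main code) shows why memory grows: every plane is submitted up front, and finished-but-unconsumed results accumulate while the serial consumer catches up.

    import multiprocessing
    import numpy as np

    def load_and_filter_2d(i):
        # stand-in for the 2D tile filter: materialises a whole plane in memory
        return np.zeros((2048, 2048), dtype=np.float32)

    def filter_3d(plane):
        # stand-in for the slower 3D filter running in the parent thread
        return float(plane.max())

    if __name__ == "__main__":
        with multiprocessing.Pool(8) as pool:
            # all planes are queued immediately, so the workers race ahead...
            queue = [pool.apply_async(load_and_filter_2d, (i,)) for i in range(1000)]
            # ...while the single consumer drains results one at a time;
            # completed planes pile up in memory between these two loops
            for async_result in queue:
                filter_3d(async_result.get())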

If we tell cellfinder to run detection on only a small number of planes (e.g., 50), we can observe all planes being loaded into memory in the background. Eventually the producer threads finish all the tasks in the queue and go idle (they persist until the consumer task is done, because the consumer task is initiated in the context of the pool). From there we see memory gradually decline as results from the queue are consumed.

[Figure: memuse_few_planes (memory profile when detecting ~50 planes)]

Similarly, if we use only a single producer thread, memory use remains constant, as the consumer immediately eats every plane as soon as the producer finishes processing it. Interestingly, the estimated ETA with a single producer is substantially lower (6 vs 11 hours). We noticed that with many producers (~22), they spend most of their time idle. This is likely because the parent thread cannot do scheduling while it is busy with the consumer task. With only a single producer there may be less multiprocessing overhead, so the parent thread can spend more time running the consumer task.

[Figure: memuse_single_producer (memory profile with a single producer thread)]

On the subject of runtime: is the consumer task single-threaded because it consumes _ball_zsize planes at a time? Would it be possible to multithread the 3D filter by giving each child process _ball_zsize planes, processing them, and then doing the 3D filter in the child process?

n-garc commented 1 year ago

The behavior we are seeing is actually more similar to: https://github.com/brainglobe/cellfinder-core/issues/52

dstansby commented 1 year ago

Thanks for the clear report! I haven't checked carefully myself, but the 2D filter running faster than the 3D filter and filling up memory seems like a very likely cause. I think the solution there is to limit the maximum number of planes loaded into memory, by forcing the 2D jobs to wait for the 3D filter to consume the planes.
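One way to do that (a sketch of the general idea only, not the eventual implementation) is to cap the number of outstanding 2D jobs and make the submission loop block until the 3D filter has consumed the oldest plane:

    import collections
    import multiprocessing

    import numpy as np

    MAX_IN_FLIGHT = 8  # rough cap on planes held in memory at once

    def filter_2d(i):
        # stand-in for the per-plane 2D filter
        return np.zeros((2048, 2048), dtype=np.float32)

    def filter_3d(plane):
        # stand-in for the serial 3D filter consuming planes in order
        return float(plane.max())

    if __name__ == "__main__":
        pending = collections.deque()
        with multiprocessing.Pool(4) as pool:
            for i in range(1000):
                if len(pending) >= MAX_IN_FLIGHT:
                    # block until the oldest plane is consumed before loading more
                    filter_3d(pending.popleft().get())
                pending.append(pool.apply_async(filter_2d, (i,)))
            while pending:
                filter_3d(pending.popleft().get())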

I had a quick play around yesterday afternoon, and got in a bit of a tangle. I think it's going to require a few of us in the Brainglobe team to sit down and work out the best way to solve this, so there probably won't be a quick fix (sorry 😢 ).

re. the consumer task, I think it might be possible to multithread it, but potentially at the cost of increasing memory consumption. We can have a think about that too though.

n-garc commented 1 year ago

I started prototyping a potential solution today, based on my understanding of how the current code operates (please correct any misunderstandings). The big change is to cache a constant number of planes (max_planes) and pass _ball_zsize planes at a time to the 3D filter. We use the same worker pool for both tasks, so we need to cache _n_processes * ball_zsize planes. The current solution does duplicate some effort, since each caching step needs to reload some already-processed planes to avoid boundary issues. This is probably fixable, but I wanted to focus on getting something working first.

This prototype addresses the memory-use issue by ensuring that a constant number of planes occupy memory at any one time (configurable through how many processes we assign to the task). It also addresses the issue of the parent thread being unable to perform scheduling for its children, by offloading the 3D filter to child tasks. Lastly, it should provide a dramatic speed improvement by allowing multiple threads to perform the 3D filtering step in parallel.

I'm only showing the changes to detect.main, because the only other change is to volume_filter.py, converting it from reading from a queue to reading from a list. This code seems to spawn threads and process planes correctly, and memory use seems reasonable. I haven't been able to verify the output, because for some reason Python isn't able to pickle something when running result.get(). Posting here to get feedback on the solution and to see if anyone has insight into the pickling problem.

    # Prototype changes inside detect.main: process planes in batches so that at
    # most max_planes planes are held in memory at once.
    mp_ctx = multiprocessing.get_context("spawn")
    n_processes = 4  # this will depend on the memory capacity of the system
    max_planes = n_processes * ball_z_size
    with mp_ctx.Pool(n_processes) as worker_pool:
        planes = [None] * max_planes
        results = [None] * max_planes
        to_return = []

        counter = 0
        while counter < len(signal_array):
            # step back ball_z_size planes so consecutive batches overlap and
            # the 3D filter sees the planes either side of a batch boundary
            counter = max(0, counter - ball_z_size)
            # when len(signal_array) % max_planes != 0, avoid reading past the end
            end = min(max_planes, len(signal_array) - counter)
            for i in range(end):
                # 2D filter: one async task per plane, at most max_planes at a time
                planes[i] = worker_pool.apply_async(
                    mp_tile_processor.get_tile_mask, args=(signal_array[counter],)
                )
                counter += 1

            # blocks until the whole batch of 2D results is ready
            planes = [plane.get() for plane in planes[:end]]
            for i in range(max_planes - ball_z_size):
                # 3D filter: each child gets a window of ball_z_size planes
                # (note the trailing comma: the slice is passed as one argument)
                results[i] = worker_pool.apply_async(
                    mp_3d_filter.process,
                    args=(planes[i : i + ball_z_size],),
                    callback=callback,
                )

            for result in results:
                # blocks until all results are ready; the piecewise results
                # still need to be combined afterwards
                if result is not None:
                    to_return.append(result.get())
EDIT: include pickling error

File "C:\Users\\anaconda3\envs\iblenv\lib\site-packages\cellfinder_core\detect\detect.py", line 298, in main to_return.append(result.get()) File "C:\Users\\anaconda3\envs\iblenv\lib\multiprocessing\pool.py", line 771, in get raise self._value File "C:\Users\\anaconda3\envs\iblenv\lib\multiprocessing\pool.py", line 537, in _handle_tasks put(task) File "C:\Users\\anaconda3\envs\iblenv\lib\multiprocessing\connection.py", line 206, in send self._send_bytes(_ForkingPickler.dumps(obj)) File "C:\Users\\anaconda3\envs\iblenv\lib\multiprocessing\reduction.py", line 51, in dumps cls(buf, protocol).dump(obj) File "stringsource", line 2, in View.MemoryView._memoryviewslice.__r

dstansby commented 1 year ago

Thanks for the reply and the code, and sorry for the slow reply from our end. You might have seen elsewhere in the repo that we decided to embark on a re-write of the cell detection code from Cython to pure Python, accelerated using Numba. One of the motivations for this is that we're hoping to port the multiprocessing implementation to dask, which will allow much more flexible control of how the multiprocessing is done (including limiting memory use), while pushing the maintenance burden of the multiprocessing code upstream to dask.
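As a rough illustration of what that buys (an assumed sketch, not the planned cellfinder implementation), dask can schedule a per-plane filter chunk by chunk, with num_workers capping how many planes are filtered concurrently:

    import dask.array as da
    import numpy as np

    def filter_plane_2d(block):
        # stand-in for the real 2D filter, applied to one plane-sized chunk
        return block

    if __name__ == "__main__":
        # one chunk per plane, so dask schedules the filter plane by plane
        signal = da.zeros((100, 512, 512), dtype=np.float32, chunks=(1, 512, 512))
        filtered = signal.map_blocks(filter_plane_2d)
        # num_workers caps how many planes are processed concurrently
        result = filtered.compute(scheduler="processes", num_workers=4)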

We've just finished the Cython > Python re-write, so I'd hazard a guess at ~two months for us to finish converting the multiprocessing to dask and put out a new release.

dstansby commented 1 year ago

@n-garc I think I (somewhat spontaneously!) came up with a simple solution to the issue at https://github.com/brainglobe/cellfinder-core/pull/139 - would you be able to test the code in that PR out and see if it fixes your memory issues?

n-garc commented 1 year ago


I will try it out and see if that fixes it.

n-garc commented 1 year ago

@dstansby I apologize for the delay, but I finally got around to testing #139. It does appear to solve the memory issue, but it also seems to have introduced a new bug. Unless I implemented your changes incorrectly, only one thread is active at a time with the updated code. Below you can see the 20+ threads cellfinder spawns; only one child thread (plus the parent thread) is using any CPU. The active thread hops around each time a new plane is loaded, but there is never more than one thread active. As I noted previously, there isn't much point processing planes faster than the parent thread can consume them (the consumer is currently single-threaded), so this behavior might be OK. A simpler way to achieve it, though, would be to just set n_processes=1.

[Figure: memory_use_139 (memory profile with PR #139)]

[Figure: cellfinder_cpu_use (per-thread CPU use during detection)]

dstansby commented 1 year ago

Thanks for the testing! I managed to reproduce the same issue locally, and have updated https://github.com/brainglobe/cellfinder-core/pull/139 to fix it.

This should mean that:

  1. At any one time, a maximum of n_procs + ball_z_size planes are loaded into memory
  2. If 3D filtering is faster than 2D filtering, the full parallelism of n_procs is utilised to do the 2D filtering
  3. If 2D filtering is faster than 3D filtering, the number of planes in memory is still limited

I can't reproduce your original case, which I think was 3. above (2D filtering faster than 3D filtering), so could you give the current code on https://github.com/brainglobe/cellfinder-core/pull/139 another go to see if it fixes your original issue?

n-garc commented 1 year ago

@dstansby This does appear to fix memory use without breaking threading. Once the n_procs + ball_z_size planes are loaded, only about 1-2 threads are needed to keep the queue full. I don't have the documentation handy, but is n_processes configurable for the detection step? Not having to keep many threads active would further help with memory management (I have a 24-core machine), since there is up to 2.2 GB of overhead for each thread on my machine.

adamltyson commented 1 year ago

@n-garc n_processes isn't directly configurable, but n_free_cpus is, so the number of CPU cores used is the number your system has, minus a configurable number to leave free. cellfinder should also make sure the RAM of the machine isn't exceeded (using imlib), but this relies on the rest of the detection working properly.
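For example (an illustrative call only; the argument names are assumptions to check against the current cellfinder-core docs), the Python API exposes the same knob:

    import numpy as np
    from cellfinder_core.main import main as run_cellfinder

    if __name__ == "__main__":
        # tiny random arrays just for illustration; real use would load imaging data
        signal = np.random.random((10, 100, 100))
        background = np.random.random((10, 100, 100))
        # on a 24-core machine, n_free_cpus=20 leaves roughly 4 cores for detection
        cells = run_cellfinder(signal, background, voxel_sizes=(5, 2, 2),
                               n_free_cpus=20)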