flatironinstitute / CaImAn

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization.
https://caiman.readthedocs.io
GNU General Public License v2.0

Possible issue: clustering hurts motion correction #1220

Closed: pgunn closed this issue 1 year ago

pgunn commented 1 year ago

While testing a recent PR reorganising our clustering code, I ran into an unusual performance difference in demo_pipeline.py between 'single' and 'multiprocessing' mode. Most of the expensive functions in that file benefit from clustering (as expected), but motion correction was hurt by it.

There's a mystery here, and we should figure out which of these it is:

1) Is this just an oddity with that sample data, or with data of that size?
2) Does motion correction intrinsically not benefit from clustering? Maybe the overhead of clustering outweighs its benefits for the algorithms involved, or perhaps it's something more fixable, like variant codepaths and how they handle files in that case.

If it's the first, there's probably little to do except inform users that they may want to try motion correction both with and without clustering.

If it's the second, we should look into either always ignoring clustering for motion correction, or seeing whether there are obvious mistakes in how the motion correction code uses the cluster that could be fixed to get better performance.

Another data point is that even when dview is None (mode = single), motion correction still slammed all my CPUs. Unclear why.

Timings (seconds):

                        Single     Multiprocessing
Motion Correction        86.40              125.87
CNMF fit                 14.31                8.31
CNMF refit               77.25               12.06
CNMF component eval       1.67                1.09
CNMF component detrend    1.39                1.37

EricThomson commented 1 year ago

Pretty standard outputs for me on the demo pipeline are below.

Cluster with 15/16 cores being used:

CPU times: total: 641 ms
Wall time: 24.9 s

Setting dview=None:

CPU times: total: 6min 8s
Wall time: 54.9 s

Not amazing speedup (in terms of wall time).

In general, our motion correction could be much faster. OpenCV now has decent CUDA support with Python that we could leverage at multiple steps, for instance in the initial template extraction for rigid motion correction. Once we switch over to torch, we could probably do the FFT and its inverse (which implement the cross-correlation for motion correction) WAY faster.
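
For a sense of what that kernel is, a rigid shift can be estimated with an FFT-based cross-correlation along these lines (a rough sketch only, not Caiman's implementation; the function name and interface are made up, and torch would let the same few lines run on the GPU):

import torch

def estimate_rigid_shift(template, frame):
    # Circular cross-correlation via the convolution theorem:
    # ifft(fft(frame) * conj(fft(template))) peaks at the displacement.
    corr = torch.fft.ifft2(torch.fft.fft2(frame) * torch.conj(torch.fft.fft2(template))).real
    peak = int(torch.argmax(corr))    # argmax over the flattened correlation map
    dy, dx = divmod(peak, corr.shape[1])
    # Peaks past half the image size wrap around to negative shifts
    if dy > corr.shape[0] // 2:
        dy -= corr.shape[0]
    if dx > corr.shape[1] // 2:
        dx -= corr.shape[1]
    return dy, dx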

pgunn commented 1 year ago

Is that running in the notebook using a timer on motion correction, or something else?

EricThomson commented 1 year ago

Yes, mc.motion_correct(), where the only initialization difference is dview=cluster vs dview=None.
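
For reference, the comparison is just the demo's setup with a different dview handed to MotionCorrect, roughly like this (a sketch following the demo notebooks; fnames and opts are assumed to be defined earlier in the pipeline):

from caiman.motion_correction import MotionCorrect

# dview is the handle returned by setup_cluster() in the cluster run, or None in single mode
mc = MotionCorrect(fnames, dview=dview, **opts.get_group('motion'))
mc.motion_correct(save_movie=True)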

pgunn commented 1 year ago

As you're on Windows (where some of the backend libraries are very different and parallelisation also works very differently) and also in Jupyter, we need more data points.

My original data points were from Linux using the CLI demos. I just modified the demo_pipeline notebook to include the timing code and tested it across multiprocessing and single (still on my beefy Linux workstation), and got the following results:

Single in Jupyter:

Multiprocessing in Jupyter:

So at least on Linux, multiprocessing in Jupyter still hurts motion correction (and helps or is neutral on everything else in the notebook), at least with this dataset and these parameters. As expected, Jupyter itself significantly hurts Caiman's performance (this is not a mystery).

It's really interesting that you're seeing motion correction behave better in parallel than in single mode on Windows. We might want to check to see if we're testing the same way, because that 641ms seems a little suspicious to me.

pgunn commented 1 year ago

Here's the timing helper I used:
import contextlib
import time

# Small helper: a context manager (and decorator) that prints the elapsed
# wall-clock time for the block it wraps, prefixed with a label.
class caitimer(contextlib.ContextDecorator):
    def __init__(self, msg):
        self.message = msg
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, type, value, traceback):
        print(f"{self.message}: {time.time() - self.start}")

I then added this block to the start of "Setup a cluster":

#backend = "multiprocessing"
backend = "single"
#backend = "ipyparallel"

and modified both instances (remember there are two) of setup_cluster() to look at backend rather than the hardcoded name (maybe we should change this in all the notebooks).
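
For illustration, the modified call would look something like this (a sketch; the keyword and return values follow the demo notebooks, with cm being the usual import caiman as cm):

# Use the backend variable above rather than a hardcoded backend string;
# setup_cluster returns the client, the dview handle, and the process count.
c, dview, n_processes = cm.cluster.setup_cluster(backend=backend, n_processes=None)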

Finally, I wrap things I'm interested in with code like this:

with caitimer("Motion Correction"):
    mc.motion_correct(save_movie=True)

Or I could just send you a notebook.

EricThomson commented 1 year ago

> that 641ms seems a little suspicious to me.

I am not buying it; I basically believe the wall time. I just throw the %%time cell magic into the cell where I run the motion correction fit.
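
Concretely, that's just the cell magic at the top of the motion correction cell:

%%time
mc.motion_correct()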

pgunn commented 1 year ago

False alarm on all of this. I was testing that other diff with a pip install from the dev branch, which doesn't set our recommended environment variables for you, and I forgot to set them manually. That caused bad performance overall: a ridiculous number of processes was spawned even in single mode, and multiprocessing mode created enough of them to limit efficiency.

After doing:

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export OMP_NUM_THREADS=1

I get much more reasonable results (better numbers too).

Single: Motion Correction: 50.08440089225769

Multiprocessing: Motion Correction: 5.482891321182251

I need to remember to do this when testing code.
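
For anyone hitting the same thing from a notebook or script where shell exports are awkward, a rough equivalent is to set the same variables from Python before the numerical libraries are imported (a sketch; this is just standard os.environ usage, not a Caiman API):

import os

# Must run before numpy/caiman are imported, otherwise the BLAS/OpenMP
# thread pools are already sized and these settings have no effect.
for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"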