Pretty standard outputs for me on the demo pipeline are below.
Cluster with 15/16 cores being used:
CPU times: total: 641 ms
Wall time: 24.9 s

Setting dview=None:
CPU times: total: 6min 8s
Wall time: 54.9 s
Not amazing speedup (in terms of wall time).
In general, our motion correction could be much faster. OpenCV has decent cuda support with python now that we could leverage at multiple steps, for instance in the initial template extraction for rigid motion correction. Once we switch over to torch, we could probably do the fft and its inverse (which implements xcorr for motion correction) WAY faster.
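To make the second point concrete, here's a minimal sketch (not CaImAn code; the function name is made up) of estimating a rigid shift against a template via FFT-based cross-correlation, written with torch so it could run on the GPU:

import torch

def estimate_rigid_shift(frame, template):
    # frame, template: 2-D float tensors of the same shape
    F = torch.fft.fft2(frame)
    T = torch.fft.fft2(template)
    # Cross-correlation via the convolution theorem
    cc = torch.fft.fftshift(torch.fft.ifft2(F * torch.conj(T)).real)
    # The offset of the correlation peak from the center gives the integer shift
    peak = int(torch.argmax(cc))
    row, col = divmod(peak, cc.shape[1])
    return row - cc.shape[0] // 2, col - cc.shape[1] // 2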
Is that running in the notebook using a timer on motion correction, or something else?
Yes, mc.motion_correct(), where the initialization difference is dview=cluster vs dview=None.
As you're on Windows (where some of the backend libraries are very different and parallelisation is also very different) and also on Jupyter, we need more data points.
My original data points were from Linux using the CLI demos. I just modified the demo_pipeline notebook to include the timing code and tested it across multiprocessing and single (still on my beefy Linux workstation), and got the following results:
Single in Jupyter:
Multiprocessing in Jupyter:
So at least on Linux, multiprocessing in Jupyter still hurts motion correction (and helps or is neutral on everything else in the notebook), at least with this dataset and these parameters. As expected, Jupyter significantly hurts Caiman's performance (this is not a mystery).
It's really interesting that you're seeing motion correction behave better in parallel than in single mode on Windows. We might want to check to see if we're testing the same way, because that 641ms seems a little suspicious to me.
import contextlib
import time

# Small context manager / decorator for timing the stages we care about
class caitimer(contextlib.ContextDecorator):
    def __init__(self, msg):
        self.message = msg

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # Report elapsed wall-clock time for the wrapped block
        print(f"{self.message}: {time.time() - self.start}")
I then added this block to the start of "Setup a cluster":
#backend = "multiprocessing"
backend = "single"
#backend = "ipyparallel"
and modified both instances (remember there are two) of setup_cluster() to look at backend rather than the hardcoded name (maybe we should change this in all the notebooks).
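For reference, the modified call ends up looking roughly like this (a sketch; apart from backend, the exact arguments the notebook passes may differ a bit):

import caiman as cm

# Use the backend chosen above rather than a hardcoded string
c, dview, n_processes = cm.cluster.setup_cluster(backend=backend,
                                                  n_processes=None)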
Finally, I wrap things I'm interested in with code like this:
with caitimer("Motion Correction"):
    mc.motion_correct(save_movie=True)
Or I could just send you a notebook.
"that 641ms seems a little suspicious to me."
I am not buying it either; I basically believe the wall time. I just throw the cell magic %%time into the cell where I run the motion correction fit algo.
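For reference, that kind of cell looks roughly like this:

%%time
mc.motion_correct(save_movie=True)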
False alarm on all of this. I was testing that other diff with a pip install from the dev branch, which doesn't set our recommended env vars for you, and I forgot to set them manually. That caused bad performance overall: a ridiculous number of processes were spawned even in single mode, and multiprocessing mode created enough to limit efficiency.
After doing:
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export OMP_NUM_THREADS=1
I get much more reasonable results (better numbers too).
Single: Motion Correction: 50.08440089225769
Multiprocessing: Motion Correction: 5.482891321182251
I need to remember to do this when testing code.
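If it's easier in a notebook, the same thing can be done from Python, as long as it happens before numpy (or anything else that links MKL/OpenBLAS/OpenMP) is imported; a minimal sketch:

import os

# Must run before importing numpy/caiman for the thread limits to take effect
for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"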
While testing a recent PR reorganising our clustering code, I ran into an unusual performance difference in demo_pipeline.py between 'single' and 'multiprocessing' modes. Most of the expensive functions in that file benefit from clustering (as expected); motion correction was hurt by it.
There's a mystery here and we should figure out which of these it is:
1) Is this just an oddity with that sample data, or with data of that size?
2) Is motion correction something that intrinsically doesn't benefit from clustering? Maybe the overhead of clustering outweighs its benefits for the algorithms involved, or perhaps it's something more fixable, like variant codepaths and how motion correction handles files in that case.
If it's the first, there's probably little to do except perhaps inform users that they might try doing motion correction with or without clustering.
If it's the second, we should look into either always ignoring clustering for MC, or seeing if there are obvious mistakes in the motion correction code when clustering that can be fixed to get better performance.
Another data point is that even when dview is None (mode = single), motion correction still slammed all my CPUs. Unclear why.
Single:
  Motion Correction: 86.4007637500763
  CNMF fit: 14.313889026641846
  CNMF refit: 77.24931740760803
  CNMF component eval: 1.671086311340332
  CNMF component detrend: 1.3886075019836426

Multiprocessing:
  Motion Correction: 125.87143087387085
  CNMF fit: 8.309974431991577
  CNMF refit: 12.059678792953491
  CNMF component eval: 1.0934557914733887
  CNMF component detrend: 1.3722975254058838