SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

InvalidStateError while estimating sparsity #3486

Open JeffreyBoucher opened 1 month ago

JeffreyBoucher commented 1 month ago

Hello all!

I've been trying to run Kilosort 3 on a concatenated Neuropixels 2 dataset. Lately I've been running into an issue with create_sorting_analyzer while it is estimating sparsity. The code runs about 70-80% of the way through (not at any specific point), then I get an "InvalidStateError" exception. I assume this means that some aspect of my data doesn't work well with the sparsity-estimation algorithm, but I have no guess as to what that would be.

Here is an example of the exception:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/concurrent/futures/process.py", line 323, in run
    self.terminate_broken(cause)
  File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/concurrent/futures/process.py", line 458, in terminate_broken
    work_item.future.set_exception(bpe)
  File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/concurrent/futures/_base.py", line 549, in set_exception
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x2aaf7896ea00 state=cancelled>

Additionally, here are the inputs into create_sorting_analyzer:

we = si.create_sorting_analyzer(
    recording=rec,
    sorting=sorting,
    folder=outDir / 'sortings_folder',
    format="binary_folder",
    sparse=True,
)

Please help me if you can; I would be very grateful, as this has been confounding. Let me know if I can offer any additional information that would help!

Thanks,

Jeff Boucher

zm711 commented 1 month ago

I don't think any of us are using 3.9 at this point, so this is good to know. That said, our test suite does run on 3.9 and doesn't have a problem, so I'm not sure. I think we need to ping @alejoe91 and @samuelgarcia to take a look at this.

What n_jobs are you using? Could you try n_jobs=1 to confirm whether it is a multiprocessing issue?
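For reference, a minimal sketch of forcing single-process execution (assuming si is spikeinterface.full and that create_sorting_analyzer picks up the global job kwargs, as in recent versions):

import spikeinterface.full as si

# Force single-process execution for all chunked jobs, including the
# sparsity estimation done inside create_sorting_analyzer.
si.set_global_job_kwargs(n_jobs=1)

we = si.create_sorting_analyzer(
    recording=rec,
    sorting=sorting,
    folder=outDir / 'sortings_folder',
    format="binary_folder",
    sparse=True,
)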

JeffreyBoucher commented 1 month ago

Hello!

Thanks for the response! I am using n_jobs = -1, which should be 10 cores on the cluster I'm using. I'll try with n_jobs = 1 and get back to you!

zm711 commented 1 month ago

Thanks, let us know how it goes with n_jobs=1. Sometimes there can be issues with how a server shares resources, so we need to troubleshoot three things:

1) a Python 3.9 issue
2) a multiprocessing issue
3) a spikeinterface + server issue

JeffreyBoucher commented 1 month ago

Hello!

Setting n_jobs = 1 indeed let me get through the sparsity estimation without error! Naturally this takes much longer to do, though.

Since you've implied I might be able to solve my problem by updating from Python 3.9, I'll give that a shot next. It's been a while since I chose my version, but I don't think staying on 3.9 is critical at this stage of the pipeline.

Thanks for your help! I'll let you know if changing versions doesn't solve it for me; let me know if you need any more information from me.

Jeffrey Boucher

zm711 commented 1 month ago

Yeah, it would be great if you could test Python 3.10 or 3.11; there have been some improvements in multiprocessing at the Python level. If updating Python works, it tells us that 3.9 might not be as well supported by our multiprocessing as we thought. If 3.10/3.11/3.12 doesn't work either, then it might be a problem in our multiprocessing itself.

JeffreyBoucher commented 1 month ago

Hello!

Unfortunately, I still got the same error with python 3.11!

estimate_sparsity: 70%|███████ | 7983/11378 [3:11:25<36:45, 1.54it/s]
estimate_sparsity: 70%|███████ | 7990/11378 [3:11:40<1:21:16, 1.44s/it]
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/concurrent/futures/process.py", line 347, in run
    self.terminate_broken(cause)
  File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/concurrent/futures/process.py", line 499, in terminate_broken
    work_item.future.set_exception(bpe)
  File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/concurrent/futures/_base.py", line 559, in set_exception
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x2b409bed9d90 state=cancelled>

Any advice on what to try next? Anything you might also want to look at?

Thanks!

Jeff Boucher

zm711 commented 1 month ago

Thanks for that info! A few more background questions then:

What OS are you using (looks like Linux maybe; which flavor)? Is this on a server or locally? If on a server, what is the local OS you are using to communicate with it?

Could you do a conda list or pip list of version numbers for your packages in the environment?

Could you give us the stats on your recording object? If you just type recording into your terminal, the repr should tell us the file size/dtype/number of samples.

JeffreyBoucher commented 1 month ago

Hello!

I am indeed using Linux. Here is the output of cat /etc/os-release:

" NAME="Red Hat Enterprise Linux Server" VERSION="7.8 (Maipo)" ID="rhel" ID_LIKE="fedora" VARIANT="Server" VARIANT_ID="server" VERSION_ID="7.8" PRETTY_NAME="Red Hat Enterprise Linux" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.8 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.8"

"

This is on a server; it's a cluster organized by the university I work at (Myriad at UCL). Because of this, when I run the spike sorter I am interfacing with a job submission scheduler. My local OS is Linux as well: Ubuntu 22.04.

Here is the output of the conda list:

" _libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
asciitree 0.3.3 pypi_0 pypi blas 1.0 mkl
bzip2 1.0.8 h5eee18b_6
ca-certificates 2024.9.24 h06a4308_0
contourpy 1.3.0 pypi_0 pypi cuda-python 12.6.0 pypi_0 pypi cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
cycler 0.12.1 pypi_0 pypi distinctipy 1.3.4 pypi_0 pypi fasteners 0.19 pypi_0 pypi fastrlock 0.5 pypi_0 pypi filelock 3.13.1 pypi_0 pypi fonttools 4.54.1 pypi_0 pypi fsspec 2024.6.1 pypi_0 pypi gmp 6.2.1 h295c915_3
gmpy2 2.1.2 pypi_0 pypi h5py 3.12.1 pypi_0 pypi intel-openmp 2023.1.0 hdb19cb5_46306
jinja2 3.1.4 pypi_0 pypi joblib 1.4.2 pypi_0 pypi kiwisolver 1.4.7 pypi_0 pypi ld_impl_linux-64 2.38 h1181459_1
libabseil 20240116.2 cxx17_h6a678d5_0
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libprotobuf 4.25.3 he621ea3_0
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
llvmlite 0.43.0 pypi_0 pypi markupsafe 2.1.3 pypi_0 pypi matplotlib 3.9.2 pypi_0 pypi mkl 2023.1.0 h213fc3f_46344
mkl-fft 1.3.8 pypi_0 pypi mkl-random 1.2.4 pypi_0 pypi mkl-service 2.4.0 pypi_0 pypi mkl_fft 1.3.8 py311h5eee18b_0
mkl_random 1.2.4 py311hdb19cb5_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 pypi_0 pypi mtscomp 1.0.2 pypi_0 pypi nccl 2.8.3.1 hcaf9a05_0
ncurses 6.4 h6a678d5_0
neo 0.13.4 pypi_0 pypi networkx 3.3 pypi_0 pypi numba 0.60.0 pypi_0 pypi numcodecs 0.13.1 pypi_0 pypi numpy 1.26.4 pypi_0 pypi numpy-base 1.26.4 py311hf175353_0
openssl 3.0.15 h5eee18b_0
packaging 24.1 pypi_0 pypi pandas 2.2.3 pypi_0 pypi pillow 11.0.0 pypi_0 pypi pip 24.2 pypi_0 pypi probeinterface 0.2.24 pypi_0 pypi pyparsing 3.2.0 pypi_0 pypi python 3.11.10 he870216_0
python-dateutil 2.9.0.post0 pypi_0 pypi pytorch 2.3.0 cpu_py311h6fe12db_1
pytz 2024.2 pypi_0 pypi quantities 0.16.1 pypi_0 pypi readline 8.2 h5eee18b_0
scikit-learn 1.5.2 pypi_0 pypi scipy 1.14.1 pypi_0 pypi setuptools 72.1.0 pypi_0 pypi six 1.16.0 pypi_0 pypi spikeinterface 0.101.2 pypi_0 pypi sqlite 3.45.3 h5eee18b_0
sympy 1.13.2 pypi_0 pypi tbb 2021.8.0 hdb19cb5_0
threadpoolctl 3.5.0 pypi_0 pypi tk 8.6.14 h39e8969_0
torch 2.3.0 pypi_0 pypi tqdm 4.66.4 pypi_0 pypi typing-extensions 4.11.0 pypi_0 pypi typing_extensions 4.11.0 py311h06a4308_0
tzdata 2024.2 pypi_0 pypi wheel 0.43.0 pypi_0 pypi xz 5.4.6 h5eee18b_1
zarr 2.17.2 pypi_0 pypi zlib 1.2.13 h5eee18b_1

"

I'll get you the recording stats momentarily...

JeffreyBoucher commented 1 month ago

The recording I am currently working with outputs:

" ConcatenateSegmentRecording: 384 channels - 30.0kHz - 1 segments - 341,317,801 samples 11,377.26s (3.16 hours) - int16 dtype - 244.13 GiB "

It's a set of concatenated recordings taken over a period of about a week and a half.

Thanks for your help!

Jeff Boucher

zm711 commented 1 month ago

Could you try running just one of the recordings and see if that works with parallel n_jobs? I vaguely remember that we had a problem with certain concatenations, so I would like to test this.
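A sketch of what that single-recording test could look like (rec_list, single_sorting, and the folder names are hypothetical placeholders for however the individual sessions are loaded and sorted):

import spikeinterface.full as si

# Hypothetical: rec_list is the list of per-session recordings that was
# previously passed to si.concatenate_recordings(); take one session only.
single_rec = rec_list[0]

single_sorting = si.run_sorter("kilosort3", single_rec, folder=outDir / "ks3_single")

# Same analyzer call as before, but on the single session and with the
# default (parallel) job kwargs, to see if sparsity estimation still fails.
analyzer = si.create_sorting_analyzer(
    recording=single_rec,
    sorting=single_sorting,
    folder=outDir / "analyzer_single",
    format="binary_folder",
    sparse=True,
)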

@h-mayorquin do you remember this too? That giant concatenations were causing problems with multiprocessing?

The issue is that the best way for us to fix this is to have the data to test against, but sharing ~250 GB is a non-trivial thing :)

Maybe @samuelgarcia or @alejoe91 also have opinions about why multiprocessing is failing with concatenation (and they both use linux!).

samuelgarcia commented 1 month ago

Hi. Are you running the script using SLURM? In my lab, SLURM kills jobs because the way SLURM counts memory is wrong: with shared memory, every process's memory is counted cumulatively, which overflows the SLURM limit even though the real machine limit is fine. Could you test with fewer processes and more threads? For instance n_jobs=6, max_threads_per_process=8?
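For example, a sketch of that setting (assuming spikeinterface 0.101, where the per-worker thread count is the max_threads_per_process job kwarg):

import spikeinterface.full as si

# Fewer worker processes, each allowed more threads: total CPU usage stays
# similar, but fewer per-process copies of buffers are alive at once.
si.set_global_job_kwargs(n_jobs=6, max_threads_per_process=8)

# ...then rerun the same create_sorting_analyzer(..., sparse=True) call as before.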

JeffreyBoucher commented 4 weeks ago

Hello!

I'll run a single session dataset overnight tonight.

We are not using SLURM; the cluster seems to be using "SGE 8.1.9", which stands for "Sun Grid Engine". I don't know if there would be a similar problem with this; I'll try the single-session dataset first.

JeffreyBoucher commented 3 weeks ago

Hello!

In fact, I ran into a bug which I think is on my end, so I'm going to de-prioritize this for a bit. Since I was able to get things working by turning off parallel processing, I want to get that started on my real dataset, but afterward I'll get back to this (within a week).

Thanks for your help!

Jeff Boucher

samuelgarcia commented 3 weeks ago

Maybe SGE is killing your job because it is using too much RAM. Could you increase the memory when submitting the job?
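If the scheduler's memory request cannot be raised enough, another knob worth knowing about (not discussed above, just the standard SpikeInterface chunking job kwargs, so treat this as a sketch) is to shrink the chunk each worker loads:

import spikeinterface.full as si

# Smaller chunks mean each worker holds less of the 384-channel int16
# recording in memory at any time; "500ms" is only an illustrative value.
si.set_global_job_kwargs(chunk_duration="500ms")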

JeffreyBoucher commented 3 weeks ago

Hello

Parallel processing worked fine for a single session; for that and other reasons, I think that the suggestion to request more RAM for my jobs is a good one. I'll try it!

Thanks,

Jeff