SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

write_binary_recording randomly fails when run_sorter is executed by Snakemake #1584

Closed: juliencarponcy closed this issue 11 months ago

juliencarponcy commented 1 year ago

Hello,

As the issue's title indicates, run_sorter randomly fails for me with kilosort3 during the write_binary_recording step. It has worked many times... but has failed many times as well. The only non-default parameter I use is chunk_duration, set to 10s, since we have long recordings and a powerful server with an enormous amount of RAM and 96 cores. We are sorting 1-hour-long Neuropixels recordings.
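For reference, the call looks roughly like this (a sketch with placeholder paths; `recording` stands in for whatever extractor opens the session, and chunk_duration / n_jobs are the job kwargs that run_sorter forwards to the binary-writing step):

import spikeinterface.extractors as se
import spikeinterface.sorters as ss

# Placeholder: open the recording with the appropriate extractor for the session
recording = se.read_spikeglx("/nas/recordings/session1")

sorting = ss.run_sorter(
    "kilosort3",
    recording,
    output_folder="/local/scratch/ks3_output",
    chunk_duration="10s",  # the only non-default parameter
    n_jobs=64,             # cores allocated to this Snakemake job
)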

We are now running it inside a Snakemake pipeline, using 64 cores for this job. Originally, I was executing it directly on the NAS where our files are stored, which has a ZFS filesystem. That was failing quite often because shutil.rmtree would sometimes fail, with files still being present despite having been deleted, or things of that kind. About that, our sysadmin said:

"The filesystem is ZFS. The shell access to Ettin doesn't allow for filesystem checking manually. Googling, it is a fairly well-reported issue that ZFS is slow to respond to delete commands, likely due to the copy-on-write nature of ZFS. The NAS's OS is basically a variant FreeNAS, so anything related to that likely applies to ettin (our NAS)"

For that reason, I now run run_sorter with an output_folder on local storage on the computing server, reading the input from the NAS. And I still encounter write_binary_recording failing far too often, at random points and on random files.

The script inside the snakemake pipeline running run_sorter can be found here: https://github.com/juliencarponcy/trialexp/blob/spike_sorting/workflows/scripts/s01_sort_ks3.py

The snakemake pipeline itself here: https://github.com/juliencarponcy/trialexp/blob/spike_sorting/workflows/spikesort.smk

I have attached a couple more examples of failures (on files which were then successfully sorted by another run!), but one example is provided below.

Thanks for letting me know if you have a hint about the nature of the issue.

Version of spikeinterface:

commit 20fdff30c932cc07c3254261969ea226fd0e9c18 (HEAD -> master, origin/master, origin/HEAD)
Merge: 846b3dd3 ba47792a
Author: Garcia Samuel <sam.garcia.die@gmail.com>
Date:   Wed Apr 5 17:48:01 2023 +0200

    Merge pull request #1497 from SpikeInterface/main

    Propagate main into master

Log of one of the errors (others attached):

  File "/home/MRC.OX.AC.UK/phar0732/Documents/GitHub/trialexp/.snakemake/scripts/tmp8_w9dgxe.s01_sort_ks3.py", line 91, in <module>
    sorting = ss.run_sorter(
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/site-packages/spikeinterface/sorters/runsorter.py", line 143, in run_sorter
    return run_sorter_local(**common_kwargs)
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/site-packages/spikeinterface/sorters/runsorter.py", line 162, in run_sorter_local
    SorterClass.setup_recording(recording, output_folder, verbose=verbose)
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/site-packages/spikeinterface/sorters/basesorter.py", line 196, in setup_recording
    cls._setup_recording(recording, sorter_output_folder, sorter_params, verbose)
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/site-packages/spikeinterface/sorters/external/kilosortbase.py", line 127, in _setup_recording
    write_binary_recording(recording, file_paths=[binary_file_path],
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/site-packages/spikeinterface/core/core_tools.py", line 280, in write_binary_recording
    executor.run()
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/site-packages/spikeinterface/core/job_tools.py", line 364, in run
    for res in results:
  File "/home/MRC.OX.AC.UK/phar0732/.local/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

System characteristics:

(base) phar0732@jade:~/Documents/GitHub/trialexp$ uname -a
Linux jade 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
(base) phar0732@jade:~/Documents/GitHub/trialexp$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7402 24-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2076.834
CPU max MHz:                     2800.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        5599.85
Virtualisation:                  AMD-V
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        24 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

2023-04-25_failed-sorting-writing-binary.log
2023-04-26_failed-sorting-writing-binary.log
2023-04-26_failed-sorting-writing-binary_bis.log

juliencarponcy commented 1 year ago

Just failed again, but I noticed that:

File "/home/MRC.OX.AC.UK/phar0732/.local/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:

Could it be tied to the progress bar getting messed up by jobs running in parallel from Snakemake while also being built into SpikeInterface's write_binary_recording? I'll try again with 'progress_bar': False.

I will also reset chunk_duration to 1s and hope for the best!
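Concretely, the retry would look something like this (a sketch; `recording` and output_folder are placeholders as in the earlier sketch):

import spikeinterface.sorters as ss

# Same call as before, with the two adjusted job kwargs:
sorting = ss.run_sorter(
    "kilosort3",
    recording,
    output_folder=output_folder,
    chunk_duration="1s",
    progress_bar=False,
)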

But I also tend to think it could be due to some periodic issue with the server. Any clue whether the availability of the file being read could be the issue here?

samuelgarcia commented 1 year ago

Salut Julien, nice to see you here.

Maybe it is also due to memmap and related to how the system caches buffers. We use memmap for both reading and writing. @h-mayorquin and I are doing some deep testing to find a better way to do this. Or maybe it is not that at all and mostly due to Snakemake itself.

It is hard to guess. What happens if you use fewer n_jobs?

juliencarponcy commented 1 year ago

Salut Sam! Nice to have finally opted out of Matlab and touched a bit of Neo/SpikeInterface, and to have a chat ;) although I may forget about SpikeInterface and just run Kilosort directly from Matlab, which sounds more reliable hahahah.

Since you mention memmap: I now have a repeating error (after setting chunk_duration to 1s and removing the progress bar); log attached. The pipeline / run_sorter has not formally crashed yet, though, even if I'm expecting it to: 2023-04-26_failed-sorting-writing-binary_no_bar_1s_chunks.log

Yes, I will try with fewer n_jobs, either through SpikeInterface or with fewer threads in Snakemake.

juliencarponcy commented 1 year ago

Update: it stayed frozen, and a keyboard interrupt gave an endless succession of the following. Restarting now with fewer jobs via the run_sorter options in SpikeInterface.

  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/queues.py", line 96, in get
    with self._rlock:
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/queues.py", line 96, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/queues.py", line 96, in get
    with self._rlock:
  File "/home/MRC.OX.AC.UK/phar0732/anaconda3/envs/trialexp/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
samuelgarcia commented 1 year ago

Keep in mind that having many jobs in parallel opens the same file many times with very high I/O demand. This is extremely efficient on a local SSD with an ext4 filesystem; a network drive with a complex protocol is maybe not the right solution for this.

What I do in a similar situation (enormous data on a NAS): copy the data to a local disk, run the sorting there, then copy the outputs back to the NAS. See the sketch below.
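A minimal sketch of that workflow (placeholder paths; read_spikeglx is assumed here for a Neuropixels session):

import shutil
import spikeinterface.extractors as se
import spikeinterface.sorters as ss

# 1) copy the raw data from the NAS to fast local storage
shutil.copytree("/nas/recordings/session1", "/local/scratch/session1")

# 2) open and sort the local copy
recording = se.read_spikeglx("/local/scratch/session1")
sorting = ss.run_sorter("kilosort3", recording, output_folder="/local/scratch/ks3_out")

# 3) copy the sorter output back to the NAS
shutil.copytree("/local/scratch/ks3_out", "/nas/results/session1_ks3")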

juliencarponcy commented 1 year ago

Hi Samuel,

Yes, this is exactly the approach I switched to a few days ago (copy and execute locally, then copy the outputs back to the server), except for the important detail of not scaling down n_jobs enough for the write_binary_recording phase. Since yesterday it appears to run all right with n_jobs = 32. I'll keep you up to date on whether it just works thanks to luck or actually stays stable. Otherwise, I'll explicitly scale down n_jobs specifically for the write_binary_recording phase.

On a related note, do you think run_sorter by itself (outside write_binary_recording) benefits much from multiple cores with kilosort3, or is the GPU doing most of the work, so that n_jobs won't affect the processing speed too much?

As a consequence, do you consider it possible to do a couple of sorting runs in parallel if the GPUs still have room for it? (I won't try that now, but I'm just curious.)

I'll keep updating the issue depending on my outcomes and may end up closing it next week if everything still runs smoothly.

Thanks for your help and clarifications, hope to see you soon!

alejoe91 commented 1 year ago

Hi @juliencarponcy

Except for saving to binary, the most important resource for KS is definitely the GPU. Almost all the processing is done there. If you have a large enough GPU, you could try to run multiple runs in parallel, but I haven't tried it myself for large recordings.
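Purely as an illustration of that idea (untested for large recordings, and assuming the recording objects pickle cleanly), launching two sorts concurrently could look like:

from concurrent.futures import ProcessPoolExecutor
import spikeinterface.sorters as ss

def sort_one(recording, output_folder):
    # each worker runs its own Kilosort3 instance on the shared GPU
    return ss.run_sorter("kilosort3", recording, output_folder=output_folder)

# `jobs` is a placeholder list of (recording, output_folder) pairs
with ProcessPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(sort_one, rec, out) for rec, out in jobs]
    sortings = [f.result() for f in futures]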

Cheers, Alessio