SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Binary file saving doesn't allow multiple jobs #1751

Closed RikkelBob closed 1 year ago

RikkelBob commented 1 year ago

Running recording.save(folder=file_path_pre, n_jobs=2, format='binary') results in the error trace below. Anything above n_jobs=1 causes this error, so it must have something to do with parallel processing. It seems that my full Python script is rerun (i.e., everything I print to the terminal before calling .save is printed again). This causes .save to create the output folder multiple times, resulting in the error below.
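(For context on the re-run behavior: on Windows, multiprocessing uses the spawn start method, which re-imports and re-executes the main module in every worker process. A minimal sketch of the standard guard, with file name and parameters mirroring this script, would be:)

# Sketch: wrap the top-level pipeline in a __main__ guard so that spawned
# workers, which re-import this module, do not re-execute it.
# File name and parameters below mirror the script in this issue.
import spikeinterface as si

def main():
    recording = si.read_binary("ksData_probe3.dat", 32000, 64, "int16")
    recording.save(folder="preprocessed", n_jobs=2, format="binary")

if __name__ == "__main__":
    main()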

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\multiprocessing\spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\multiprocessing\spawn.py", line 129, in _main
    prepare(preparation_data)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\multiprocessing\spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\multiprocessing\spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\main.py", line 65, in <module>
    recording_saved = recording_cmr.save(folder=file_path_pre, n_jobs=2, format='binary')
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\base.py", line 621, in save
    loaded_extractor = self.save_to_folder(**kwargs)
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\base.py", line 684, in save_to_folder
    assert not folder.exists(), f'folder {folder} already exists, choose another name'
AssertionError: folder C:\CheetahData\bench\preprocessed\2023-04-13_13-56-33 already exists, choose another name

(The same spawn traceback and AssertionError are printed a second time by another spawned worker.)

write_binary_recording:   0%|          | 0/6585 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\main.py", line 65, in <module>
    recording_saved = recording_cmr.save(folder=file_path_pre, n_jobs=2, format='binary')
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\base.py", line 621, in save
    loaded_extractor = self.save_to_folder(**kwargs)
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\base.py", line 700, in save_to_folder
    cached = self._save(folder=folder, verbose=verbose, **save_kwargs)
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\baserecording.py", line 297, in _save
    write_binary_recording(self, file_paths=file_paths, dtype=dtype, **job_kwargs)
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\core_tools.py", line 280, in write_binary_recording
    executor.run()
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\spikeinterface\core\job_tools.py", line 364, in run
    for res in results:
  File "C:\Users\Cheetah\PycharmProjects\spikeInterface\venv\Lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\concurrent\futures\process.py", line 602, in _chain_from_iterable_of_lists
    for element in iterable:
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Process finished with exit code 1

alejoe91 commented 1 year ago

What is your recording?

print(recording)

RikkelBob commented 1 year ago

BinaryRecordingExtractor: 64 channels - 32.0kHz - 1 segments - 210,713,088 samples 6,584.78s (1.83 hours) - int16 dtype - 25.12 GiB
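(The numbers are self-consistent: 210,713,088 samples / 32,000 Hz ≈ 6,584.78 s ≈ 1.83 hours, and 210,713,088 samples × 64 channels × 2 bytes (int16) ≈ 26.97 × 10^9 bytes ≈ 25.12 GiB.)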

alejoe91 commented 1 year ago

Can you share the full script?

RikkelBob commented 1 year ago
import spikeinterface as si
import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre
import spikeinterface.sorters as ss
import spikeinterface.postprocessing as spost
import spikeinterface.qualitymetrics as sqm
import spikeinterface.comparison as sc
import spikeinterface.exporters as sexp
import spikeinterface.widgets as sw
import probeinterface as pi
import utils
import docker
import shutil
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import os
import warnings
from probeinterface.plotting import plot_probe

fs = 32000
n_chan = 64
dtype = 'int16'
snippet = False
reorder = True
plot_timeseries = False

animal = "008"
session = "2023-04-13_13-56-33"
root_path = "C:/CheetahData/bench/"
file_name = "ksData_probe3.dat"
file_path = Path(root_path, session, file_name)
file_path_pre = Path(root_path, "preprocessed", session)
probe, positions = utils.gen_probe()
print(probe)

sorter_name = "mountainsort5"
output_folder = sorter_name + "-results"

recording = si.read_binary(file_path, fs, n_chan, dtype)

if snippet:
    start_time_s = 0  # start time in seconds
    end_time_s = 600  # end time in seconds

    start_frame = start_time_s * fs  # start frame
    end_frame = end_time_s * fs  # end frame

    # Create the new recording
    recording = recording.frame_slice(start_frame, end_frame)

recording.annotate(is_filtered=False)
recording.set_probe(probe, in_place=True)

channel_ids = recording.get_channel_ids()
fs = recording.get_sampling_frequency()
num_chan = recording.get_num_channels()
num_segments = recording.get_num_segments()

print(f'Channel ids: {channel_ids}')
print(f'Sampling frequency: {fs}')
print(f'Number of channels: {num_chan}')
print(f"Number of segments: {num_segments}")

w_ts = sw.plot_timeseries(recording, order_channel_by_depth=True, channel_ids=probe.device_channel_indices)

recording_f = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)
w_f = sw.plot_timeseries(recording_f, order_channel_by_depth=True)

recording_cmr = spre.common_reference(recording_f, operator="median", reference="global")
w_cmr = sw.plot_timeseries(recording_cmr, order_channel_by_depth=True)

if plot_timeseries:
    plt.show()

if os.path.isdir(file_path_pre):
    recording_saved = si.read_binary(Path(file_path_pre,"traces_cached_seg0.RAW"), fs, n_chan, dtype)
    recording_saved.set_probe(probe,in_place=True)
else:
    recording_saved = recording_cmr.save(folder=file_path_pre,n_jobs=1, format='binary')
    recording_saved.set_probe(probe,in_place=True)

print(ss.installed_sorters())

sorting = ss.run_sorter(sorter_name=sorter_name,
                        recording=recording_saved,
                        output_folder=output_folder,
                        docker_image=True,
                        verbose=True) # add num_workers?

print('end')
h-mayorquin commented 1 year ago

@RikkelBob I am confused by your trace; it seems that you are calling main (which is the script that you shared with us?) from within a multiprocessing call?

Does running the script that you just shared above cause an error on the following line?

    recording_saved = recording_cmr.save(folder=file_path_pre,n_jobs=1, format='binary')

If so, can you run the following script on your system to see if it generates the error? (Warning: this will generate a folder of 25 GiB.)

from spikeinterface.core.generate import generate_lazy_recording
from probeinterface import Probe
import spikeinterface.preprocessing as spre
import numpy as np

full_traces_size_GiB = 25.0

large_recording = generate_lazy_recording(full_traces_size_GiB=full_traces_size_GiB)
fs = large_recording.get_sampling_frequency()

binary_recording = large_recording.save()

recording = binary_recording
start_time_s = 0  # start time in seconds
end_time_s = 600  # end time in seconds

start_frame = start_time_s * fs  # start frame
end_frame = end_time_s * fs  # end frame
end_frame = min(end_frame, recording.get_num_frames())  # make sure it does not go over the end
# Create the new recording
recording = recording.frame_slice(start_frame, end_frame)
recording.annotate(is_filtered=False)

recording_f = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)
recording_cmr = spre.common_reference(recording_f, operator="median", reference="global")

recording_saved = recording_cmr.save(n_jobs=2, format="binary")
rat-h commented 1 year ago

@h-mayorquin I modified your example, and it now reproduces the error on my side. I think this is minimal code to reproduce the problem.

from spikeinterface.core.generate import generate_lazy_recording
from probeinterface import Probe
import spikeinterface.full as si
import numpy as np

full_traces_size_GiB = 5.5

large_recording = generate_lazy_recording(full_traces_size_GiB=full_traces_size_GiB)
fs = large_recording.get_sampling_frequency()

binary_recording = large_recording.save(folder="base")

recording = binary_recording
start_time_s = 0  # start time in seconds
end_time_s = 600  # end time in seconds

start_frame = start_time_s * fs  # start frame
end_frame = end_time_s * fs  # end frame
end_frame = min(end_frame, recording.get_num_frames())  # make sure it does not go over the end
# Create the new recording
recording = recording.frame_slice(start_frame, end_frame)
recording.annotate(is_filtered=False)

recording_hp = si.filter(recording,btype='highpass',band=300)
recording_cmr = si.common_reference(recording_hp, operator="median")
recording_saved = recording_cmr.save(folder="preprocessed", n_jobs=16, total_memory="2G", progress_bar=True,chunk_duration='1m')

It passes the first save call, large_recording.save(folder="base"), but crashes on the second, recording_cmr.save(folder="preprocessed", n_jobs=16, total_memory="2G", progress_bar=True, chunk_duration='1m').

python test.py
write_binary_recording with n_jobs = 1 and chunk_size = 30000
write_binary_recording: 100%|############################################################################################################################| 49/49 [00:39<00:00,  1.23it/s]
write_binary_recording with n_jobs = 16 and chunk_size = 30517
write_binary_recording:   0%|                                                                                                                                     | 0/48 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/lustre/groups/colonneselab/v01-spikesorting-GAopt-20230814/test.py", line 28, in <module>
    recording_saved = recording_cmr.save(folder="preprocessed", n_jobs=16, total_memory="2G", progress_bar=True,chunk_duration='1m')
  File "/SMHS/home/rath/.local/lib/python3.10/site-packages/spikeinterface/core/base.py", line 749, in save
    loaded_extractor = self.save_to_folder(**kwargs)
  File "/SMHS/home/rath/.local/lib/python3.10/site-packages/spikeinterface/core/base.py", line 825, in save_to_folder
    cached = self._save(folder=folder, verbose=verbose, **save_kwargs)
  File "/SMHS/home/rath/.local/lib/python3.10/site-packages/spikeinterface/core/baserecording.py", line 444, in _save
    write_binary_recording(self, file_paths=file_paths, dtype=dtype, **job_kwargs)
  File "/SMHS/home/rath/.local/lib/python3.10/site-packages/spikeinterface/core/core_tools.py", line 314, in write_binary_recording
    executor.run()
  File "/SMHS/home/rath/.local/lib/python3.10/site-packages/spikeinterface/core/job_tools.py", line 400, in run
    for res in results:
  File "/SMHS/home/rath/.local/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/SMHS/home/rath/.local/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/SMHS/home/rath/.local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/SMHS/home/rath/.local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/SMHS/home/rath/.local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/SMHS/home/rath/.local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Again, this happened after the system update, and I suspect that it may be because of OpenMP or some problem with the gcc libraries.
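(A quick environment probe, independent of spikeinterface, can help correlate the crash with the system update - a minimal sketch:)

# Sketch: report the default multiprocessing start method, the Python build,
# and the glibc version, to correlate the BrokenProcessPool with the OS update.
import multiprocessing
import platform

print("start method:", multiprocessing.get_start_method())
print("python:", platform.python_version())
print("libc:", platform.libc_ver())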

rat-h commented 1 year ago

@h-mayorquin @alejoe91 here is a conundrum!

The code below runs just fine


# import spikeinterface.full as si
# si.set_global_tmp_folder("spikeiteface.cache")

from spikeinterface.core.generate import generate_lazy_recording
from probeinterface import Probe
import spikeinterface.preprocessing as spre
import numpy as np
import os

os.system('rm -fR base preprocessed')

full_traces_size_GiB = 1.

large_recording = generate_lazy_recording(full_traces_size_GiB=full_traces_size_GiB)
fs = large_recording.get_sampling_frequency()

binary_recording = large_recording.save(folder='base')

recording = binary_recording
start_time_s = 0  # start time in seconds
end_time_s = 600  # end time in seconds

start_frame = start_time_s * fs  # start frame
end_frame = end_time_s * fs  # end frame
end_frame = min(end_frame, recording.get_num_frames())  # make sure it does not go over the end
# Create the new recording
recording = recording.frame_slice(start_frame, end_frame)
recording.annotate(is_filtered=False)

recording_f = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)
recording_cmr = spre.common_reference(recording_f, operator="median", reference="global")

recording_saved = recording_cmr.save(n_jobs=-1, folder='preprocessed')
$ python  test-spikeinterface-orig.py
write_binary_recording with n_jobs = 1 and chunk_size = 30000
write_binary_recording: 100%|###############################################################################################################################| 49/49 [00:21<00:00,  2.32it/s]
write_binary_recording with n_jobs = 40 and chunk_size = 30000
write_binary_recording: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:38<00:00,  1.29it/s]

However, if I import spikeinterface.full as si, without even using it, it crashes!

import spikeinterface.full as si
# si.set_global_tmp_folder("spikeiteface.cache")

from spikeinterface.core.generate import generate_lazy_recording
from probeinterface import Probe
import spikeinterface.preprocessing as spre
import numpy as np
import os

os.system('rm -fR base preprocessed')

full_traces_size_GiB = 1.

large_recording = generate_lazy_recording(full_traces_size_GiB=full_traces_size_GiB)
fs = large_recording.get_sampling_frequency()

binary_recording = large_recording.save(folder='base')

recording = binary_recording
start_time_s = 0  # start time in seconds
end_time_s = 600  # end time in seconds

start_frame = start_time_s * fs  # start frame
end_frame = end_time_s * fs  # end frame
end_frame = min(end_frame, recording.get_num_frames())  # make sure it does not go over the end
# Create the new recording
recording = recording.frame_slice(start_frame, end_frame)
recording.annotate(is_filtered=False)

recording_f = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)
recording_cmr = spre.common_reference(recording_f, operator="median", reference="global")

recording_saved = recording_cmr.save(n_jobs=-1, folder='preprocessed')
$ python  test-spikeinterface-orig.py
write_binary_recording with n_jobs = 1 and chunk_size = 30000
write_binary_recording: 100%|###############################################################################################################################| 49/49 [00:21<00:00,  2.33it/s]
write_binary_recording with n_jobs = 40 and chunk_size = 30000
write_binary_recording:   0%|                                                                                                                                        | 0/49 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/lustre/groups/colonneselab/test-spikeinterface-orig.py", line 36, in <module>
    recording_saved = recording_cmr.save(n_jobs=-1, folder='/local/preprocessed')
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/SMHS/home/rath/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 749, in save
    loaded_extractor = self.save_to_folder(**kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/SMHS/home/rath/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 825, in save_to_folder
    cached = self._save(folder=folder, verbose=verbose, **save_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/SMHS/home/rath/.local/lib/python3.11/site-packages/spikeinterface/core/baserecording.py", line 444, in _save
    write_binary_recording(self, file_paths=file_paths, dtype=dtype, **job_kwargs)
  File "/SMHS/home/rath/.local/lib/python3.11/site-packages/spikeinterface/core/core_tools.py", line 314, in write_binary_recording
    executor.run()
  File "/SMHS/home/rath/.local/lib/python3.11/site-packages/spikeinterface/core/job_tools.py", line 400, in run
    for res in results:
  File "/SMHS/home/rath/.local/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/SMHS/home/rath/.local/lib/python3.11/concurrent/futures/process.py", line 602, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/SMHS/home/rath/.local/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/SMHS/home/rath/.local/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/SMHS/home/rath/.local/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/SMHS/home/rath/.local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Note, this problem appeared after the operating system was updated on our cluster. It is currently the Rocky Linux 8.8 GeneralCloud image. This behavior can be reproduced in a virtual machine using qemu-kvm:

[Screenshot_20230821_163309: the error reproduced in the qemu-kvm virtual machine]

Any help with this issue is highly appreciated! -rth

h-mayorquin commented 1 year ago

@rat-h Hi, thanks for looking deeper into this. I am on leave right now, so I don't have enough time to reproduce the error with a virtual machine (I ran your script on my system with and without the full import and it runs fine).

It is indeed very strange that doing the full import generates the error. The first question is: do you need it? I personally never use it in my development and, as you probably know, it is an anti-pattern. That said, it still reveals that there is something fishy going on. I suggest the following:

rat-h commented 1 year ago

@h-mayorquin sorry to bother you on your leave.

I ran your script on my system with and without the full import and it runs fine

YES! Same on my desktop computer. It is specific to this particular distribution, Rocky Linux, and I can't get my head around why!

Can you avoid the error if you remove a specific type of pre-processing that you have (need to see if there is a specific computation that is generating the error)

I did a few tests and couldn't find any preprocessing which caused the error.

Can you see what part of the full import is generating the conflict? You can comment out the sub full imports in the full import script to see exactly what is interfering

That was a pretty funny game, but after trying all of them one by one: importing any of the sub-modules below causes the problem.

Surprisingly, none of them are about preprocessing!

Does the error show up with python 3.10?

Yes. I have tried

  1. Installed from dnf python3.11
  2. Installed from dnf python3.9
  3. Compiled from source code python3.9
  4. Compiled from source code python3.10
  5. Compiled from source code python3.11

Could you try using a different mp_context (spawn) to see if the error still shows up:

If I add mp_context="spawn" to the last save call, the error is still there, but something else appeared in the error message. Code:

recording_saved = recording_cmr.save(n_jobs=-1, folder='preprocessed',mp_context="spawn")
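(For context: on Linux the default start method is fork, where workers inherit the parent's memory; spawn instead starts a fresh interpreter and re-imports the main module, which is why switching mp_context changes what the workers execute and therefore what appears in the error message.)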

Error:


write_binary_recording:   0%|          | 0/9 [00:00<?, ?it/s]
write_binary_recording:  11%|#1        | 1/9 [00:00<00:05,  1.60it/s]
write_binary_recording:  22%|##2       | 2/9 [00:01<00:04,  1.53it/s]
write_binary_recording:  33%|###3      | 3/9 [00:01<00:03,  1.51it/s]
write_binary_recording:  44%|####4     | 4/9 [00:02<00:03,  1.53it/s]
write_binary_recording:  56%|#####5    | 5/9 [00:03<00:02,  1.55it/s]
write_binary_recording:  67%|######6   | 6/9 [00:03<00:01,  1.53it/s]
write_binary_recording:  78%|#######7  | 7/9 [00:04<00:01,  1.52it/s]
write_binary_recording:  89%|########8 | 8/9 [00:05<00:00,  1.51it/s]
write_binary_recording: 100%|##########| 9/9 [00:05<00:00,  1.64it/s]
write_binary_recording: 100%|##########| 9/9 [00:05<00:00,  1.57it/s]

write_binary_recording:   0%|          | 0/9 [00:00<?, ?it/s]
write_binary_recording:   0%|          | 0/9 [00:00<?, ?it/s]
write_binary_recording:  11%|#1        | 1/9 [00:00<00:04,  1.76it/s]
write_binary_recording:   0%|          | 0/9 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 129, in _main
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 240, in prepare
    prepare(preparation_data)
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib64/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/test-spikeinterface-orig.py", line 19, in <module>
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/test-spikeinterface-orig.py", line 19, in <module>
    binary_recording = large_recording.save(folder='base')
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 749, in save
    binary_recording = large_recording.save(folder='base')
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    loaded_extractor = self.save_to_folder(**kwargs)
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 749, in save
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 812, in save_to_folder
    loaded_extractor = self.save_to_folder(**kwargs)
    assert not folder.exists(), f"folder {folder} already exists, choose another name"
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: folder base already exists, choose another name
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 812, in save_to_folder
    assert not folder.exists(), f"folder {folder} already exists, choose another name"
AssertionError: folder base already exists, choose another name

write_binary_recording:   0%|          | 0/9 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/test-spikeinterface-orig.py", line 37, in <module>
    recording_saved = recording_cmr.save(n_jobs=-1, folder='preprocessed',mp_context="spawn")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 749, in save
    loaded_extractor = self.save_to_folder(**kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/base.py", line 825, in save_to_folder
    cached = self._save(folder=folder, verbose=verbose, **save_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/baserecording.py", line 444, in _save
    write_binary_recording(self, file_paths=file_paths, dtype=dtype, **job_kwargs)
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/core_tools.py", line 314, in write_binary_recording
    executor.run()
  File "/home/rocky/.local/lib/python3.11/site-packages/spikeinterface/core/job_tools.py", line 400, in run
    for res in results:
  File "/home/rocky/.local/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 597, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
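(Worth noting: the interleaved tracebacks above come from spawned workers re-importing the main module. Because the script's top level is unguarded, each child re-runs large_recording.save(folder='base') and trips the "folder base already exists" assertion - the same failure mode as the original Windows report - and the parent then raises BrokenProcessPool.)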

==== UPDATE ==== I left only from .postprocessing import * in spikeinterface.full and tried to figure out what could cause the problem. Importing any of the sub-sub modules below caused the error.

from .correlograms import (
    CorrelogramsCalculator,
    compute_autocorrelogram_from_spiketrain,
    compute_crosscorrelogram_from_spiketrain,
    compute_correlograms,
    correlogram_for_one_segment,
    compute_correlograms_numba,
    compute_correlograms_numpy,
)

from .isi import (
    ISIHistogramsCalculator,
    compute_isi_histograms_from_spiketrain,
    compute_isi_histograms,
    compute_isi_histograms_numpy,
    compute_isi_histograms_numba,
)

from .unit_localization import (
    compute_unit_locations,
    UnitLocationsCalculator,
    compute_center_of_mass,
)

Apparently, all of them import numba, so if I change the test script such that it imports numba, this creates the problem in preprocessing!

#import spikeinterface.full as si
# si.set_global_tmp_folder("spikeiteface.cache")
# from spikeinterface.core import WaveformExtractor, BaseWaveformExtractorExtension

import numba
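(Given that, a plausible minimal reproduction - a hypothetical sketch, not a confirmed diagnosis - is to import numba purely for its side effects and then fork workers; loading numba's native extensions can initialize a threading runtime such as OpenMP, and forking after a threading runtime has started threads is a classic source of child processes that die abruptly:)

# Hypothetical minimal repro: import numba purely for its side effects, then
# run a trivial job in a (fork-started, on Linux) process pool. If the pool
# also breaks here, the crash is environmental rather than spikeinterface's.
import numba  # noqa: F401
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(square, range(8))))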

I checked that all numba libraries have correct links.

$ for l in $(find . -name "*.so") ; do echo $l ; ldd $l ; echo ; done
./_devicearray.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffff235f000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fad19aed000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fad1976b000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fad19553000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fad19333000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fad18f6e000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fad19e82000)

./_dispatcher.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffee856c000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fac88066000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fac87ce4000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fac87acc000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fac878ac000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fac874e7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fac883fb000)

./_dynfunc.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffe2bbac000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f43fdd00000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f43fd93b000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f43fdf20000)

./_helperlib.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffc553f1000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f48d12dc000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f48d10bc000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f48d0cf7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f48d165e000)

./mviewbuf.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffd86bab000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5920a96000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f59206d1000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5920cb6000)

./core/runtime/_nrt_python.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffebd7af000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f6828bb4000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f6828832000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f682861a000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f68283fa000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f6828035000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f6828f49000)

./core/typeconv/_typeconv.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffc529fb000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f4a09258000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f4a08ed6000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f4a08cbe000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4a08a9e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f4a086d9000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4a095ed000)

./cuda/cudadrv/_extras.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffd421fc000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fb05e3d6000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fb05e011000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fb05e5f6000)

./experimental/jitclass/_box.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffeb9e49000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5a82720000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f5a8235b000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5a82940000)

./np/ufunc/_internal.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffecf18e000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fc7003d6000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc7001b6000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fc6ffdf1000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fc700758000)

./np/ufunc/_num_threads.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007fffde9d9000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f94ab491000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f94ab0cc000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f94ab6b1000)

./np/ufunc/omppool.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffc97df8000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f70d3fc9000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f70d3c47000)
    libgomp.so.1.0.0 => /lib64/libgomp.so.1.0.0 (0x00007f70d3a0f000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f70d37f7000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f70d35d7000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f70d3212000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f70d435e000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f70d300e000)

./np/ufunc/tbbpool.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffcb4bc1000)
    libtbb.so.12 => not found
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f1f08463000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f1f080e1000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f1f07ec9000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f1f07ca9000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f1f078e4000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f1f087f8000)

./np/ufunc/workqueue.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffca77e8000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ff110060000)
    libm.so.6 => /lib64/libm.so.6 (0x00007ff10fcde000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ff10fac6000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ff10f8a6000)
    libc.so.6 => /lib64/libc.so.6 (0x00007ff10f4e1000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ff1103f5000)

and libtbb is missing. However, on my desktop computer where everything works, it's missing too. So what else can it be?!

rat-h commented 1 year ago

The problem above was solved by locally compiling and installing libffi-3.4.4 and then recompiling python-3.10.12 from source. spikeinterface installed after that works as expected.
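(For anyone debugging something similar: ctypes is backed by libffi through the _ctypes extension module, so a quick way to see which libffi a given interpreter resolves is a sketch like this, assuming Linux with ldd on the PATH:)

# Sketch: print the _ctypes extension path and its shared-library dependencies,
# to check which libffi (if any) this interpreter actually links against.
import subprocess
import _ctypes

print(_ctypes.__file__)
print(subprocess.run(["ldd", _ctypes.__file__], capture_output=True, text=True).stdout)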

samuelgarcia commented 1 year ago

@rat-h : your fight was very honorable! Sorry that you had to spend so much time on an installation problem.

I do not like anaconda too much on linux; I prefer system packages + pip + venv. But sometimes it helps to have a clean and easy install with conda.

The version alignment of numba+numpy and numpy+hdbscan is sometimes very hard on linux.

Be aware that we are trying to maintain almost-working anaconda environments here: https://github.com/SpikeInterface/spikeinterface/tree/main/installation_tips They are not always up to date, but this is a good start.

h-mayorquin commented 1 year ago

Amazing that you solved it!

I still wonder how the linkage failed, as it only fails with that specific architecture in your virtual machine, right?

rat-h commented 1 year ago

@h-mayorquin you are correct. Apparently, Rocky Linux uses lightweight libraries to make cloud images smaller. At first, I thought that the problem was in libtbb, which is present but isn't seen by python and numba. However, it was not the source of the problem. So after that, I just had to find which libraries caused it.

h-mayorquin commented 1 year ago

I am wondering, thinking retrospectively: is there anything that we could have done at the spikeinterface level to have made this type of error easier to debug?

Any suggestion?

rat-h commented 1 year ago

Hm, it's hard to say. Maybe if we had an option to run internal tests for each required module before installation and report any errors - that could narrow the search for these kinds of errors.

h-mayorquin commented 1 year ago

We have CI testing for each module, which is maybe something that could be used to diagnose:

https://spikeinterface.readthedocs.io/en/latest/development/development.html#how-to-run-tests-locally

Maybe we could make this more prominent somewhere else, such as a "test your installation" section in the readme or something like that.

I am happy that we could pin down your problem. I am closing this issue now.