SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License
533 stars 187 forks source link

Stray .nfs file preventing deletion of waveforms extension folder #3348

Open jonahpearl opened 3 months ago

jonahpearl commented 3 months ago

Hi all — I'm on a Linux / NFS system (CentOS 7 specifically, on my school's HPCC), and I encounter an OSError every time I try to re-compute the waveforms extension of a sorting analyzer. The core problem is that there's a stray .nfs temporary file hanging out in the waveforms folder. If I do ls -a [that folder] I can see it clearly, it's always something like .nfs9eb83351caf602ec00005734.

I can see that it's python using the file with lsof:

COMMAND   PID    USER   FD   TYPE DEVICE  SIZE/OFF                 NODE NAME
python  30135 jop9552  mem    REG   0,45 128896688 11436947680097862380 [...]/analyzer_folder/extensions/waveforms/.nfs9eb83351caf602ec00005734
python  30135 jop9552   93u   REG   0,45 128896688 11436947680097862380 [...]/analyzer_folder/extensions/waveforms/.nfs9eb83351caf602ec00005734

and if I call kill on that process, my notebook's kernel crashes, so I'm pretty sure it's coming from spikeinterface.

The offending function is at /spikeinterface/core/sortinganalyzer.py:2090: shutil.rmtree(extension_folder).

Traceback ``` --------------------------------------------------------------------------- OSError Traceback (most recent call last) Cell In[6], line 1 ----> 1 analyzer.compute("waveforms") File ~/datta-lab/spikeinterface/src/spikeinterface/core/sortinganalyzer.py:1155, in SortingAnalyzer.compute(self, input, save, extension_params, verbose, **kwargs) 1108 """ 1109 Compute one extension or several extensiosn. 1110 Internally calls compute_one_extension() or compute_several_extensions() depending on the input type. (...) 1152 1153 """ 1154 if isinstance(input, str): -> 1155 return self.compute_one_extension(extension_name=input, save=save, verbose=verbose, **kwargs) 1156 elif isinstance(input, dict): 1157 params_, job_kwargs = split_job_kwargs(kwargs) File ~/datta-lab/spikeinterface/src/spikeinterface/core/sortinganalyzer.py:1232, in SortingAnalyzer.compute_one_extension(self, extension_name, save, verbose, **kwargs) 1229 assert ok, f"Extension {extension_name} requires {dependency_name} to be computed first" 1231 extension_instance = extension_class(self) -> 1232 extension_instance.set_params(save=save, **params) 1233 if extension_class.need_job_kwargs: 1234 extension_instance.run(save=save, verbose=verbose, **job_kwargs) File ~/datta-lab/spikeinterface/src/spikeinterface/core/sortinganalyzer.py:2053, in AnalyzerExtension.set_params(self, save, **params) 2047 """ 2048 Set parameters for the extension and 2049 make it persistent in json. 2050 """ 2051 # this ensure data is also deleted and corresponf to params 2052 # this also ensure the group is created -> 2053 self._reset_extension_folder() 2055 params = self._set_params(**params) 2056 self.params = params File ~/datta-lab/spikeinterface/src/spikeinterface/core/sortinganalyzer.py:2028, in AnalyzerExtension._reset_extension_folder(self) 2026 extension_folder = self._get_binary_extension_folder() 2027 if extension_folder.is_dir(): -> 2028 shutil.rmtree(extension_folder) 2029 extension_folder.mkdir(exist_ok=False, parents=True) 2031 elif self.format == "zarr": File ~/miniconda3/envs/spikeinterface/lib/python3.9/shutil.py:740, in rmtree(path, ignore_errors, onerror) 738 os.rmdir(path) 739 except OSError: --> 740 onerror(os.rmdir, path, sys.exc_info()) 741 else: 742 try: 743 # symlinks to directories are forbidden, see bug #1669 File ~/miniconda3/envs/spikeinterface/lib/python3.9/shutil.py:738, in rmtree(path, ignore_errors, onerror) 736 os.close(fd) 737 fd_closed = True --> 738 os.rmdir(path) 739 except OSError: 740 onerror(os.rmdir, path, sys.exc_info()) OSError: [Errno 39] Directory not empty: '[...]/analyzer_folder/extensions/waveforms' ```

MRE:

recording = si.load_extractor([any recording])
sorting = si.read_sorter_folder([any sorting])
analyzer = si.create_sorting_analyzer(
    sorting=sorting, 
    recording=recording, 
    format='binary_folder', 
    folder="./tmp_analyzer",
)
to_compute = ["random_spikes", "waveforms"]
analyzer.compute(to_compute)
analyzer.compute("waveforms")  # tries to reset folder before recomputing but fails at deletion

Not sure how to go about debugging this but any advice is welcome. This is v0.101.0. Thanks!

zm711 commented 3 months ago

What is the OSError that is actually given :) I don't see it in the issue.

h-mayorquin commented 3 months ago

It is probably in the recording and sorting that keeps a reference file handle. Probably related to:

https://github.com/SpikeInterface/spikeinterface/issues/3295

jonahpearl commented 3 months ago

What is the OSError that is actually given :) I don't see it in the issue.

Added the traceback :)

jonahpearl commented 3 months ago

Also the same issue happens if I load sorting output from mountainsort5, so I don't think it's related to KS4 as being discussed in #3295.

zm711 commented 3 months ago

What is the actually context of this use (ie I could imagine recomputing waveforms later, but why do compute(waveforms) compute(waveforms) twice in a row?)? Does the error still happen if you do:

analyzer.compute(['random_spikes', 'waveforms', 'templates'])

The waveforms file could be left open I guess during writing, so I'm wondering if computing another extensions right after waveforms ensures the file is closed.

zm711 commented 3 months ago

Looking here https://github.com/SpikeInterface/spikeinterface/blob/007b6efd0e675e60658199112d0541fba2b432cd/src/spikeinterface/core/waveform_tools.py#L194C45-L194C53

it seems that the memmap for writing the waveforms is open but I haven't found an explicit close yet? But I think @samuelgarcia or @alejoe91 know this part of the code base the best to know if the close is located somewhere else. But maybe this is causing your specific problem.

jonahpearl commented 3 months ago

The context is, I'm developing code about the analyzer extensions and sometimes just end up running the same code twice :p The MRE just reproduces it for me, it isn't necessarily how one would come across it in the wild. And no, I've encountered this error with a large list of extensions being computed after waveforms.

If I compute the extensions once, then restart the kernel and reload the analyzer, I can get away with an initial re-compute of waveforms with no errors. But then on the second try of a given kernel there's usually an error, although sometimes it also takes until the third run (??).

zm711 commented 3 months ago

Would you be willing to try the same with zarr. the binary_folder backend uses the memmap whereas zarr is using the shared_memory. It might be worth testing. If zarr works with no problems then at least from my perspective this would point more strongly to the waveforms.npy file not being closed.

jonahpearl commented 3 months ago

Yup, you're exactly right. It works as expected with zarr, no OSError.

zm711 commented 3 months ago

Okay, then I think we would need to close the memmap to fix that, but I think the general idea would be we create the memmap and leave it open so you can use the waveforms for other extensions without needing to reopen it./reload the data So I'm not sure if/when we could do that. We will just have to wait to see what Alessio and Sam think for this.

samuelgarcia commented 3 months ago

Thanls for reporting and investigating. I will have a look into this.