AllenNeuralDynamics / aind-ephys-pipeline

Code Ocean pipeline for ephys processing with Kilosort2.5
MIT License

More failed runs #5

Closed bjhardcastle closed 6 months ago

bjhardcastle commented 7 months ago

Run 9595167

There seem to be two errors: spike sorting ran out of memory, and then the results collector raised a KeyError.

This also seems like a repeat of https://github.com/AllenNeuralDynamics/aind-ephys-spikesort-kilosort25-full/issues/22

...
SPIKE SORTING
Sorting recording: experiment1_Record Node 101#Neuropix-PXI-100.ProbeB-AP_recording1
BinaryFolderRecording: 383 channels - 30.0kHz - 1 segments - 245,100,456 samples 
                       8,170.02s (2.27 hours) - int16 dtype - 174.85 GiB

    SPIKE SORTING FAILED!
Error log:

{'datetime': '2024-03-05T10:21:12.777335',
 'error': True,
 'error_trace': 'Traceback (most recent call last):\n'
                '  File '
                '"/usr/local/lib/python3.8/dist-packages/spikeinterface/sorters/basesorter.py", '
                'line 258, in run_from_folder\n'
                '    SorterClass._run_from_folder(sorter_output_folder, '
                'sorter_params, verbose)\n'
                '  File '
                '"/usr/local/lib/python3.8/dist-packages/spikeinterface/sorters/external/kilosortbase.py", '
                'line 217, in _run_from_folder\n'
                '    raise Exception(f"{cls.sorter_name} returned a non-zero '
                'exit code")\n'
                'Exception: kilosort2_5 returned a non-zero exit code\n',
 'run_time': None,
 'runtime_trace': ['Time   0s. Computing whitening matrix..',
                   'Getting channel whitening matrix...',
                   'Channel-whitening matrix computed.',
                   'Time  25s. Loading raw data and applying filters...',
                   'Time 1066s. Finished preprocessing 3737 batches.',
                   'pitch is 20 um',
                   '0.55 sec, 1 batches, 10000 spikes',
                   '39.90 sec, 101 batches, 1000459 spikes',
                   '79.00 sec, 201 batches, 1972411 spikes',
...
                   '1461.86 sec, 3701 batches, 36829456 spikes',
                   '1476.44 sec, 3737 batches, 37182767 spikes',
                   'time 2508.50, Shifted up/down 3737 batches.',
                   'Time 2523s. Optimizing templates ...',
                   '2523.59 sec, 1 / 3737 batches, 67 units, nspks: 40.8680, '
                   'mu: 18.9874, nst0: 532, merges: 0.0000, 0.0000, 3.2000',
                   '2577.27 sec, 101 / 3737 batches, 678 units, nspks: '
                   '6819.6765, mu: 14.4304, nst0: 7954, merges: 147.8392, '
                   '2.1090, 37.4744',
...
                   '5361.00 sec, 3601 / 3737 batches, 731 units, nspks: '
                   '11556.7722, mu: 15.2399, nst0: 11728, merges: 197.1833, '
                   '0.2381, 38.3152',
                   '5443.09 sec, 3701 / 3737 batches, 706 units, nspks: '
                   '10398.4048, mu: 14.5297, nst0: 9466, merges: 186.8666, '
                   '0.6636, 36.3103',
                   'Elapsed time is 5469.255693 seconds.',
                   'Finished learning templates',
                   'Time 5473s. Optimizing templates ...',
                   '5474.31 sec, 1 / 3737 batches, 638 units, nspks: '
                   '8175.0000, mu: 14.3925, nst0: 16025',
...
                   '7408.31 sec, 2001 / 3737 batches, 638 units, nspks: '
                   '19631557.0000, mu: 14.3925, nst0: 19915',
                   '7507.21 sec, 2101 / 3737 batches, 638 units, nspks: '
                   '20531500.0000, mu: 14.3925, nst0: 15748',
                   '----------------------------------------Out of memory.'],
...
...
VISUALIZATION time: 321.9s
[capsule-6668112] completed!
[15/f89674] Submitted process > capsule_aind_ephys_results_collector_9 (capsule-4820071)
Error executing process > 'capsule_aind_ephys_results_collector_9 (capsule-4820071)'

Caused by:
  Essential container in task exited

Command executed:

  #!/usr/bin/env bash
  set -e

  export CO_CAPSULE_ID=2fcf1c0b-df5d-4822-b078-9e1024a092c5
  export CO_CPUS=4
  export CO_MEMORY=34359738368

  mkdir -p capsule
  mkdir -p capsule/data && ln -s $PWD/capsule/data /data
  mkdir -p capsule/results && ln -s $PWD/capsule/results /results
  mkdir -p capsule/scratch && ln -s $PWD/capsule/scratch /scratch

  echo "[capsule-4820071] cloning git repo..."
  git clone "https://$GIT_ACCESS_TOKEN@codeocean.allenneuraldynamics.org/capsule-4820071.git" capsule-repo
  git -C capsule-repo checkout ef0b4bc266bb835399b39026538a5fc7d07e5757 --quiet
  mv capsule-repo/code capsule/code
  rm -rf capsule-repo

  echo "[capsule-4820071] running capsule..."
  cd capsule/code
  chmod +x run
  ./run

  echo "[capsule-4820071] completed!"

Command exit status:
  1

Command output:
  [capsule-4820071] cloning git repo...
  [capsule-4820071] running capsule...

  COLLECTING RESULTS

Command error:
  [capsule-4820071] cloning git repo...
  Cloning into 'capsule-repo'...
  [capsule-4820071] running capsule...
  + python -u run_capsule.py
  COLLECTING RESULTS
  Traceback (most recent call last):
    File "/tmp/nxf.x6JzqDJdCU/capsule/code/run_capsule.py", line 217, in <module>
      upgraded_data_description = upgrader.upgrade(platform=Platform.ECEPHYS, **additional_required_kwargs)
    File "/opt/conda/lib/python3.9/site-packages/aind_data_schema/schema_upgrade/data_description_upgrade.py", line 161, in upgrade
      modality = self.get_modality(**kwargs)
    File "/opt/conda/lib/python3.9/site-packages/aind_data_schema/schema_upgrade/data_description_upgrade.py", line 131, in get_modality
      modality = [ModalityUpgrade.upgrade_modality(m) for m in old_modality]
    File "/opt/conda/lib/python3.9/site-packages/aind_data_schema/schema_upgrade/data_description_upgrade.py", line 131, in <listcomp>
      modality = [ModalityUpgrade.upgrade_modality(m) for m in old_modality]
    File "/opt/conda/lib/python3.9/site-packages/aind_data_schema/schema_upgrade/data_description_upgrade.py", line 53, in upgrade_modality
      return Modality.from_abbreviation(old_modality["abbreviation"])
    File "/opt/conda/lib/python3.9/site-packages/aind_data_schema/models/modalities.py", line 128, in from_abbreviation
      return cls._abbreviation_map[abbreviation]
  KeyError: 'behavior'

alejoe91 commented 7 months ago

Hi Ben, I'll take a look at the key error!

I think the pipeline shouldn't fail fast on a spike sorting error, especially when there are many streams available. If spike sorting fails for only one stream, the rest of the pipeline still runs fine for the others, and the visualization will plot the traces, which might be important for troubleshooting.
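
A minimal sketch of that behavior, assuming a hypothetical `sort_stream` callable that stands in for the real per-stream sorting step (the function and stream names here are illustrative, not the pipeline's actual code): each stream is sorted independently, failures are recorded, and the remaining streams still run.

```python
from typing import Callable, Dict, Iterable, Tuple


def sort_all_streams(
    stream_names: Iterable[str],
    sort_stream: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Sort every stream, collecting failures instead of stopping at the first.

    `sort_stream` is a hypothetical stand-in for the real sorting call.
    """
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for name in stream_names:
        try:
            results[name] = sort_stream(name)
        except Exception as exc:
            # Record the error and move on to the next stream.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_sorter(name: str) -> str:
    # Illustrative only: pretend one probe runs out of memory.
    if name == "ProbeB":
        raise MemoryError("Out of memory")
    return f"sorted:{name}"


results, failures = sort_all_streams(["ProbeA", "ProbeB", "ProbeC"], fake_sorter)
print("sorted:", sorted(results))
print("failed:", failures)
```

Downstream steps could then run only on the keys in `results`, while `failures` is reported at the end of the run.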

alejoe91 commented 7 months ago

The KeyError is unrelated to the spike sorting error, and it is what triggers the pipeline failure. It is due to a mismatch in the aind-data-schema that I'm investigating now.
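
To illustrate the kind of mismatch involved, here is a simplified sketch, not the actual aind-data-schema code: a lookup table keyed by abbreviation raises `KeyError` when older metadata uses an abbreviation the table no longer contains. The map contents, the `LEGACY_ALIASES` table, and the alias target are all assumptions made up for this example.

```python
from typing import Dict, Optional

# Illustrative map only; the real one lives in aind-data-schema and differs.
ABBREVIATION_MAP: Dict[str, str] = {
    "ecephys": "Extracellular electrophysiology",
    "placeholder-modality": "Placeholder modality",
}

# Hypothetical alias table; the target is a placeholder, not the real fix.
LEGACY_ALIASES: Dict[str, str] = {"behavior": "placeholder-modality"}


def from_abbreviation(
    abbreviation: str, aliases: Optional[Dict[str, str]] = None
) -> str:
    """Look up a modality name, resolving legacy aliases first."""
    key = (aliases or {}).get(abbreviation, abbreviation)
    try:
        return ABBREVIATION_MAP[key]
    except KeyError:
        known = ", ".join(sorted(ABBREVIATION_MAP))
        raise ValueError(
            f"Unknown modality abbreviation {abbreviation!r}; known: {known}"
        ) from None


# A bare dict lookup fails the same way as in the traceback.
try:
    ABBREVIATION_MAP["behavior"]
except KeyError as exc:
    print("KeyError:", exc)

# With an alias table, the legacy abbreviation resolves.
print(from_abbreviation("behavior", aliases=LEGACY_ALIASES))
```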

For the spike sorting failure, I think it's the GPU RAM that's running out, since it happens during the "Optimizing templates" step. I'm not sure how we can select larger GPUs on Code Ocean.
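
For a sense of scale, the recording dimensions printed in the log can be reproduced with a short calculation. This is only a sanity check on the reported numbers, assuming 2 bytes per int16 sample; it does not show which allocation actually ran out of memory.

```python
# Values copied from the BinaryFolderRecording line in the log.
N_CHANNELS = 383
N_SAMPLES = 245_100_456
SAMPLING_RATE_HZ = 30_000
BYTES_PER_SAMPLE = 2  # int16

GIB = 1024 ** 3

duration_s = N_SAMPLES / SAMPLING_RATE_HZ
size_gib = N_CHANNELS * N_SAMPLES * BYTES_PER_SAMPLE / GIB

# The collector capsule's CO_MEMORY value from the log, converted to GiB.
co_memory_gib = 34_359_738_368 / GIB

print(f"duration: {duration_s:,.2f} s ({duration_s / 3600:.2f} hours)")
print(f"raw size: {size_gib:.2f} GiB")
print(f"CO_MEMORY: {co_memory_gib:.0f} GiB")
```

The results agree with the log (8,170.02 s, 2.27 hours, 174.85 GiB). The raw recording is much larger than the memory of any single GPU, which is why the sorter works through it in batches.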

alejoe91 commented 7 months ago

This depends on https://github.com/AllenNeuralDynamics/aind-metadata-upgrader/pull/25

alejoe91 commented 6 months ago

Fixed :)