BouchardLab / nsds_lab_to_nwb

Python package to convert NSDS Lab data to NWB files.
https://nsds-lab-to-nwb.readthedocs.io/en/latest/
0 stars 4 forks source link

Resample dimension exceeds 2^31 - bytes #107

Closed jthermiz closed 3 years ago

jthermiz commented 3 years ago

I believe this is a known issue in process_nwb, but I get this when trying to run the pipeline with the resample flag set to True

read from t=0s to t=1463.58s
Traceback (most recent call last):
  File "generate_nwb.py", line 51, in <module>
    nwb_content = nwb_builder.build()
  File "/home/jhermiz/software/nsds_lab_to_nwb/nsds_lab_to_nwb/nwb_builder.py", line 209, in build
    self.neural_data_originator.make(nwb_content, electrode_table_regions)
  File "/home/jhermiz/software/nsds_lab_to_nwb/nsds_lab_to_nwb/components/neural_data/neural_data_originator.py", line 77, in make
    data = self.resample(data)
  File "/home/jhermiz/software/nsds_lab_to_nwb/nsds_lab_to_nwb/components/neural_data/neural_data_originator.py", line 50, in resample
    new_data = resample(data, new_freq, rate)
  File "/home/jhermiz/.conda/envs/nsds_nwb/lib/python3.7/site-packages/process_nwb/resample.py", line 122, in resample
    Xds = resample_func(X, new_n_time, npad=npad, real=real)
  File "/home/jhermiz/.conda/envs/nsds_nwb/lib/python3.7/site-packages/process_nwb/resample.py", line 77, in resample_func
    X_fft = rfft(X, axis=0)
  File "/home/jhermiz/.conda/envs/nsds_nwb/lib/python3.7/site-packages/mkl_fft/_numpy_fft.py", line 414, in rfft
    (x,), {'n': n, 'axis': axis})
  File "/home/jhermiz/.conda/envs/nsds_nwb/lib/python3.7/site-packages/mkl_fft/_numpy_fft.py", line 103, in trycall
    raise ve
  File "/home/jhermiz/.conda/envs/nsds_nwb/lib/python3.7/site-packages/mkl_fft/_numpy_fft.py", line 98, in trycall
    res = func(*args, **kwrds)
  File "mkl_fft/_pydfti.pyx", line 792, in mkl_fft._pydfti.rfft_numpy
  File "mkl_fft/_pydfti.pyx", line 701, in mkl_fft._pydfti._rc_fft1d_impl
ValueError: Internal error occurred: b'Intel MKL DFTI ERROR: Data size of one of transform dimensions exceeds 2^31 - 1 bytes'
jthermiz commented 3 years ago

I think found an alternative resampling function doesn't produce this error. The spikeinterface resampling method (underlying implementation is scipy.signal.resample) did not crash for RVG16_B06. The output is saved to /clusterfs/NSDS_data/nwb/RVG16/RVG16_B06_resampled.py. I implemented this approach on my fork in the no_resample branch: https://github.com/jthermiz/nsds_lab_to_nwb/tree/no_resample. However, I'm not sure if the resampling method outputs desirable results. I believe they do but we should check.

@izzy-baldacci Can you please compare these resampling functions process_nwb.resample.resample vs spiketoolkit.preprocessing.resample (and perhaps scipy.signal.resample) packages. Using a small snippet of data, do you see any difference between the signals? If so, can you characterize the differences by plotting the residuals over time and as a distribution. It may also be informative to plot the power spectral density of these signals.

Resampling to the nearest kHz takes forever! To make our lives easier, I strongly advocate we only resample once during the analysis phase (eg. from 12.4kHz to 3kHz). @KrisBouchard What would it take for your to be convinced that fractional sample rates are not evil?

Finally scipy.signal.resample by default uses a time domain approach, but there is also FFT approach. I believe that the process_nwb.resample was adapted from the FFT approach. @JesseLivezey Is that right? And if so, is there a reason you favored the FFT based approach?

One last thought is that if we end up using an FFT based approach, does it make sense to directly resample in the frequency domain while doing the wavelet transform (ie. only transform a subset of frequency)? That away we only convert to frequency space once as opposed to ping ponging back and forth between the time and frequency domains, which is inefficient. As we introduce neuropixels, we will have potentially 10x more data. So to the extent we can make the pipeline scalable, the easier our lives will be :)

JesseLivezey commented 3 years ago

@jthermiz I think that's a new error :(. Which block is this on?

These problems seem to happen for the mkl fft library. A few years ago the mkl library was sometimes 10-100x faster than the default numpy or scipy routines. It might make sense to do a little benchmark and just use numpy or scipy if they don't have as many problems. I can run a few tests. It looks like scipy has some parallelization. which could also help.

The resample will be by far the slowest operation for NWB conversion with the current pipeline. I'd also advocate for not resampling, if that's acceptable. We'd eventually have to resample, but it would be during the wavelet steps, which are also slow.

@jthermiz I think the scipy.signal.resample docs are a little confusing. It does frequency domain resampling. The domain='time' argument specifies whether the input is in the time domain or if it has already had a fft applied to it. See here https://github.com/scipy/scipy/blob/44e31b1eb202ef91caad059952ba268d2610a60b/scipy/signal/signaltools.py#L3028

Doing time-domain resampling is extremely slow for signals of this length.

process_nwb.resample does the same basic operations as scipy.signal.resample, but we do padding before resampling. If you don't pad the signals, you can get some pretty big artifacts from the frequency domain resample near the edges of the resampled signal, which is where baselines are often computed from. It might also be possible to zero pad using the scipy routines without making a copy of the array. Needs some testing.

jthermiz commented 3 years ago

@JesseLivezey I ran into this error when trying to convert this block: RVG16_B06

Re: time vs frequency domain: That makes sense.

Re padding: Depending on how long the artifact lasts for and alternative is to throw away samples in the beginning and end of the recording. It usually takes us 5-10 sec to press play on dvd player (the play button on the dvd player is not so responsive these days...). If the artifact last for < 1 sec, I'd vote for karate chopping off dt (where dt <= 1 sec) at the beginning and end of the data. :martial_arts_uniform:

JesseLivezey commented 3 years ago

I'm going to switch process_nwb to using scipy's fft for now. It doesn't have a problem with that block and it's reasonably fast. I'll cut a new pip version so the default install method works.

https://github.com/IntelPython/mkl_fft/issues/62

JesseLivezey commented 3 years ago
pip uninstall process-nwb
pip install process-nwb

should fix this.

jthermiz commented 3 years ago

I was able to resample a full tone ECoG (128 channels @ 12.xxx kHz) down to 4kHz on catscan.