Closed. EliseA-UoB closed this issue 3 months ago.
All great questions @EliseA-UoB !
Have you tried the rawio
layer? It might be a little faster. The general idea is that at the rawio layer we read the header first. This is slow and format-dependent, but we need to read the header first to know how the data is organized, its dtype, etc. We can't really speed that up, because even if you only want one bit of data we likely still need to make it all the way through the header.
Once we've read the header, we create memmaps of all the data. This can be a little slow for giant files, or when there are a lot of files, but once it's done the maps are all lazy, so we only fetch the real data when you request it. (If we were actually loading all the data, it would be much, much slower and would use up your RAM.)
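(As an aside, this lazy-memmap behaviour can be demonstrated with plain NumPy, independent of neo; a minimal sketch with a made-up file:)

```python
import os
import tempfile

import numpy as np

# Write a small fake raw file: 30,000 int16 samples (~60 kB on disk).
path = os.path.join(tempfile.mkdtemp(), "fake_raw.dat")
np.arange(30_000, dtype="int16").tofile(path)

# Creating the memmap is cheap: it only records the mapping, no sample data is read yet.
mm = np.memmap(path, dtype="int16", mode="r")

# Bytes are pulled from disk only when a slice is actually requested.
chunk = np.asarray(mm[100:110])
print(chunk.tolist())  # [100, 101, 102, 103, 104, 105, 106, 107, 108, 109]
```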
Could you tell us the size of your directory, how long it's taking, and if it speeds up using
reader = neo.rawio.OpenEphysRawIO(directory_2_files)
reader.parse_header()
instead?
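Once parse_header() has run, chunks can then be read lazily. A sketch, assuming the legacy format (the directory path, stream/channel indices, and the helper name read_first_second are mine, not from this thread):

```python
def read_first_second(directory, stream_index=0):
    """Parse the header once, then lazily read ~1 s of channel 0 in physical units."""
    import neo  # imported inside the function so the sketch stands alone

    reader = neo.rawio.OpenEphysRawIO(directory)
    reader.parse_header()  # the slow part: header parsing + memmap setup

    sr = reader.get_signal_sampling_rate(stream_index=stream_index)
    # Only this slice is actually pulled from disk.
    raw = reader.get_analogsignal_chunk(
        block_index=0, seg_index=0,
        i_start=0, i_stop=int(sr),
        stream_index=stream_index,
        channel_indexes=[0],
    )
    # Rescale raw ADC values to floats in physical units (e.g. microvolts).
    return reader.rescale_signal_raw_to_float(
        raw, dtype="float64",
        stream_index=stream_index, channel_indexes=[0],
    )
```

After the one-time header cost, repeated calls like this only touch the requested slices.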
Thanks @zm711 !
The directory for the experiment is 19.3 GB.
I have tried the rawio layer, and found it wasn't much better, sadly. Below is my code and the results:
import time

import neo

print("IO")
tic = time.time()
reader = neo.io.OpenEphysIO(dir_raw)
toc = time.time()
print("Time elapsed:", toc - tic)

print("RawIO")
tic = time.time()
reader = neo.rawio.OpenEphysRawIO(dir_raw)
reader.parse_header()
toc = time.time()
print("Time elapsed:", toc - tic)
Output:
IO
Time elapsed: 55.430652379989624
RawIO
Time elapsed: 42.46205925941467
Which isn't too bad, but I don't want to recreate the reader every time I revisit an experiment. I'm tempted to try saving the reader (even though it's huge) so that I can skip the slow memmap creation - do you think that would work?
Hey @EliseA-UoB,
Sorry, I was on vacation, but I am back now. What do you mean by saving the reader? The memmap lives in RAM to provide addresses into the file on disk, so a memmap isn't persistent. Depending on what you want to do, you could harvest the raw data you need and store it in a better format. For example, spikeinterface uses neo under the hood to read files and then gives you multiprocessed saving options, so that could be a solution to your problem. Once you save in a spikeinterface format it will be faster to load, because you won't have the Open Ephys headers to get through. Would you be willing to try that? I can walk you through it!
Hi @EliseA-UoB.
Thank you for reporting.
These two timings are really strange; OpenEphysRawIO
should be super fast.
I suspect that your dataset has a digital stream.
Alessio introduced, some time ago, an event detector in the __init__ here that goes through the entire file:
https://github.com/NeuralEnsemble/python-neo/blob/master/neo/rawio/openephysbinaryrawio.py#L184
Could you try installing from source and commenting out those lines?
If the code is faster, then we should add an option for faster reading in the OpenEphys binary reader.
@alejoe91
@samuelgarcia -- I thought .continuous was legacy? Not the binaryrawio? Am I wrong?
@zm711 you're correct. This is legacy Open Ephys format
Oops, sorry. Forget it, then. I am tired.
Thanks all!
@zm711 - Hope you had a great vacation! Thanks for the idea on spikeinterface - if you have any example scripts you can share, that'd be great!
Thanks.
something like:
import spikeinterface.extractors as se

# see the docstring info below to fill out the other necessary arguments
ephys_data = se.read_openephys(folder_path='xx')  # plus stream_id/stream_name etc.

# format can be "binary" or "zarr" (needs extra dependencies but has compression);
# n_jobs is how many processes to use for multiprocessed saving
saved_ephys_data = ephys_data.save(format='binary', folder='path/to/save', n_jobs=xx)
so the parameters to read openephys are:
Parameters
----------
folder_path : str
The folder path to load the recordings from
stream_id : str, default: None
If there are several streams, specify the stream id you want to load
stream_name : str, default: None
If there are several streams, specify the stream name you want to load
block_index : int, default: None
If there are several blocks (experiments), specify the block index you want to load
all_annotations : bool, default: False
Load exhaustively all annotation from neo
ignore_timestamps_errors : None
Deprecated keyword argument. This is now ignored.
neo.OpenEphysRawIO is now handling gaps directly but makes the read slower.
Feel free to ask if you don't understand something, but they basically map to neo arguments.
So basically two lines of code. Now if you have multiple streams of data you would save each one into a separate binary file through spikeinterface. But this should prevent a bunch of the slow header parsing that OpenEphys requires. Additionally you could do preprocessing and only save the final output if you're interested.
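If you're not sure which streams your directory contains, you can list them from the parsed neo header first (a sketch; list_streams is a hypothetical helper name, not part of neo):

```python
def list_streams(directory):
    """Return the signal stream names found in an Open Ephys directory."""
    import neo  # imported inside the function so the sketch stands alone

    reader = neo.rawio.OpenEphysRawIO(directory)
    reader.parse_header()
    # header["signal_streams"] is a structured array with "name" and "id" fields.
    return [str(name) for name in reader.header["signal_streams"]["name"]]
```

Each name can then be passed to se.read_openephys(folder_path='xx', stream_name='xx') and saved into its own binary folder.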
import spikeinterface.preprocessing as spre

ephys_data = se.read_openephys(xx)
filtered_ephys_data = spre.bandpass_filter(ephys_data, freq_min=300, freq_max=6000)
saved_ephys_data = filtered_ephys_data.save(xx)
I'll give it a go and report back, thank you!
I'll close this for now, but re-open if this doesn't work.
Hi all!
I'm building an analysis pipeline on my OpenEphys data, which is in the legacy format (i.e., .continuous).
I've been using the neo.io.OpenEphysIO class to handle the data, but I have found it tediously slow to open the files every time I want to look at them:

reader = neo.io.OpenEphysIO(directory_2_files)

It seems that it is trying to load all the experiment data (i.e., recordings from all the channels) into the reader, because the reader object is very large! Can you confirm whether that is what it is doing? And if so, is there any way to ask the neo handler not to load/hold on to all the data at once?
Thanks, Elise