NeuralEnsemble / python-neo

Neo is a package for representing electrophysiology data in Python, together with support for reading a wide range of neurophysiology file formats
http://neo.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Question - What is in an OpenEphysIO header? #1488

Closed EliseA-UoB closed 3 months ago

EliseA-UoB commented 4 months ago

Hi all!

I'm building an analysis pipeline on my OpenEphys data, which is in the legacy format (i.e., .continuous).

I've been using the neo.io.OpenEphysIO class to handle the data, but have found it to be tediously slow to open the files every time I want to look at the data: reader = neo.io.OpenEphysIO(directory_2_files)

It seems that it is trying to load all the experiment data (i.e., recordings from all the channels) into the reader, because the reader is very large!

Can you confirm if that's what it is doing? And if so, is there any way to ask the neo handler to not load/hold on to all the data at once?

Thanks, Elise

zm711 commented 4 months ago

All great questions @EliseA-UoB !

Have you tried our rawio layer? Maybe this would be a little faster. The general idea is that at the rawio layer we read in the header first. How slow this is depends on the format, but we need to read the header to know how the data is organized, its dtype, etc. We can't really speed that up, because even if you only want one bit of data we still need to make it all the way through the header.

Once we've done the header read, we make memmaps of all the data. This can be a little slow for giant files or when there are a lot of files, but once it's done the maps are all lazy, so we only get the real data when you request it. (I.e., if we were actually loading all the data it would be much, much slower and would use up your RAM.)
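To illustrate the laziness, here is a minimal sketch of the rawio access pattern (the block/segment/stream indices, sample counts, and path here are illustrative assumptions, not values from your dataset):

import neo

reader = neo.rawio.OpenEphysRawIO(dirname='path/to/data')
reader.parse_header()  # the slow part: reads headers and builds lazy memmaps

# only this call actually touches bytes on disk, and only the requested slice
raw_chunk = reader.get_analogsignal_chunk(
    block_index=0, seg_index=0,
    i_start=0, i_stop=30000,  # e.g., one second at 30 kHz
    stream_index=0,
)
# rescale from raw integers to physical units as floats
sigs = reader.rescale_signal_raw_to_float(raw_chunk, stream_index=0)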

Could you tell us the size of your directory, how long it's taking, and if it speeds up using

reader = neo.rawio.OpenEphysRawIO(directory_2_files)
reader.parse_header()

instead?

EliseA-UoB commented 4 months ago

Thanks @zm711 !

The directory for the experiment is 19.3 GB.

I have tried the rawio layer and found it wasn't much better, sadly. Below is my code and results:

print("IO")
tic = time.time()
reader = neo.io.OpenEphysIO(dir_raw)
toc = time.time()
print("Time elapsed:", toc-tic)

print("RawIO")
tic = time.time()
reader = neo.rawio.OpenEphysRawIO(dir_raw)
reader.parse_header()
toc = time.time()
print("Time elapsed:", toc-tic)

Output:

IO
Time elapsed: 55.430652379989624
RawIO
Time elapsed: 42.46205925941467

Which isn't too bad, but I don't want to recreate the reader every time I revisit an experiment. I'm tempted to try saving the reader (even though it's huge) so that I can skip the slow memmap creation - do you think that would work?

zm711 commented 4 months ago

Hey @EliseA-UoB,

Sorry, I was on vacation, but I am back now. What do you mean by saving the reader? The memmap lives in RAM and provides addresses into the file on disk, so a memmap isn't persistent. Depending on what you want to do, you could harvest the raw data you want and store it in a better format. For example, spikeinterface uses neo under the hood to read files and then gives you multiprocessing saving options, so that could be a solution to your problem. Once you save in a spikeinterface format it will be faster to load, because you won't have the OpenEphys headers to get through. Would you be willing to try that? I can walk you through it!

samuelgarcia commented 4 months ago

Hi @EliseA-UoB. Thank you for reporting. These two timings are really strange; OpenEphysRawIO should be super fast. I suspect that your dataset has digital streams. Alessio introduced an event detector in the init some time ago that goes through the entire file: https://github.com/NeuralEnsemble/python-neo/blob/master/neo/rawio/openephysbinaryrawio.py#L184 Could you try to install from source and comment out these lines?

If the code is faster, then we should add an option for faster reading to the OpenEphys binary reader.

@alejoe91

zm711 commented 4 months ago

@samuelgarcia -- I thought .continuous was the legacy format, not the binary format that openephysbinaryrawio handles? Am I wrong?

alejoe91 commented 4 months ago

@zm711 you're correct. This is the legacy Open Ephys format.

samuelgarcia commented 4 months ago

Oops, sorry, forget it then. I am tired.

EliseA-UoB commented 4 months ago

Thanks all!

@zm711 - Hope you had a great vacation! Thanks for the idea on spikeinterface - if you have any example scripts you can share, that'd be great!

zm711 commented 4 months ago

Thanks.

Something like this:

import spikeinterface.extractors as se

# see the docstring info below to fill out the other necessary arguments
ephys_data = se.read_openephys(folder_path='path/to/data')

# format can be "binary" or "zarr" (zarr needs extra dependencies but has compression);
# n_jobs is how much multiprocessing you want to use
saved_ephys_data = ephys_data.save(format='binary', folder='path/to/save', n_jobs=4)

So the parameters for read_openephys are:

    Parameters
    ----------
    folder_path : str
        The folder path to load the recordings from
    stream_id : str, default: None
        If there are several streams, specify the stream id you want to load
    stream_name : str, default: None
        If there are several streams, specify the stream name you want to load
    block_index : int, default: None
        If there are several blocks (experiments), specify the block index you want to load
    all_annotations : bool, default: False
        Load exhaustively all annotations from neo
    ignore_timestamps_errors : None
        Deprecated keyword argument. This is now ignored.
        neo.OpenEphysRawIO now handles gaps directly, but this makes the read slower.

Feel free to ask if you don't understand something, but they basically map to neo arguments.

So basically two lines of code. Now, if you have multiple streams of data, you would save each one into a separate binary file through spikeinterface, as sketched below.
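A hedged sketch of that per-stream loop, using neo to list the stream ids (the paths and the n_jobs value are placeholder assumptions):

import neo
import spikeinterface.extractors as se

# list the available signal streams via the neo header
reader = neo.rawio.OpenEphysRawIO(dirname='path/to/data')
reader.parse_header()
stream_ids = reader.header['signal_streams']['id']

# save each stream into its own binary folder
for stream_id in stream_ids:
    rec = se.read_openephys(folder_path='path/to/data', stream_id=stream_id)
    rec.save(format='binary', folder=f'path/to/save/stream_{stream_id}', n_jobs=4)

Either way, this should skip a bunch of the slow header parsing that OpenEphys requires on later loads. Additionally, you could do preprocessing and only save the final output if you're interested: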

import spikeinterface.preprocessing as spre

ephys_data = se.read_openephys(folder_path='path/to/data')

filtered_ephys_data = spre.bandpass_filter(ephys_data, freq_min=300, freq_max=6000)
saved_ephys_data = filtered_ephys_data.save(format='binary', folder='path/to/save')
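Once it's saved, reloading later should be nearly instant, since the binary folder skips the OpenEphys header parsing entirely. A sketch (the path is a placeholder):

import spikeinterface as si

# load the previously saved recording from its folder
saved_ephys_data = si.load_extractor('path/to/save')
traces = saved_ephys_data.get_traces(start_frame=0, end_frame=30000)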

EliseA-UoB commented 4 months ago

I'll give it a go and report back, thank you!

zm711 commented 3 months ago

I'll close this for now, but re-open if this doesn't work.