bendichter opened 9 months ago
Hi @bendichter. parse_header is more or less: sampling rate + number of channels + t_start + number of samples + metadata.
For some formats, getting the number of samples can take a while because we need to parse the entire file with a very bad loop, jumping from data block to data block.
We need this number of samples before the read (get_analog_signal_chunk).
In short, what you propose is decomposing the header parsing into 2 parts: all the metadata first, and then the number of samples. I think this is quite a big internal API change. What could be done is to improve the header parsing of some older readers like the plexon one and use numba for jumping from data block to data block.
Thanks for the additional context. Let me explain why this is important for us. When we built NeuroConv and NWB GUIDE on top of NEO, we imagined that construction of a NeoRawIO object would be fast (let's say <10 seconds) for each file, and we built workflows around this assumption. In most cases it is true, but in some cases, Plexon in particular, it isn't, and that causes a big usability problem. If this were an obscure data format, I would be inclined to just let it be, but Plexon is quite popular, and I think this will end up causing a large usability problem for many users.
In the GUIDE, we extract as much metadata as we can from the source files and then have the user fill in the rest themselves. If you want the session start time for a Plexon session, the current workflow is to first initialize the converter, then call .get_metadata() on an interface object, e.g.:
interface = PlexonInterface(file_path="file.plx")
interface.get_metadata()["NWBFile"]["session_start_time"]
PlexonInterface.__init__ internally calls PlexonRawIO.__init__, so there is currently no way around this long wait (20 minutes per file, which could mean hours). This causes a big problem for us, as this is in the middle of a workflow when a user is meant to add metadata. I suppose one solution would be to delay the initialization of the neo reader until after initialization of the PlexonInterface and write our own code for parsing the session start time and any other metadata. But it seems to me that having a PlexonRawIO constructor that takes 20 minutes might not be great for neo in general.
What if, in get_analog_signal_chunk, before data_blocks = self._data_blocks[5][chan_id], you had something like

    if self._data_blocks is None:
        self._data_blocks = self._parse_data_blocks()
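A slightly fuller sketch of what I'm imagining inside the reader, assuming a hypothetical _parse_data_blocks() helper that would hold the expensive scan currently done in _parse_header() (names and signatures here are illustrative, not the actual neo internals):

```python
from neo.rawio.baserawio import BaseRawIO


class PlexonRawIO(BaseRawIO):
    def _parse_header(self):
        # ... read only the cheap, fixed-size global header here ...
        self._data_blocks = None  # defer the expensive block-by-block scan

    def _parse_data_blocks(self):
        # hypothetical helper: the current block-scanning loop would move here
        ...

    def _get_analogsignal_chunk(self, block_index, seg_index, i_start, i_stop,
                                stream_index, channel_indexes):
        # lazily parse (and cache) the data blocks the first time data is read
        if self._data_blocks is None:
            self._data_blocks = self._parse_data_blocks()
        # ... existing lookup, e.g. data_blocks = self._data_blocks[5][chan_id] ...
```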
@bendichter,
A couple more points.
There is now a flag is_header_parsed that can be checked to make sure you don't double parse data. It seems like that is less of a concern for you, since you really would prefer the metadata and getting the sample/data shape to be separate, but this new flag may be helpful as you work (as long as you are on neo >0.13).
As far as bringing numba into this, @samuelgarcia, I'm sure you had some vision in mind, but in general, based on my reading, numba doesn't help with IO questions at all, and I'm not sure (although correct me if I'm wrong) that numba can handle the complex dtypes that so many rawios rely on for header parsing. When I was looking at issue tracker stuff, the advice for IO was to write a C extension if you really need fast IO. These thoughts are a little inchoate, so my apologies; I'm just trying to brainstorm right now, since it would be both an API change as well as a schema change to split up the header parsing. And for my workflows (with neo and with spikeinterface) I'm okay with a long wait right at the beginning, but I totally understand why you wouldn't want to wait around.
I don't know NWB GUIDE at all, but is there a way that you could switch it to async, so that you can have the parsing occurring while also collecting metadata? For something like an app/GUI framework it would make sense (to me) to have multiple things going on at once rather than trapping the user into doing one step at a time.
The GUIDE is a graphical wrapper around NeuroConv operations, which has the following multi-step workflow for a RecordingInterface like Plexon (roughly sketched in code below):

i) initialize the SpikeInterface recording extractor (I believe the contentious thing here is that neo parses the header metadata as part of this initialization)

ii) (optional) fetch default metadata from the interface. For neo-reliant interfaces, this admittedly just retrieves the already-parsed metadata

iii) (optional) the user modifies or confirms metadata by passing it back to the interface (including file-wide metadata, or cross-object metadata, not necessarily just the recording-specific stuff). The default just passes back everything that was fetched automatically in (ii)

iv) the user calls run_conversion, which constructs all the in-memory NWBFile objects and then writes the file to disk

The GUIDE makes (ii) and (iii) far more intuitive, since we can visually display all the information with detailed explanations of what the various fields mean, which makes it a very interactive process.
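For concreteness, that workflow looks roughly like this in code (interface and method names as used in this thread; exact signatures may differ between NeuroConv versions):

```python
from neuroconv.datainterfaces import PlexonRecordingInterface  # import path assumed

# i) constructing the interface builds the SpikeInterface/neo extractor,
#    which is where the slow Plexon header parse currently happens
interface = PlexonRecordingInterface(file_path="file.plx")

# ii) fetch the default metadata already pulled out of the file
metadata = interface.get_metadata()

# iii) the user edits/confirms the metadata and passes it back
metadata["NWBFile"]["session_description"] = "my session"

# iv) build the in-memory NWBFile objects and write the file to disk
interface.run_conversion(nwbfile_path="out.nwb", metadata=metadata)
```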
So for:

> then you can have parsing occurring while also collecting metadata?

I'm not sure what kind of workflow you're envisioning; the parsing of a lightweight metadata header, as I understand it, is itself part of the 'collecting metadata' step (the values in the file provide many defaults, such as channel names, electrode group structure, gains, even the session start time).
So we can't go from 'select plexon interface' to 'specify source file for plexon' (which then initializes the neo object) to 'edit metadata for plexon' asynchronously, since the steps are dependent.
I'll point out that maybe the confusion is due to a mix-up in understanding what a parse_header function should do.
As I read it, for something like SpikeGLX, this should just mean 'read the .meta file and form a dictionary out of it'.
Or for Intan, 'read the first X number of bytes at the top of the file and form a dictionary out of it'.
But my impression for Plexon is that it also scans all the packets of data collecting some summary info and that is the expensive step, right? So it's doing something fundamentally different than other more straightforward header parsing does
I've not delved into the code though so let me know if I'm way off on this
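Just to illustrate why that style of header parse is cheap, here is a minimal sketch of the 'read the .meta file and form a dictionary' idea (not neo's actual SpikeGLX code, purely illustrative):

```python
from pathlib import Path


def read_spikeglx_meta(meta_path):
    """Illustrative only: SpikeGLX .meta files are small key=value text files,
    so building the metadata dictionary is essentially free."""
    info = {}
    for line in Path(meta_path).read_text().splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip()
    return info
```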
Also: a really cheap way of getting better speed without going to C is to offer a parallelized version of the procedure, if it's parallelizable.
> In the GUIDE, we extract as much metadata as we can from the source files and then have the user fill in the rest themselves
In Ben's message I thought he said you were collecting some metadata from the user. So you could have the 20-minute-long parsing of the file-based metadata + the rest of neo's parsing (number of samples, memmap, etc.) running while you also collect the user-based metadata, so that them inputting various info occupies the same time, rather than waiting the full 20 minutes to finish before being able to work on (the rest of) the metadata. But based on your (iii), maybe it is actually that you allow them to modify the metadata rather than just collect metadata not saved in the file?
(NB I'm referring to the file-based metadata, which Neo is grabbing, and the user-based metadata, which would be things like animal genotype or the name of the experiment that only the user could provide, as two separate things.)
> As I read it, for something like SpikeGLX, this should just mean 'read the .meta file and form a dictionary out of it'
> Or for Intan, 'read the first X number of bytes at the top of the file and form a dictionary out of it'
> But my impression for Plexon is that it also scans all the packets of data collecting some summary info and that is the expensive step, right? So it's doing something fundamentally different than other more straightforward header parsing does
The parse_header does two things. It creates a dictionary of the metadata; for example, for SpikeGLX all the metadata is actually stored in the signals_info_list within the reader. But it also uses parts of that metadata to generate the memory map of the file. This involves getting the total number of samples and fitting those samples to the data structure (which can be super slow for an inefficient format like Plexon). When I use Intan my parse_header time often takes 5-10 minutes (although I often do it over the network, which slows things down). Multiple rawio._parse_header() functions do the full parsing as I just described, but for some formats this process is fast and for some it is slow. SpikeGLX just has the metafile, which is easy to generate the dictionary from, and then binary files which can be pretty rapidly mapped; other formats are not as easy to handle.
Again, @samuelgarcia is the authority on the rawio flow and goals, so I would defer to him if he wants to correct anything I've just said. Conceptually I understand what you want, but I think the 'should' in this case was that parse_header did all the long, hard work at the beginning so the user would never have to do that again. So based on the original schema, Plexon is doing what it is supposed to; it is just an extreme example of how slow that strategy can be.
> Also: a really cheap way of getting better speed without going to C is to offer a parallelized version of the procedure, if it's parallelizable
This is beyond me, so if Sam knew how to do this it might be possible, but I would have to read more to feel like I could safely do it. Currently the idea is that you read a block of the file using numpy functions serially; with a parallel approach I don't know how I could make sure we are safely going through the data blocks.
Hope you're well @CodyCBakerPhD :)
> also collect the user-based metadata so that them inputting various info occupies the same time rather than waiting the full 20 minutes
It doesn't take 20 minutes to type in the info for those forms; and unless neo somehow provides to us what fields are in the files prior to parsing, we can't let the user know 'this field will be inferred from your Plexon files'
The long expensive part of the conversion process is after you hit the 'run conversion' button, which is why we're pushing to have anything that needs to iterate over data packets occur then.
> The parse_header does two things.
Thanks for the explanation. Shocking to hear that you don't have sufficient information to form a memory map from a very small amount of top-level text values (all you ought to need is dtype, column order, and shape?)
Could you get the shape from a single packet, then use the packet size as a heuristic to multiply by the number of packets?
Otherwise, what about a pre-step that compiles and caches the neo-required fields from an inefficient format, such that the primary workflow can then always expect the information in either a lightweight text file (like SpikeGLX has) or a precomputed lightweight file for the others?
> It doesn't take 20 minutes to type in the info for those forms
I know. Just trying to occupy some of the time so it doesn't feel as long. But not knowing what to fill in until after parsing would make this unhelpful.
> Thanks for the explanation. Shocking to hear that you don't have sufficient information to form a memory map from a very small amount of top-level text values (all you ought to need is dtype, column order, and shape?)
For simple file formats that is all you need (honestly we always seem to have 'C' column order; I don't know why, but it makes sense that most files would just use 'C' order), plus an offset to account for the header. So: dtype, column order, shape, and offset. But for complicated formats you need to traverse the file to get this info. Heberto was telling me that Sam told him (so lots of hearsay) that one of the formats Neo supports has its header in the middle of the file, which means the file needs to be searched to find the header, which will take time as well (people should really follow the convention that the header is at the head of the file). So a single "packet" is not always equivalent to all packets.
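As a concrete illustration of the simple case (the numbers are made up, not from any particular format): once you know dtype, column order, shape, and offset, the memory map is a single call, and the shape can usually be derived from the file size.

```python
import os

import numpy as np

header_bytes = 1024        # offset past the header (format-specific)
n_channels = 64            # read from the header
sample_dtype = np.dtype("int16")

# derive the number of samples from the file size and the fixed record size
file_size = os.path.getsize("data.bin")
n_samples = (file_size - header_bytes) // (n_channels * sample_dtype.itemsize)

signals = np.memmap("data.bin", dtype=sample_dtype, mode="r",
                    offset=header_bytes, shape=(n_samples, n_channels), order="C")
```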
This being said, I ran a couple of tests just now; see below.
I did just run a test, and a 100 MB Intan file was parsing in 100 ms vs a 10 MB Plexon file taking 1.56 seconds. So we are looking at ~100x slower if the files were the same size (assuming they scale). I think the issue is that some file formats have nice packets and some don't. For Plexon, the different sets of data are set up like this (this is just the dtype information for the different parts of the headers): https://github.com/NeuralEnsemble/python-neo/blob/0c175b4ece81cc588971a02069af5d87e91f9cb6/neo/rawio/plexonrawio.py#L511C1-L613C14.
Again, just comparing Intan and Plexon: Intan makes one memory map with only one global offset for the header. Plexon makes 5 memory maps where we need to adjust the offset based not just on the global header but on the sub-headers. Then we do the final data cleanup, which I assume is part of the slow part since it is a nested for loop.
I think one variable between RawIOs that may also affect this is that some RawIOs try to pull out events, spikes, and waveforms at the same time, which often requires traversing the data. Currently Intan only pulls the signal and doesn't spend time traversing for things like spike waveforms, which Plexon does. If people want to use Neo as a standalone it makes sense to pull out all that info, but for something like spikeinterface or neuroconv it is probably not useful to preload these when you want the raw data and can find spikes yourself later. So maybe that is contributing. --I don't want to delete this, but I just tested and I went from an average of 1.56 seconds to 1.52, so this idea is not contributing. Shutting off tqdm led to a 1.16x speed-up, though.
Some other quick facts: parsing the Intan header (for metadata) took ~18 ms and making the memmap took 200 microseconds, so the other ~80 ms in this case is coming from rawio overhead. It might be a decent idea to run a more sophisticated profile on your Plexon example data so we can better figure out where things are getting hung up.
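For reference, the kind of profiling run I mean (and the ones quoted below) can be reproduced with something like:

```python
import cProfile
import pstats

from neo.rawio import PlexonRawIO

reader = PlexonRawIO(filename="file.plx")  # adjust to your file

profiler = cProfile.Profile()
profiler.enable()
reader.parse_header()          # the expensive step being discussed
profiler.disable()

# show the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```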
Just ran cProfile and got that, for ~2 s total (with the cProfile overhead), the breakdown for Plexon is:
- _parse_header by itself contributes 0.824 seconds
- numpy .view is 0.619 seconds (including inner function calls)
- numpy memmap is 1.0 seconds (including inner function calls, split ~0.5 s for __array_finalize__ and ~0.5 s for __getitem__)
- a few more numpy methods take 0.2 seconds, and then everything else is basically 1 ms or less.
> Otherwise, what about a pre-step that compiles and caches the neo-required fields from an inefficient format, such that the primary workflow can then always expect the information in either a lightweight text file (like SpikeGLX has) or a precomputed lightweight file for the others?
Forgot to respond to this part. That could work, but again that's a bit more of an api level change. Could be nice though. We would really need to profile plexon though.
@zm711
> Just trying to occupy some of the time so it doesn't feel as long
I get the idea, but also note we are talking 20 minutes per session, and there could easily be 20 sessions in a conversion.
Sorry, one last note on profiling, then I'm done for now.
If we get rid of the data_block organization within the header parsing, we cut the time to 1 second (so a 2x speed-up). So I think the issue with Plexon vs something like Intan is that Intan makes its one memmap and then only really loads the memmap during get_analogsignal_chunk, whereas Plexon does a bunch of data_block organization that requires constant reads from the memmap, which vastly slows down the header parsing (we cut memmap calls from 1 second to 0.51 seconds).
So if we want to enhance the header parsing we should think about limiting the calls to the memmap that is generated. Getting rid of the data_block preparation caused a speed-up of 2.67x for _parse_header() in general (but it would break other code, so this would require a careful refactor). @samuelgarcia will have to comment on why we need to do the data_block organization here rather than just find a way to correctly map to the memmap later. But it seems like Plexon is super front-loading its organization. I think this was your suggestion with the parse_data_blocks later (@bendichter).
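To illustrate what 'limiting the calls to the memmap' could mean (using a made-up block layout, not Plexon's actual dtypes): slicing the memmap inside a tight Python loop pays the memmap __getitem__/__array_finalize__ overhead seen in the profiles above on every block, whereas reading the bytes into RAM once and taking plain-ndarray views keeps that overhead out of the loop.

```python
import numpy as np

# hypothetical per-block header layout, for illustration only
block_dtype = np.dtype([("type", "<i2"), ("timestamp", "<u4"),
                        ("channel", "<i2"), ("n_samples", "<i2")])

block_offsets = [0, 512, 1024]  # placeholder offsets

# slow pattern: one tiny memmap slice per data block
mm = np.memmap("file.plx", dtype="u1", mode="r")
headers_slow = [mm[o:o + block_dtype.itemsize].view(block_dtype) for o in block_offsets]

# cheaper pattern: read the bytes into RAM once, then take plain-ndarray views
buf = np.fromfile("file.plx", dtype="u1")
headers_fast = [buf[o:o + block_dtype.itemsize].view(block_dtype) for o in block_offsets]
```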
Thanks for looking into this. Yes, that was my suggestion. Also I would expect the speed up to be way more than 2.67x on a real sized recording file.
If you are willing to share a file I can do some testing on my end. I was just playing around with the Plexon website's free sample data (which is 10 MB), but if you have something bigger I could also profile that, so we could see (it might be worth testing, because if this scales I don't know if going from 20 min to 10 min helps enough for making the user wait around).
I just find it super interesting that Plexon as-is is between 10-100x slower than Intan, and if I comment out the data_blocks parsing with the small file it is still 5-50x slower than Intan. That makes me think this is a combo of how we are doing it + some inefficiencies in the .plx format.
Sounds good. I sent you an invitation to a file on Google Drive via your email.
I suppose if it is really not feasible to change this on the NEO side, we could create a new type of data interface that delays the initialization of the neo reader and uses custom code to extract metadata. Then all neo readers that have a constructor which takes more time than is acceptable for our use-cases can use this other data interface instead of our current one.
Wouldn't even need to be a 'new kind' of interface; just override the __init__ behavior specific to the PlexonRecordingInterface to not initialize the extractor until the add_to_nwbfile step, and put the custom metadata parsing in the get_metadata step.
The greatest downside is that if there's a problem reading the file data, we usually know that right away from the __init__ step, so the user normally can't spend/waste time going further. With this approach they wouldn't know of issues until the 2nd-to-last step of the workflow (the last being DANDI upload).
yes, you could do it that way, but I don't like skipping super().__init__
> Sounds good. I sent you an invitation to a file on Google Drive via your email.
I'm downloading it now. I'm doing recordings the rest of the day, but I'll profile it tomorrow and post here. That way you can decide if you want to change on your side and we can decide how updates would scale to large files on the neo side.
You could super(parent).__init__ to bypass the neo init but not bypass the BaseDataInterface init.
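Very roughly, that could look like the following on the NeuroConv side (class and method names taken from this discussion; import paths and signatures are assumed rather than checked, so treat it as a sketch):

```python
from neuroconv.basedatainterface import BaseDataInterface          # path assumed
from neuroconv.datainterfaces import PlexonRecordingInterface      # path assumed


class LazyPlexonRecordingInterface(PlexonRecordingInterface):
    """Sketch: defer the expensive neo reader until data is actually needed."""

    def __init__(self, file_path, **kwargs):
        # bypass PlexonRecordingInterface.__init__ (which builds the neo reader),
        # but keep the BaseDataInterface bookkeeping
        BaseDataInterface.__init__(self, file_path=file_path, **kwargs)
        self._deferred_kwargs = dict(file_path=file_path, **kwargs)
        self._extractor_ready = False

    def get_metadata(self):
        metadata = dict(NWBFile=dict())
        # lightweight custom parsing of the global .plx header would go here,
        # e.g. to fill in metadata["NWBFile"]["session_start_time"]
        return metadata

    def add_to_nwbfile(self, nwbfile, metadata=None, **conversion_options):
        if not self._extractor_ready:
            # the ~20-minute PlexonRawIO header parse happens here, at conversion time
            PlexonRecordingInterface.__init__(self, **self._deferred_kwargs)
            self._extractor_ready = True
        super().add_to_nwbfile(nwbfile, metadata=metadata, **conversion_options)
```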
@CodyCBakerPhD yes, that's an option, though I would prefer that the __init__ hierarchy match the class hierarchy as much as possible, particularly in NeuroConv (as opposed to a lab-specific repo where future-proofing is less of a priority) and particularly in cases that might come up again (in this case, for any NeoRawIO that takes a long time to initialize). In any case, let's discuss this off of the neo channel, since it's really a NeuroConv matter.
So, with the file I was testing (just so we have a baseline and the changes), I ran cProfile, and I figured I can just display the raw results. But to summarize first:
total parse_header time, with the cProfile overhead, was 15:39. Of note, I did not use tqdm.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 939.752 939.752 {built-in method builtins.exec}
1 0.000 0.000 939.752 939.752 <string>:1(<module>)
1 0.912 0.912 939.752 939.752 baserawio.py:172(parse_header)
1 450.398 450.398 938.840 938.840 plexonrawio.py:68(_parse_header)
97977737 122.140 0.000 268.985 0.000 {method 'view' of 'numpy.ndarray' objects}
195955481 229.603 0.000 251.810 0.000 memmap.py:289(__array_finalize__)
195955696 80.402 0.000 216.492 0.000 memmap.py:334(__getitem__)
97977688 31.126 0.000 31.126 0.000 _internal.py:525(_view_is_safe)
195955477 11.246 0.000 11.246 0.000 multiarray.py:1393(may_share_memory)
195955489 10.960 0.000 10.960 0.000 {built-in method builtins.hasattr}
48989122 2.784 0.000 2.784 0.000 {method 'append' of 'list' objects}
And for me doing a dummy removal of the organization of the data_blocks (@bendichter's idea was to do this later, although for my test I was just removing it):
we get a total time of 6:31.468 (again, no tqdm).
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 391.468 391.468 {built-in method builtins.exec}
1 0.000 0.000 391.468 391.468 <string>:1(<module>)
1 1.011 1.011 391.468 391.468 baserawio.py:172(parse_header)
1 190.081 190.081 390.456 390.456 plexonrawio.py:68(_parse_header)
48988893 58.561 0.000 113.651 0.000 {method 'view' of 'numpy.ndarray' objects}
97977793 75.073 0.000 85.629 0.000 memmap.py:289(__array_finalize__)
97978008 37.703 0.000 83.208 0.000 memmap.py:334(__getitem__)
48988844 14.965 0.000 14.965 0.000 _internal.py:525(_view_is_safe)
97977789 5.474 0.000 5.474 0.000 multiarray.py:1393(may_share_memory)
97977801 5.081 0.000 5.081 0.000 {built-in method builtins.hasattr}
48989133 2.587 0.000 2.587 0.000 {method 'append' of 'list' objects}
The final result of delaying is an overall speedup of 2.4x, which suggests the improvement would roughly scale. What do you think, @bendichter and @CodyCBakerPhD?
Is your feature request related to a problem? Please describe.
Initializing PlexonRawIO can take 20 minutes to parse the headers for a real file. This is a problem e.g. in NWB GUIDE, where we need to get metadata. This requires initialization of every session, which would amount to making the user wait for hours in the middle of building the conversion just to fetch the metadata from each session.

Describe the solution you'd like
I would like to keep the metadata parsing in the constructor and move the expensive header parsing to the read command. This would use caching to ensure it is only executed once, even when the read command is used multiple times. This would not improve the speed, but would delay the long wait until a time that is much better for usability.

Describe alternatives you've considered
It would be great to also improve the efficiency of the header parsing code, if possible.