bendichter opened 9 months ago
Hi @bendichter. parse_header is more or less: sampling rate + number of channels + t_start + number of samples + metadata.
For some formats, getting the number of samples can take a while because we need to parse the entire file with a very bad loop, jumping from data block to data block.
We need this number of samples before the read (get_analog_signal_chunk).
In short, what you propose is decomposing the header parsing into 2 parts: all the metadata first, and then the number of samples. I think this is quite a big internal API change. What could be done is to improve the header parsing of some older readers like the plexon one and use numba for jumping from data block to data block.
Thanks for the additional context. Let me explain why this is important for us. When we built NeuroConv and NWB GUIDE on top of NEO, we imagined that construction of a NeoRawIO object would be fast (let's say <10 seconds) for each file, and we built workflows around this assumption. In most cases it is true, but in some cases, Plexon in particular, it isn't, and that causes a big usability problem. If this were an obscure data format, I would be inclined to just let it be, but Plexon is quite popular, and I think this will end up causing a large usability problem for many users.
In the GUIDE, we extract as much metadata as we can from the source files and then have the user fill in the rest themselves. If you want the session start time for a Plexon session, the current workflow is to first initialize the converter, then call .get_metadata() on an interface object, e.g.:
interface = PlexonInterface(file_path="file.plx")
interface.get_metadata()["NWBFile"]["session_start_time"]
PlexonInterface.__init__ internally calls PlexonRawIO.__init__, so there is currently no way around this long wait (20 minutes per file, which could mean hours). This causes a big problem for us, as this is in the middle of a workflow when a user is meant to add metadata. I suppose one solution would be to delay the initialization of the neo reader until after initialization of the PlexonInterface and write our own code for parsing the session start time and any other metadata. But it seems to me that having a PlexonRawIO constructor that takes 20 minutes might not be great for neo in general.
What if, in get_analog_signal_chunk, before data_blocks = self._data_blocks[5][chan_id], you had something like

    if self._data_blocks is None:
        self._data_blocks = self._parse_data_blocks()
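A slightly fuller sketch of what I'm imagining inside the reader, assuming a hypothetical _parse_data_blocks() helper that would hold the expensive scan currently done in _parse_header() (names and signatures here are illustrative, not the actual neo internals):

```python
from neo.rawio.baserawio import BaseRawIO


class PlexonRawIO(BaseRawIO):
    def _parse_header(self):
        # ... read only the cheap, fixed-size global header here ...
        self._data_blocks = None  # defer the expensive block-by-block scan

    def _parse_data_blocks(self):
        # hypothetical helper: the current block-scanning loop would move here
        ...

    def _get_analogsignal_chunk(self, block_index, seg_index, i_start, i_stop,
                                stream_index, channel_indexes):
        # lazily parse (and cache) the data blocks the first time data is read
        if self._data_blocks is None:
            self._data_blocks = self._parse_data_blocks()
        # ... existing lookup, e.g. data_blocks = self._data_blocks[5][chan_id] ...
```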
@bendichter,
A couple more points.
There is now a flag is_header_parsed that can be checked to make sure you don't double parse data. It seems like that is less of a concern for you, since you really would prefer the metadata and getting the sample/data shape to be separate, but this new flag may be helpful as you work (as long as you are on neo >0.13).
As far as bringing numba into this, @samuelgarcia, I'm sure you had some vision in mind, but in general, based on my reading, numba doesn't help with IO questions at all, and I'm not sure (although correct me if I'm wrong) that numba can handle the complex dtypes that so many rawios rely on for header parsing. When I was looking at issue tracker stuff, the advice for IO was to write a C extension if you really need fast IO. These thoughts are a little inchoate, so my apologies; I'm just trying to brainstorm right now, since it would be both an API change as well as a schema change to split up the header parsing. And for my workflows (with neo and with spikeinterface) I'm okay with a long wait right at the beginning, but I totally understand why you wouldn't want to wait around.
I don't know NWB GUIDE at all, but is there a way that you could switch it to async, so that you can have the parsing occurring while also collecting metadata? For something like an app/GUI framework it would make sense (to me) to have multiple things going on at once rather than trapping the user into doing one step at a time.
The GUIDE is a graphical wrapper around NeuroConv operations, which has the following multi-step workflow for a RecordingInterface like Plexon (roughly sketched in code below):

i) initialize the SpikeInterface recording extractor (I believe the contentious thing here is that neo parses the header metadata as part of this initialization)

ii) (optional) fetch default metadata from the interface. For neo-reliant interfaces, this admittedly just retrieves the already-parsed metadata

iii) (optional) the user modifies or confirms metadata by passing it back to the interface (including file-wide metadata, or cross-object metadata, not necessarily just the recording-specific stuff). The default just passes back everything that was fetched automatically in (ii)

iv) the user calls run_conversion, which constructs all the in-memory NWBFile objects and then writes the file to disk

The GUIDE makes (ii) and (iii) far more intuitive, since we can visually display all the information with detailed explanations of what the various fields mean, which makes it a very interactive process.
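For concreteness, that workflow looks roughly like this in code (interface and method names as used in this thread; exact signatures may differ between NeuroConv versions):

```python
from neuroconv.datainterfaces import PlexonRecordingInterface  # import path assumed

# i) constructing the interface builds the SpikeInterface/neo extractor,
#    which is where the slow Plexon header parse currently happens
interface = PlexonRecordingInterface(file_path="file.plx")

# ii) fetch the default metadata already pulled out of the file
metadata = interface.get_metadata()

# iii) the user edits/confirms the metadata and passes it back
metadata["NWBFile"]["session_description"] = "my session"

# iv) build the in-memory NWBFile objects and write the file to disk
interface.run_conversion(nwbfile_path="out.nwb", metadata=metadata)
```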
So for:

> then you can have parsing occurring while also collecting metadata?

I'm not sure what kind of workflow you're envisioning; the parsing of a lightweight metadata header, as I understand it, is itself part of the 'collecting metadata' step (the values in the file provide many defaults, such as channel names, electrode group structure, gains, even the session start time).
So we can't go from 'select plexon interface' to 'specify source file for plexon' (which then initializes the neo object) to 'edit metadata for plexon' asynchronously, since the steps are dependent.
I'll point out that maybe the confusion is due to a mix-up in understanding what a parse_header function should do.
As I read it, for something like SpikeGLX, this should just mean 'read the .meta file and form a dictionary out of it'.
Or for Intan, 'read the first X number of bytes at the top of the file and form a dictionary out of it'.
But my impression for Plexon is that it also scans all the packets of data collecting some summary info and that is the expensive step, right? So it's doing something fundamentally different than other more straightforward header parsing does
I've not delved into the code though so let me know if I'm way off on this
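Just to illustrate why that style of header parse is cheap, here is a minimal sketch of the 'read the .meta file and form a dictionary' idea (not neo's actual SpikeGLX code, purely illustrative):

```python
from pathlib import Path


def read_spikeglx_meta(meta_path):
    """Illustrative only: SpikeGLX .meta files are small key=value text files,
    so building the metadata dictionary is essentially free."""
    info = {}
    for line in Path(meta_path).read_text().splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip()
    return info
```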
Also: a really cheap way of getting better speed without going to C is to offer a parallelized version of the procedure, if it's parallelizable.
> In the GUIDE, we extract as much metadata as we can from the source files and then have the user fill in the rest themselves
In Ben's message I thought he said you were collecting some metadata from the user. So you could have the 20-minute-long parsing of the file-based metadata + the rest of neo's parsing (number of samples, memmap, etc.) running while you also collect the user-based metadata, so that them inputting various info occupies the same time, rather than waiting the full 20 minutes to finish before being able to work on (the rest of) the metadata. But based on your (iii), maybe it is actually that you allow them to modify the metadata rather than just collect metadata not saved in the file?
(NB I'm referring to the file-based metadata, which Neo is grabbing, and the user-based metadata, which would be things like animal genotype or the name of the experiment that only the user could provide, as two separate things.)
> As I read it, for something like SpikeGLX, this should just mean 'read the .meta file and form a dictionary out of it'
> Or for Intan, 'read the first X number of bytes at the top of the file and form a dictionary out of it'
> But my impression for Plexon is that it also scans all the packets of data collecting some summary info and that is the expensive step, right? So it's doing something fundamentally different than other more straightforward header parsing does
The parse_header does two things. It creates a dictionary of the metadata; for example, for SpikeGLX all the metadata is actually stored in the signals_info_list within the reader. But it also uses parts of that metadata to generate the memory map of the file. This involves getting the total number of samples and fitting those samples to the data structure (which can be super slow for an inefficient format like Plexon). When I use Intan my parse_header time often takes 5-10 minutes (although I often do it over the network, which slows things down). Multiple rawio._parse_header() functions do the full parsing as I just described, but for some formats this process is fast and for some it is slow. SpikeGLX just has the metafile, which is easy to generate the dictionary from, and then binary files which can be pretty rapidly mapped; other formats are not as easy to handle.
Again, @samuelgarcia is the authority on the rawio flow and goals, so I would defer to him if he wants to correct anything I've just said. Conceptually I understand what you want, but I think the 'should' in this case was that parse_header did all the long, hard work at the beginning so the user would never have to do that again. So based on the original schema, Plexon is doing what it is supposed to; it is just an extreme example of how slow that strategy can be.
> Also: a really cheap way of getting better speed without going to C is to offer a parallelized version of the procedure, if it's parallelizable
This is beyond me, so if Sam knew how to do this it might be possible, but I would have to read more to feel like I could safely do it. Currently the idea is that you read a block of the file using numpy functions serially; with a parallel approach I don't know how I could make sure we are safely going through the data blocks.
Hope you're well @CodyCBakerPhD :)
> also collect the user-based metadata so that them inputting various info occupies the same time rather than waiting the full 20 minutes
It doesn't take 20 minutes to type in the info for those forms; and unless neo somehow provides to us what fields are in the files prior to parsing, we can't let the user know 'this field will be inferred from your Plexon files'
The long expensive part of the conversion process is after you hit the 'run conversion' button, which is why we're pushing to have anything that needs to iterate over data packets occur then.
> The parse_header does two things.
Thanks for the explanation. Shocking to hear that you don't have sufficient information to form a memory map from a very small amount of top-level text values (all you ought to need is dtype, column order, and shape?)
Could you get the shape from a single packet, then use the packet size as a heuristic to multiply by the number of packets?
Otherwise, what about a pre-step that compiles and caches the neo-required fields from an inefficient format, such that the primary workflow can then always expect the information in either a lightweight text file (like SpikeGLX has) or a precomputed lightweight file for the others?
> It doesn't take 20 minutes to type in the info for those forms
I know. Just trying to occupy some of the time so it doesn't feel as long. But not knowing what to fill in until after parsing would make this unhelpful.
> Thanks for the explanation. Shocking to hear that you don't have sufficient information to form a memory map from a very small amount of top-level text values (all you ought to need is dtype, column order, and shape?)
For simple file formats that is all you need (honestly we always seem to have 'C' column order; I don't know why, but it makes sense that most files would just use 'C' order), plus an offset to account for the header. So: dtype, column order, shape, and offset. But for complicated formats you need to traverse the file to get this info. Heberto was telling me that Sam told him (so lots of hearsay) that one of the formats Neo supports has its header in the middle of the file, which means the file needs to be searched to find the header, which will take time as well (people should really follow the convention that the header is at the head of the file). So a single "packet" is not always equivalent to all packets.
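As a concrete illustration of the simple case (the numbers are made up, not from any particular format): once you know dtype, column order, shape, and offset, the memory map is a single call, and the shape can usually be derived from the file size.

```python
import os

import numpy as np

header_bytes = 1024        # offset past the header (format-specific)
n_channels = 64            # read from the header
sample_dtype = np.dtype("int16")

# derive the number of samples from the file size and the fixed record size
file_size = os.path.getsize("data.bin")
n_samples = (file_size - header_bytes) // (n_channels * sample_dtype.itemsize)

signals = np.memmap("data.bin", dtype=sample_dtype, mode="r",
                    offset=header_bytes, shape=(n_samples, n_channels), order="C")
```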
This being said, I ran a couple of tests just now; see below.
I did just run a test, and a 100 MB Intan file was parsing in 100 ms vs a 10 MB Plexon file taking 1.56 seconds. So we are looking at ~100x slower if the files were the same size (assuming they scale). I think the issue is that some file formats have nice packets and some don't. For Plexon, the different sets of data are set up like this (this is just the dtype information for the different parts of the headers): https://github.com/NeuralEnsemble/python-neo/blob/0c175b4ece81cc588971a02069af5d87e91f9cb6/neo/rawio/plexonrawio.py#L511C1-L613C14.
Again, just comparing Intan and Plexon: Intan makes one memory map with only one global offset for the header. Plexon makes 5 memory maps where we need to adjust the offset based not just on the global header but on the sub-headers. Then we do the final data cleanup, which I assume is part of the slow part since it is a nested for loop.
I think one variable between RawIOs that may also affect this is that some RawIOs try to pull out events, spikes, and waveforms at the same time, which often requires traversing the data. Currently Intan only pulls the signal and doesn't spend time traversing for things like spike waveforms, which Plexon does. If people want to use Neo as a standalone it makes sense to pull out all that info, but for something like spikeinterface or neuroconv it is probably not useful to preload these when you want the raw data and can find spikes yourself later. So maybe that is contributing. --I don't want to delete this, but I just tested and I went from an average of 1.56 seconds to 1.52, so this idea is not contributing. Shutting off tqdm led to a 1.16x speed-up, though.
Some other quick facts: parsing the Intan header (for metadata) took ~18 ms and making the memmap took 200 microseconds, so the other ~80 ms in this case is coming from rawio overhead. It might be a decent idea to run a more sophisticated profile on your Plexon example data so we can better figure out where things are getting hung up.
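For reference, the kind of profiling run I mean (and the ones quoted below) can be reproduced with something like:

```python
import cProfile
import pstats

from neo.rawio import PlexonRawIO

reader = PlexonRawIO(filename="file.plx")  # adjust to your file

profiler = cProfile.Profile()
profiler.enable()
reader.parse_header()          # the expensive step being discussed
profiler.disable()

# show the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```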
Just ran cProfile and got that, for ~2 s total (with the cProfile overhead), the breakdown for Plexon is:
- _parse_header by itself contributes 0.824 seconds
- numpy .view is 0.619 seconds (including inner function calls)
- numpy memmap is 1.0 seconds (including inner function calls, split ~0.5 s for __array_finalize__ and ~0.5 s for __getitem__)
- a few more numpy methods take 0.2 seconds, and then everything else is basically 1 ms or less.
> Otherwise, what about a pre-step that compiles and caches the neo-required fields from an inefficient format, such that the primary workflow can then always expect the information in either a lightweight text file (like SpikeGLX has) or a precomputed lightweight file for the others?
Forgot to respond to this part. That could work, but again that's a bit more of an api level change. Could be nice though. We would really need to profile plexon though.
@zm711
> Just trying to occupy some of the time so it doesn't feel as long
I get the idea, but also note we are talking 20 minutes per session, and there could easily be 20 sessions in a conversion.
Sorry, one last note on profiling, then I'm done for now.
If we get rid of the data_block organization within the header parsing, we cut the time to 1 second (so a 2x speed-up). So I think the issue with Plexon vs something like Intan is that Intan makes its one memmap and then only really loads the memmap during get_analogsignal_chunk, whereas Plexon does a bunch of data_block organization that requires constant reads from the memmap, which vastly slows down the header parsing (we cut memmap calls from 1 second to 0.51 seconds).
So if we want to enhance the header parsing we should think about limiting the calls to the memmap that is generated. Getting rid of the data_block preparation caused a speed-up of 2.67x for _parse_header() in general (but it would break other code, so this would require a careful refactor). @samuelgarcia will have to comment on why we need to do the data_block organization here rather than just find a way to correctly map to the memmap later. But it seems like Plexon is super front-loading its organization. I think this was your suggestion with the parse_data_blocks later (@bendichter).
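To illustrate what 'limiting the calls to the memmap' could mean (using a made-up block layout, not Plexon's actual dtypes): slicing the memmap inside a tight Python loop pays the memmap __getitem__/__array_finalize__ overhead seen in the profiles above on every block, whereas reading the bytes into RAM once and taking plain-ndarray views keeps that overhead out of the loop.

```python
import numpy as np

# hypothetical per-block header layout, for illustration only
block_dtype = np.dtype([("type", "<i2"), ("timestamp", "<u4"),
                        ("channel", "<i2"), ("n_samples", "<i2")])

block_offsets = [0, 512, 1024]  # placeholder offsets

# slow pattern: one tiny memmap slice per data block
mm = np.memmap("file.plx", dtype="u1", mode="r")
headers_slow = [mm[o:o + block_dtype.itemsize].view(block_dtype) for o in block_offsets]

# cheaper pattern: read the bytes into RAM once, then take plain-ndarray views
buf = np.fromfile("file.plx", dtype="u1")
headers_fast = [buf[o:o + block_dtype.itemsize].view(block_dtype) for o in block_offsets]
```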
Thanks for looking into this. Yes, that was my suggestion. Also I would expect the speed up to be way more than 2.67x on a real sized recording file.
If you are willing to share a file I can do some testing on my end. I was just playing around with the Plexon website's free sample data (which is 10 MB), but if you have something bigger I could also profile that, so we could see (it might be worth testing, because if this scales I don't know if going from 20 min to 10 min helps enough for making the user wait around).
I just find it super interesting that Plexon as-is is between 10-100x slower than Intan, and if I comment out the data_blocks parsing with the small file it is still 5-50x slower than Intan. That makes me think this is a combo of how we are doing it + some inefficiencies in the .plx format.
Sounds good. I sent you an invitation to a file on Google Drive via your email.
I suppose if it is really not feasible to change this on the NEO side, we could create a new type of data interface that delays the initialization of the neo reader and uses custom code to extract metadata. Then all neo readers that have a constructor which takes more time than is acceptable for our use-cases can use this other data interface instead of our current one.
Wouldn't even need to be a 'new kind' of interface; just override the __init__ behavior specific to the PlexonRecordingInterface to not initialize the extractor until the add_to_nwbfile step, and put the custom metadata parsing in the get_metadata step.
The greatest downside is that if there's a problem reading the file data, we usually know that right away from the __init__ step, so the user normally can't spend/waste time going further. With this approach they wouldn't know of issues until the 2nd-to-last step of the workflow (the last being DANDI upload).
yes, you could do it that way, but I don't like skipping super().__init__
> Sounds good. I sent you an invitation to a file on Google Drive via your email.
I'm downloading it now. I'm doing recordings the rest of the day, but I'll profile it tomorrow and post here. That way you can decide if you want to change on your side and we can decide how updates would scale to large files on the neo side.
You could super(parent).__init__ to bypass the neo init but not bypass the BaseDataInterface init.
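Very roughly, that could look like the following on the NeuroConv side (class and method names taken from this discussion; import paths and signatures are assumed rather than checked, so treat it as a sketch):

```python
from neuroconv.basedatainterface import BaseDataInterface          # path assumed
from neuroconv.datainterfaces import PlexonRecordingInterface      # path assumed


class LazyPlexonRecordingInterface(PlexonRecordingInterface):
    """Sketch: defer the expensive neo reader until data is actually needed."""

    def __init__(self, file_path, **kwargs):
        # bypass PlexonRecordingInterface.__init__ (which builds the neo reader),
        # but keep the BaseDataInterface bookkeeping
        BaseDataInterface.__init__(self, file_path=file_path, **kwargs)
        self._deferred_kwargs = dict(file_path=file_path, **kwargs)
        self._extractor_ready = False

    def get_metadata(self):
        metadata = dict(NWBFile=dict())
        # lightweight custom parsing of the global .plx header would go here,
        # e.g. to fill in metadata["NWBFile"]["session_start_time"]
        return metadata

    def add_to_nwbfile(self, nwbfile, metadata=None, **conversion_options):
        if not self._extractor_ready:
            # the ~20-minute PlexonRawIO header parse happens here, at conversion time
            PlexonRecordingInterface.__init__(self, **self._deferred_kwargs)
            self._extractor_ready = True
        super().add_to_nwbfile(nwbfile, metadata=metadata, **conversion_options)
```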
@CodyCBakerPhD yes, that's an option, though I would prefer that the __init__ hierarchy match the class hierarchy as much as possible, particularly in NeuroConv (as opposed to a lab-specific repo where future-proofing is less of a priority) and particularly in cases that might come up again (in this case, for any NeoRawIO that takes a long time to initialize). In any case, let's discuss this off of the neo channel, since it's really a NeuroConv matter.
So, with the file I was testing (just so we have a baseline and the changes), I ran cProfile, and I figured I can just display the raw results. But to summarize first:
total parse_header time, with the cProfile overhead, was 15:39. Of note, I did not use tqdm.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 939.752 939.752 {built-in method builtins.exec}
1 0.000 0.000 939.752 939.752 <string>:1(<module>)
1 0.912 0.912 939.752 939.752 baserawio.py:172(parse_header)
1 450.398 450.398 938.840 938.840 plexonrawio.py:68(_parse_header)
97977737 122.140 0.000 268.985 0.000 {method 'view' of 'numpy.ndarray' objects}
195955481 229.603 0.000 251.810 0.000 memmap.py:289(__array_finalize__)
195955696 80.402 0.000 216.492 0.000 memmap.py:334(__getitem__)
97977688 31.126 0.000 31.126 0.000 _internal.py:525(_view_is_safe)
195955477 11.246 0.000 11.246 0.000 multiarray.py:1393(may_share_memory)
195955489 10.960 0.000 10.960 0.000 {built-in method builtins.hasattr}
48989122 2.784 0.000 2.784 0.000 {method 'append' of 'list' objects}
And for me doing a dummy removal of the organization of the data_blocks (@bendichter's idea was to do this later, although for my test I was just removing it):
we get a total time of 6:31.468 (again, no tqdm).
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 391.468 391.468 {built-in method builtins.exec}
1 0.000 0.000 391.468 391.468 <string>:1(<module>)
1 1.011 1.011 391.468 391.468 baserawio.py:172(parse_header)
1 190.081 190.081 390.456 390.456 plexonrawio.py:68(_parse_header)
48988893 58.561 0.000 113.651 0.000 {method 'view' of 'numpy.ndarray' objects}
97977793 75.073 0.000 85.629 0.000 memmap.py:289(__array_finalize__)
97978008 37.703 0.000 83.208 0.000 memmap.py:334(__getitem__)
48988844 14.965 0.000 14.965 0.000 _internal.py:525(_view_is_safe)
97977789 5.474 0.000 5.474 0.000 multiarray.py:1393(may_share_memory)
97977801 5.081 0.000 5.081 0.000 {built-in method builtins.hasattr}
48989133 2.587 0.000 2.587 0.000 {method 'append' of 'list' objects}
The final result of delaying is an overall speedup of 2.4x, which suggests the improvement would roughly scale. What do you think, @bendichter and @CodyCBakerPhD?
Is your feature request related to a problem? Please describe.
Initializing PlexonRawIO can take 20 minutes to parse the headers for a real file. This is a problem e.g. in NWB GUIDE, where we need to get metadata. This requires initialization of every session, which would amount to making the user wait for hours in the middle of building the conversion just to fetch the metadata from each session.

Describe the solution you'd like
I would like to keep the metadata parsing in the constructor and move the expensive header parsing to the read command. This would use caching to ensure it is only executed once, even when the read command is used multiple times. This would not improve the speed, but would delay the long wait until a time that is much better for usability.

Describe alternatives you've considered
It would be great to also improve the efficiency of the header parsing code, if possible.