Time expensive conversion step

samuelbray32 commented 8 months ago

The largest time contributor left in the conversion is: objectmapper.py:183(convert_dtype) from the hdmf package
When the write step of conversion begins, there is a long lag period (especially for larges files) where the new nwb file size remains constant at ~96Kb
Many calls of that convert_dtype function above happen during this stall

I Just wanted to check if anyone knew of a redundant conversion that might be happening here we could avoid

edeno commented 8 months ago

@rly any thoughts?

rly commented 8 months ago

The NWB ElectricalSeries, SpatialSeries, and TimeSeries types allow any numeric dtype for data, so no conversion should be done. The timestamps field must be float64 but as far as I can tell, they are already float64 for the ephys, analog, DIO, and sample count data. Are the position timestamps in float64 before conversion? It's hard for me to trace the code.

Also, are they all numpy arrays? If the data/timestamps is a list or tuple, then HDMF calls convert_dtype on each element which will be slow... It looks like the data/timestamps for ephys, analog, DIO, sample count, and position are all numpy arrays. It's possible the slowdown could be from the electrode table data, but those are relatively small and should not result in a slowdown that scales with data size.

One thing you could do to figure out where the bottleneck is to do the conversion for just the analog data, just the ephys data, just the position data, etc. and see which one calls convert_dtype too many times.

samuelbray32 commented 8 months ago

Thanks for the insight, I'll work on tracing that this afternoon

samuelbray32 commented 8 months ago

Think I found the source. When combining multiple rec files, the RecFileDataChunkIterator was extending timestamps as a list rather than concatenating into an array. Applying https://github.com/LorenFrankLab/spikegadgets_to_nwb/commit/92ae309d46c58fa9daeaaa103255fa53f4d03ac7 cuts conversion times in ~half

edeno commented 8 months ago

Nice!

rly commented 8 months ago

Great! Could you also update this line https://github.com/LorenFrankLab/spikegadgets_to_nwb/blob/main/src/spikegadgets_to_nwb/convert_ephys.py#L149 where multiple chunks are extended into a list and then converted into a numpy array? In my toy example tests, the list approach is slower, but how much depends on the chunk sizes.

edeno commented 8 months ago

Fixed

LorenFrankLab / trodes_to_nwb

Time expensive conversion step #46