NeuralEnsemble / python-neo

Neo is a package for representing electrophysiology data in Python, together with support for reading a wide range of neurophysiology file formats
http://neo.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

v0.8.0 performance degradation #746

Open dizcza opened 4 years ago

dizcza commented 4 years ago

Long story short: some of our analysis gets roughly 14x slower (from 90 sec to 1279 sec) after upgrading neo from 0.7.2 to 0.8.0.

  1. The bottleneck is the utility function cut_segment_by_epoch, which until v0.8.0 we had carried from repo to repo; now that it is part of the neo release, I decided to switch to your utils.py implementation (see the sketch after this list). Previously, we used

def seg_time_slice(seg, t_start=None, t_stop=None, reset_time=False, **kwargs)

which is now replaced by segment.time_slice() (I can share the script if needed). Here is the bottleneck reported by the timing profiler, made with https://github.com/rkern/line_profiler:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

<0.7.2 our utils.py, seg_time_slice()  func>

  1485    165891     132936.0      0.8      0.2      for st_id in range(len(seg.spiketrains)):
  1486    165165     210739.0      1.3      0.3          if isinstance(seg.spiketrains[st_id], neo.SpikeTrain):
  1487    165165   36760743.0    222.6     56.6              st_time_slice = seg.spiketrains[st_id].time_slice(t_start, t_stop)
  1488                                                   elif (hasattr(neo.io, 'proxyobjects') and
  1489                                                         isinstance(seg.spiketrains[st_id], neo.io.proxyobjects.SpikeTrainProxy)):
  1490                                                       st_time_slice = seg.spiketrains[st_id].load(time_slice=(t_start, t_stop))

<0.8.0 neo/utils.py>

   317    165891     132030.0      0.8      0.0          for st_id in range(len(self.spiketrains)):
   318    165165     273001.0      1.7      0.0              if hasattr(self.spiketrains[st_id], '_rawio'):
   319     55055  928402216.0  16863.2     90.9                  st_time_slice = self.spiketrains[st_id].load(time_slice=(t_start, t_stop))
   320                                                       else:
   321    110110   83800203.0    761.1      8.2                  st_time_slice = self.spiketrains[st_id].time_slice(t_start, t_stop)
   322    165165     277507.0      1.7      0.0              if reset_time:
   323                                                           st_time_slice = st_time_slice.time_shift(t_shift)
   324    165165     327103.0      2.0      0.0              subseg.spiketrains.append(st_time_slice)
  2. Loading data now takes 47 sec (was 25 sec in 0.7.2) simply because the if condition below is now triggered, whereas previously it was rejected and execution branched to the else statement:
            if hasattr(neo.io, 'proxyobjects') and isinstance(st,
                                                              neo.io.proxyobjects.SpikeTrainProxy):
                st_for_calc = st.load(load_waveforms=True)
            else:
                st_for_calc = st
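
For context, a minimal sketch of the switch described in item 1, assuming a Blackrock recording; the file name is a placeholder, and the exact signature of cut_segment_by_epoch may differ between neo versions:

    import neo
    from neo.utils import cut_segment_by_epoch

    # lazy=True in 0.8.0 fills the segment with proxy objects instead of data
    reader = neo.io.BlackrockIO(filename='session.ns2')  # placeholder file name
    block = reader.read_block(lazy=True)
    seg = block.segments[0]

    # the old local seg_time_slice() helper sliced spiketrains already in memory;
    # the shipped utility instead calls SpikeTrainProxy.load(time_slice=...) for
    # every proxy spiketrain and every epoch entry, which is where the time goes
    cut_segments = cut_segment_by_epoch(seg, seg.epochs[0], reset_time=True)
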
samuelgarcia commented 4 years ago

Hi, thank you for catching this.

To be sure: does the slowdown occur with lazy=True or with lazy=False (e.g. reader.read_block(lazy=False))?

dizcza commented 4 years ago

We pass lazy=True but then

    if lazy and not hasattr(neo.io, 'proxyobjects'):
        warnings.warn('Your version of neo does not support proxyobjects. Setting lazy=False.')
        lazy = False

and so it becomes False with neo v0.7.2.

samuelgarcia commented 4 years ago

OK. I think the performance difference could simply be explained by the fact that when lazy=False the whole dataset is loaded and then sliced directly in memory. This can be faster in some cases, for instance when the dataset fits in memory.

Slicing with lazy loading is more elegant and uses less memory, but it does not guarantee faster computation. Slicing on disk can have a cost, depending on the IO. If the IO has a memmap it should be fast, but in some cases slicing on disk involves tricky block access and concatenation that can limit the speed.

Could you try the slicing with lazy=False in 0.8.0 and tell me the performance?
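
A comparison like the one requested could look roughly like the sketch below; the file name and the slicing workload are placeholders, not the original benchmark script.

    import time

    import neo

    def run_benchmark(lazy):
        # placeholder path; the real benchmark uses a Blackrock recording
        reader = neo.io.BlackrockIO(filename='session.ns2')
        t_begin = time.perf_counter()
        block = reader.read_block(lazy=lazy)
        # ... cut the segments by epochs here, as in the original analysis ...
        return time.perf_counter() - t_begin

    print('lazy=True :', run_benchmark(True))
    print('lazy=False:', run_benchmark(False))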

dizcza commented 4 years ago

With lazy=False it takes 36 sec.

samuelgarcia commented 4 years ago

So the matter is not 0.7.2 vs 0.8.0 but lazy=False vs lazy=True, no?

My guess is that slicing on disk is more performant when the summed duration of the slices is small relative to the total duration of the segment. In short, slicing with lazy=True is useful for small slices of very long recordings.

I don't know if I am clear enough.

Why IO ?

dizcza commented 4 years ago

Why IO ?

What?

You can say that the lazy flag is the culprit as long as you agree that 36 sec ~ 25 sec. If you think it's fine, then it's fine. Maybe it makes sense, though, to add some benchmarks to your tests so that you can see speedups and slowdowns immediately.
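
Such a regression benchmark could be as simple as the sketch below; this is not part of neo's test suite, just an illustration that assumes the pytest-benchmark plugin and a small local test file.

    import neo
    import pytest

    TEST_FILE = 'tests/data/small_recording.ns2'  # placeholder path

    def read_full_block(lazy):
        reader = neo.io.BlackrockIO(filename=TEST_FILE)
        return reader.read_block(lazy=lazy)

    @pytest.mark.parametrize('lazy', [True, False])
    def test_read_block_speed(benchmark, lazy):
        # pytest-benchmark records the timing of each run, so a slowdown between
        # releases shows up as soon as the suite is run on both versions
        benchmark(read_full_block, lazy)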

dizcza commented 4 years ago

Please update https://neo.readthedocs.io/en/latest/releases/0.8.0.html#lazy-loading accordingly (and https://neo.readthedocs.io/en/latest/io.html#lazy-option-and-proxy-objects is also not clear about the time-memory tradeoff). It currently states

This can lead to major savings in time and memory consumption

At least make users aware that lazy=True might speed things up (if such a case is ever encountered) as well as slow them down.

samuelgarcia commented 4 years ago

I meant: with which IO? Sorry, I should read what I am writing.

My guess is that performance with lazy=True/False is IO dependent.

I don't understand all your timing figures. To clarify:

What is the total underlying segment duration? How many slices, and of what durations, are you looping over?

dizcza commented 4 years ago

The lazy flag relates to my second benchmark only.

which IO

We use

block = neo.io.BlackrockIO(filename=v_data_full_path, nsx_to_load=nsx_number,
                                     nev_override=v_sorted_full_path)

which is filled up with .hdf5 files for the LFPs later on (plus a bunch of .nev and .ns2 files).

samuelgarcia commented 4 years ago

OK, thank you. Internally, for BlackrockIO, the signals are a pure memmap, so it should be super fast and we should see a speedup.

For spiketrains, the lazy load in BlackrockIO is not efficient, because all spikes are in the same file and for each slice we need to run through the file to keep the spikes that belong to the wanted slice.

Maybe that is the cause.

I don't have any other ideas.

Maybe it could be the spiketrain waveforms, which were not loaded before and are loaded now.

Is the hdf5 file bigger now?

Could you try the same benchmark for Blackrock but only with analog signals, by copying only the .ns6 file to another directory?

jpgill86 commented 4 years ago

I'm just curious, is the first performance degradation you described (90 sec to 1279 sec for segment slicing) still bad in 0.8.0 with lazy=False?

I wonder if the huge slowdown is the cumulative overhead of calling BlackrockIO's inefficient lazy loading of spiketrains 100k+ times. I'm assuming that with lazy=False the file is read just once and slicing is performed on the block in memory, which should be fast, I think.

dizcza commented 4 years ago

With the comments growing, I decided to move discussing and fixing the issue to a session with @JuliaSprenger that will take place in two days, on Wednesday.

Kleinjohann commented 4 years ago

I wonder if the huge slowdown is the cumulative overhead of calling BlackrockIO's inefficient lazy loading of spiketrains 100k+ times. I'm assuming that with lazy=False the file is read just once and slicing is performed on the block in memory, which should be fast, I think.

Exactly, that is the main issue. The most general way of tackling this that I could come up with would be the possibility to load multiple time slices of a spiketrain at once. This would have to be done at the RawIO level and, in this use case, would result in only one read from disk per spiketrain for all epoch times. All of the time slicing could then be performed at once on the same temporarily loaded array. This could be achieved by, e.g., allowing t_start and t_stop to be 1D arrays in all the functions required for this operation. I'm not sure if this would overcomplicate the loading and/or cause other issues elsewhere, though. Since spiketrain loading is IO specific, this would have to be changed in every RawIO individually.
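
To illustrate the idea (this is not existing neo API): the profile above shows roughly 55,000 separate load() calls at about 17 ms each, close to 928 s in total, so collapsing them into one load per spiketrain is the attractive part. Once the spike times are in memory, every epoch window can be cut out with vectorized searches, assuming the spike times are sorted:

    import numpy as np

    def slice_spiketrain_by_epochs(spike_times, t_starts, t_stops):
        """Cut one already-loaded, sorted array of spike times into many windows.

        spike_times : 1D sorted array of spike times
        t_starts, t_stops : 1D arrays of epoch boundaries, one entry per epoch
        Returns a list with one array of spike times per epoch.
        """
        # one pair of binary searches per epoch instead of one disk read per epoch
        left = np.searchsorted(spike_times, t_starts, side='left')
        right = np.searchsorted(spike_times, t_stops, side='right')
        return [spike_times[lo:hi] for lo, hi in zip(left, right)]

    # hypothetical usage: load the spiketrain from disk once, then slice it
    # against every epoch of the segment in memory
    # st = proxy.load()  # single read per spiketrain
    # windows = slice_spiketrain_by_epochs(st.magnitude,
    #                                      epoch.times.magnitude,
    #                                      (epoch.times + epoch.durations).magnitude)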

samuelgarcia commented 4 years ago

In short: we should change the documentation soon, because the lazy loading was written with signals in mind, and I think it is efficient for most IOs.

For spiketrains and events, of course, slicing this way is not efficient at all for most IOs.

So for spikes I would not use lazy loading at all, or I would load everything immediately and do something like:


reader = MyIo()
seg = reader.read_segment(lazy=True)
# load all spiketrains into memory
spiketrains = [st.load() for st in seg.spiketrains]
# and keep the lazy loading for AnalogSignal objects

Maybe for a future version we could have a selective lazy option:

seg = reader.read_segment(lazy={'AnalogSignal': True, 'SpikeTrain': False})

dizcza commented 4 years ago
  1. I'm not sure whether we should complicate the 'laziness' logic by introducing lazy={'AnalogSignal': True, 'SpikeTrain': False}. In the future, one might want to extend it even more, leading to an overcomplicated structure. I'd suggest sticking to either False or True and putting the example you showed in the docs. Or even better, always ignore laziness when loading spiketrains (if that's possible).

  2. There is still a small slowdown in loading the block with BlackrockIO:

v0.7.2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

   146         1          2.0      2.0      0.0      motor_file = neo.io.BlackrockIO(filename=m_data_full_path, nsx_to_load=nsx_number,
   147         1    3333432.0 3333432.0     15.3                                      nev_override=m_sorted_full_path)
   148         1          6.0      6.0      0.0      visual_file = neo.io.BlackrockIO(filename=v_data_full_path, nsx_to_load=nsx_number,
   149         1    3815768.0 3815768.0     17.6                                       nev_override=v_sorted_full_path)
   150                                           
   151                                               # read both blocks
   152         1    6556341.0 6556341.0     30.2      motor_block = motor_file.read_block(load_waveforms=load_waveforms, lazy=lazy)
   153         1    8030084.0 8030084.0     36.9      visual_block = visual_file.read_block(load_waveforms=load_waveforms, lazy=lazy)

v0.8.0

   146         1          2.0      2.0      0.0      motor_file = neo.io.BlackrockIO(filename=m_data_full_path, nsx_to_load=nsx_number,
   147         1    1768212.0 1768212.0      5.4                                      nev_override=m_sorted_full_path)
   148         1          4.0      4.0      0.0      visual_file = neo.io.BlackrockIO(filename=v_data_full_path, nsx_to_load=nsx_number,
   149         1    1925418.0 1925418.0      5.8                                       nev_override=v_sorted_full_path)
   150                                           
   151                                               # read both blocks
   152         1   12967438.0 12967438.0     39.3      motor_block = motor_file.read_block(load_waveforms=load_waveforms, lazy=lazy)
   153         1   16324623.0 16324623.0     49.5      visual_block = visual_file.read_block(load_waveforms=load_waveforms, lazy=lazy)

Both with lazy=False.

The new neo version creates a BlackrockIO instance faster but reads a block slower, leading to the summed benchmark:

v0.7.2: 9.9 sec, v0.8.0: 14.7 sec

A single slowdown like this is not a big deal. But if the implementation kept increasing the loading time by a factor of 1.5 from version to version, the runtime would eventually grow geometrically. I agree, though, that this scenario is hardly possible.

That is all for these two points. You can close the issue if you think this is not a problem.

samuelgarcia commented 4 years ago

Thanks for the benchmark report. This slowdown with lazy=False is really strange. I need to think about it. Let's keep this open, it is important.

zm711 commented 1 month ago

Is there any interest in repeating this now that we are on 13.1? If there is no response within a week, I will close this, since we haven't explored it for nearly 4 years.