eqcorrscan / EQcorrscan

Earthquake detection and analysis in Python.
https://eqcorrscan.readthedocs.io/en/latest/

WIP: Speed up a few slowdowns when handling large datasets #522

Open flixha opened 1 year ago

flixha commented 1 year ago

This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because it is rather the sum of these things together that takes extra time; none of these slowdowns changes EQcorrscan fundamentally.

I will create PRs for each point for which I have a suggested solution, so that we can systematically merge, improve, or reject the suggested solutions. I will add the links to the PRs here, but I'll need to organize a bit for that first.

Here are some slowdowns (tests with python 3.11, all in serial code):

Is your feature request related to a problem? Please describe. All the points in the upper list occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it's wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template-matching correlations etc.

Three more slowdowns where parallelization can help:

flixha commented 1 year ago

I have now added 5 PRs for the upper list that you can have a look at - the links to the PRs are added to the list.

I'll prepare the PRs for the lower list (points 8-10) in a bit, but these could maybe benefit from some extra ideas.

calum-chamberlain commented 1 year ago

Awesome @flixha - thanks for this! Sorry for the lack of response - hopefully you received at least one out-of-office email from me. I have been in the field for about the last month and it is now our shutdown period, so I likely won't get to this until January. Sorry again, and thanks for all of this - it looks like it is going to be very valuable!

calum-chamberlain commented 1 year ago

Hey @flixha - Happy New Year! I'm not back at work yet, but I wanted to get the CI tests running again to try and help you out with those PRs - it looks like a few PRs are either failing or need tests. Happy to have more of a look in the next week or so.

flixha commented 1 year ago

Happy New Year @calum-chamberlain! Yes, I did see that you were in the field, and since I posted the PRs just before the holiday break I certainly did not expect a very quick reply - it was just a good way for me to finally cut these changes into compact PRs. Thanks a lot for already fixing the tests and going through all the PRs!

flixha commented 1 year ago

I'm looking into two more points right now which can slow things down for larger datasets:

  1. matched_filter loops over all templates to make families with the related detections. When adding each family to the party (see this line), there is another loop in which the template for the family being added is compared to each template that already forms a family in the party. That comparison checks all properties, including the trace data and event properties, so it is not super cheap, and it scales as O(n^2) with the number of templates/families: in an example with 15500 templates there were 544,561,637 calls to template.__eq__, costing 250 seconds in total.
    • I think we don't really need the full check for existing families in matched_filter as long as the tribe doesn't contain duplicated templates. We already have other checks for that, so a simple concatenation without the checks should suffice here?
    • i.e., party.families.append(family) instead of party += family (see the sketch after this list)
  2. For the full dataset, utils.correlate._get_array_dicts takes 400 seconds to prepare the data, which is partly due to handling a lot of UTCDateTime objects, sorting the traces in each template, and other list / numpy operations. I'm looking into how to speed that up... (PR here: https://github.com/eqcorrscan/EQcorrscan/pull/536)
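
A minimal sketch of the append idea from point 1, assuming a duplicate-free tribe (the helper name is hypothetical; `Party.families` is a plain list):

```python
from eqcorrscan.core.match_filter import Party

def append_families(party: Party, families):
    # `party += family` compares the incoming template against every existing
    # family via Template.__eq__; over a whole run that is O(n^2) in the
    # number of families. A plain list extend skips the duplicate check, so
    # it is only safe when the tribe is already known to be duplicate-free.
    party.families.extend(families)
    return party
```
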
calum-chamberlain commented 1 year ago

@flixha - do you make use of the group_size argument to split your large template sets into groups for matched-filtering? I am starting to work on changing how the different steps of the process are executed to try and make better use of cpu time.

As you well know, a lot of the operations before getting to the correlation stage are done in serial, and a lot of them do not make sense to port to parallel processing. What I am thinking about is implementing something similar to how seisbench uses either asyncio or multiprocessing to process the next chunk of data while other operations are taking place. See their baseclass for an example of how they implement this. I don't think using asyncio would make sense for EQcorrscan because most of the slow points are not I/O-bound, so using multiprocessing makes more sense. By doing this we should spend less time in purely serial code, and should see speed-ups (I think). In part this is motivated by me getting told off by HPC maintainers for not hammering their machines enough!
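
A rough sketch of that prefetch pattern, with hypothetical preprocess/correlate stand-ins rather than real EQcorrscan functions:

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess(chunk):
    return f"prepared {chunk}"      # stand-in for the serial prep work

def correlate(prepared):
    print("correlating", prepared)  # stand-in for the parallel correlations

def detect_chunks(chunks):
    # One worker process prepares the next chunk while the main process
    # correlates the current one.
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(preprocess, chunks[0])
        for nxt in chunks[1:] + [None]:
            prepared = future.result()
            if nxt is not None:
                future = pool.submit(preprocess, nxt)  # prefetch next chunk
            correlate(prepared)  # overlaps with the prefetch above

if __name__ == "__main__":
    detect_chunks(["day1", "day2", "day3"])
```
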

I'm keen to do this and there are two options I see:

  1. Implementing at a group level, such that the next group's steps are processed in advance. This would work on the same chunk of data, so you would only see speed-ups for large template sets.
  2. Implementing at a higher level across chunks of data, possibly in Tribe.client_detect such that the next chunk of data is downloaded and processed in advance. I'm not sure that these two options are mutually exclusive, but it would take some thinking to make sure that I'm not trying to create child processes from child processes.

Keen to hear your thoughts. I think this would represent a fairly significant refactor of the code in Tribe and matched_filter so I was hoping to work on this once we have merged all of your speed-ups to avoid conflicts.

flixha commented 1 year ago

Hi @calum-chamberlain, yes, I use group_size in more or less all the bigger problem sets, as I'd otherwise run out of memory. For a set of 15k templates I was using group_size=1570 (i.e., 10 groups) with fast_matched_filter on 4x A100 (4x16 GB GPU memory, 384 GB system memory), while with fmf2 on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory) I was using group_size=5500 (3 groups; caveat: fmf and fmf2 may duplicate some memory unnecessarily across multiple GPUs, but that's a minor thing). Those are just some examples (I don't remember the biggest fftw examples right now). The preparation of the data before starting the correlations for a group does take quite a bit of time, so it's good to hear you're thinking about this.
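
For reference, an illustrative group_size call (filenames and threshold settings here are placeholders, not the values from these runs):

```python
from obspy import read
from eqcorrscan.core.match_filter import Tribe

tribe = Tribe().read("tribe_15500_templates.tgz")  # placeholder filename
st = read("one_day_of_data.mseed")                 # placeholder filename

party = tribe.detect(
    stream=st,
    threshold=8.0,            # placeholder detection threshold
    threshold_type="MAD",
    trig_int=6.0,             # placeholder trigger interval (s)
    group_size=1570,          # split ~15500 templates into 10 groups
)
```
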

Just some quick thoughts (may add more tomorrow when I'm fully awake ;-)):

flixha commented 1 year ago

  * On Slurm clusters, I run array jobs to chunk the problem at the highest level (across time periods), so that I distribute the serial parts of the code across as many array tasks as possible. Usually I can fit at most 3 array tasks into one node, and mostly just one, due to the memory requirements.

To be specific: for the 15k templates I could only fit one array task per node, while for picking with the same tribe I could fit 2 or 3, depending on the node configuration.
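
A minimal sketch of that per-task chunking (dates and chunk length are illustrative):

```python
import os
from obspy import UTCDateTime

DAYS_PER_TASK = 30  # e.g. total days / number of array tasks
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))

# Each Slurm array task works on its own contiguous window of days.
start = UTCDateTime(2022, 1, 1) + task_id * DAYS_PER_TASK * 86400
end = start + DAYS_PER_TASK * 86400
# Read the tribe once per task, then detect over [start, end) on this node.
```
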

calum-chamberlain commented 1 year ago

Thanks for those points.

flixha commented 1 year ago
* I thought we were grouping somewhat sensibly, but it might just be grouping based on processing rather than the most consistent channel sets - this should definitely be done.

I think it's only processing parameters right now, so indeed it would be worth changing that.

* I do similar with splitting chunks of time across nodes. However I have been chunking my datasets into `ndays` where `ndays` is typically `total days / nodes` rather than splitting day by day. In that regard, eagerly processing the next timestep of data would be fine because it would stay on the same node.

If I understand correctly, each node gets a task like "work on these 30 days", is that correct? Indeed, that's how I do it; restarting the process for each day would be too costly because of reading the tribe and some other metadata (but maybe I misunderstood).

* I have been wondering for a while about using [dask's](https://docs.dask.org/en/stable/) schedulers for working across clusters. At some point I would like to explore this more, but I think it might be worth making a new project that uses EQcorrscan. I haven't managed to get the schedulers to work how I expect them to work either.

That's probably the more sensible way to go for now. As an additional point to this: right now it's easy for the user to do their own preparations on all the streams, according to their processing parameters, before calling EQcorrscan's main detection functions; with dask, I imagine the order of things and how to do them would need some thought (set up / distribute multi-node jobs; let the user do some early processing; start preprocessing / detection functions).
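
For what it's worth, a bare-bones sketch of what such a dask-based driver around EQcorrscan could look like (the per-day function and scheduler address are hypothetical):

```python
from dask.distributed import Client

def detect_day(day):
    # stand-in: load one day of data and run tribe.detect on it
    return f"party for {day}"

client = Client("tcp://scheduler:8786")  # or Client() for a local cluster
futures = client.map(detect_day, ["2023-01-01", "2023-01-02"])
parties = client.gather(futures)
```
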

* Agree on duplicated work. When EQcorrscan started I did not anticipate either the scales of template sets that we would end up using it for (I thought a couple of hundred templates was big back in the day!), nor that we might use different parameters within a Tribe - the current wasteful logic is a result of that. I would really like to redesign EQcorrscan (less duplication, more time directly working on numpy arrays rather than obspy traces, making use of multithreading on GIL-releasing funcs, ..., the list goes on) but I don't know if I will ever have time.

I totally understand, and I think it wasn't so easy to anticipate all that. And now I'm happy that EQcorrscan has gotten more and more mature, especially in handling all kinds of edge cases that could mess things up earlier. Using pure numpy arrays would be very nice, but it would also have made debugging all the edge cases harder.

Just some more material for thought: I'm attaching two cProfile output files from the 1-day-per-node runs described above, for a set of 15k templates and 1 day of data:

  1. with `fast_matched_filter` on 4x A100 (4x16 GB GPU memory, 384 GB system memory), group_size=1570 (i.e., 10 groups): detect_15500templates_1_fmf.profile.log
  2. with `fmf2` on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory): detect_15500templates_2_fmf2.profile.log

[Screenshot 2023-03-17 at 09:19:40: snakeviz view of the first profile]
With `snakeviz` this looks like the above and gives some ideas of where a lot of time could be gained (let's ignore tribe reading because it only needs to happen once for multiple days).
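
For completeness, a minimal sketch of how such a profile can be generated and inspected (assuming a main() entry point wrapping the detection run):

```python
import cProfile
import pstats

def main():
    pass  # stand-in for the tribe.detect run being profiled

cProfile.run("main()", "detect_15500templates.profile")
stats = pstats.Stats("detect_15500templates.profile")
stats.sort_stats("cumulative").print_stats(20)
# Visualize interactively with:  snakeviz detect_15500templates.profile
```
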

calum-chamberlain commented 1 year ago

As you have already identified, it looks like a lot of time is spent copying nan-channels. Hopefully more efficient grouping in #541 will help reduce this time-sink.

You did understand me correctly with the grouping of days into chunks. Using this approach we could, for example, have a workflow where each step puts its result in a multiprocessing Queue that the next step queries as its input. This could increase parallelism by letting the necessarily serial components (getting the data, doing some of the prep work, ...) run concurrently with other tasks, with those other tasks ideally implementing parallelism using GIL-releasing threading (e.g. #540), openmp parallelism (e.g. our fftw correlations, and fmf cpu correlations) or gpu parallelism (e.g. fmf or your fmf2).
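
A bare-bones sketch of that queue-chained workflow, with stand-in stage functions rather than EQcorrscan internals:

```python
import multiprocessing as mp

SENTINEL = "STOP"

def downloader(days, outq):
    for day in days:
        outq.put(f"stream for {day}")   # stand-in: fetch a day of waveforms
    outq.put(SENTINEL)

def preparer(inq, outq):
    while (stream := inq.get()) != SENTINEL:
        outq.put(f"prepared {stream}")  # stand-in: serial prep work
    outq.put(SENTINEL)

if __name__ == "__main__":
    days = ["2023-01-01", "2023-01-02", "2023-01-03"]
    q_raw, q_ready = mp.Queue(maxsize=1), mp.Queue(maxsize=1)
    stages = [
        mp.Process(target=downloader, args=(days, q_raw)),
        mp.Process(target=preparer, args=(q_raw, q_ready)),
    ]
    for p in stages:
        p.start()
    # The main process runs the compute-heavy correlation stage as chunks
    # arrive, overlapping with download and prep of the next chunks.
    while (prepared := q_ready.get()) != SENTINEL:
        print("correlating", prepared)  # stand-in: fftw/fmf/gpu correlations
    for p in stages:
        p.join()
```
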