eqcorrscan / EQcorrscan

Earthquake detection and analysis in Python.
https://eqcorrscan.readthedocs.io/en/latest/

WIP: Speed up a few slowdowns when handling large datasets #522

Open flixha opened 1 year ago

flixha commented 1 year ago

This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because it is rather the sum of these things together that takes extra time; none of these slowdowns changes EQcorrscan fundamentally.

I will create PRs for each point for which I have a suggested solution, so that we can systematically merge, improve, or reject the suggested solutions. I will add the links to the PRs here, but I'll need to organize a bit for that first.

Here are some slowdowns (tests with python 3.11, all in serial code):

Is your feature request related to a problem? Please describe. All the points in the upper list occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it's wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template-matching correlations etc.

Three more slowdowns where parallelization can help:

flixha commented 1 year ago

I have now added 5 PRs for the upper list that you can have a look at - the links to the PRs are added to the list.

I'll prepare the PRs for the lower list (points 8-10) in a bit, but these could maybe benefit from some extra ideas.

calum-chamberlain commented 1 year ago

Awesome @flixha - thanks for this! Sorry for the lack of response - hopefully you received at least one out-of-office email from me. I have been in the field for about the last month and it is now our shutdown period, so I likely won't get to this until January. Sorry again, and thanks for all of this - it looks like it is going to be very valuable!

calum-chamberlain commented 1 year ago

Hey @flixha - Happy New Year! I'm not back at work yet, but I wanted to get the CI tests running again to try and help you out with those PRs - it looks like a few PRs are either failing or need tests. Happy to have more of a look in the next week or so.

flixha commented 1 year ago

Happy New Year @calum-chamberlain! Yes, I did see that you were in the field, and since I posted the PRs just before the holiday break I certainly did not expect a very quick reply - it was just a good way for me to finally cut these changes into compact PRs. Thanks a lot for already fixing the tests and going through all the PRs!

flixha commented 1 year ago

I'm looking into two more points right now which can slow things down for larger datasets:

  1. matched_filter loops over all templates to make families with the related detections. When adding each family to the party (see this line), there is another loop in which the template for the family being added is compared to each template that already forms a family in the party. That comparison checks all properties, including the trace data and event properties, so it is not super cheap, and it scales as O(n^2) with the number of templates/families: in an example with 15500 templates there were 544,561,637 calls to template.__eq__, costing 250 seconds in total.
    • I think we don't really need the full check for existing families in matched_filter as long as the tribe doesn't contain duplicated templates. We already have other checks for that, so a simple concatenation without the checks should suffice here?
    • i.e., party.families.append(family) instead of party += family (see the sketch after this list)
  2. For the full dataset, utils.correlate._get_array_dicts takes 400 seconds to prepare the data, which is partly due to handling a lot of UTCDateTime objects, sorting the traces in each template, and other list / numpy operations. I'm looking into how to speed that up... (PR here: https://github.com/eqcorrscan/EQcorrscan/pull/536)
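
A minimal sketch of the append idea from point 1, assuming a duplicate-free tribe (the helper name is hypothetical; `Party.families` is a plain list):

```python
from eqcorrscan.core.match_filter import Party

def append_families(party: Party, families):
    # `party += family` compares the incoming template against every existing
    # family via Template.__eq__; over a whole run that is O(n^2) in the
    # number of families. A plain list extend skips the duplicate check, so
    # it is only safe when the tribe is already known to be duplicate-free.
    party.families.extend(families)
    return party
```
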
calum-chamberlain commented 1 year ago

@flixha - do you make use of the group_size argument to split your large template sets into groups for matched-filtering? I am starting to work on changing how the different steps of the process are executed to try and make better use of cpu time.

As you well know, a lot of the operations before getting to the correlation stage are done in serial, and a lot of them do not make sense to port to parallel processing. What I am thinking about is implementing something similar to how seisbench uses either asyncio or multiprocessing to process the next chunk of data while other operations are taking place. See their baseclass for an example of how they implement this. I don't think using asyncio would make sense for EQcorrscan because most of the slow points are not I/O-bound, so using multiprocessing makes more sense. By doing this we should spend less time in purely serial code, and should see speed-ups (I think). In part this is motivated by me getting told off by HPC maintainers for not hammering their machines enough!
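
A rough sketch of that prefetch pattern, with hypothetical preprocess/correlate stand-ins rather than real EQcorrscan functions:

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess(chunk):
    return f"prepared {chunk}"      # stand-in for the serial prep work

def correlate(prepared):
    print("correlating", prepared)  # stand-in for the parallel correlations

def detect_chunks(chunks):
    # One worker process prepares the next chunk while the main process
    # correlates the current one.
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(preprocess, chunks[0])
        for nxt in chunks[1:] + [None]:
            prepared = future.result()
            if nxt is not None:
                future = pool.submit(preprocess, nxt)  # prefetch next chunk
            correlate(prepared)  # overlaps with the prefetch above

if __name__ == "__main__":
    detect_chunks(["day1", "day2", "day3"])
```
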

I'm keen to do this and there are two options I see:

  1. Implementing at a group level, such that the next group's steps are processed in advance. This would work on the same chunk of data, so you would only see speed-ups for large template sets.
  2. Implementing at a higher level across chunks of data, possibly in Tribe.client_detect such that the next chunk of data is downloaded and processed in advance. I'm not sure that these two options are mutually exclusive, but it would take some thinking to make sure that I'm not trying to create child processes from child processes.

Keen to hear your thoughts. I think this would represent a fairly significant refactor of the code in Tribe and matched_filter so I was hoping to work on this once we have merged all of your speed-ups to avoid conflicts.

flixha commented 1 year ago

Hi @calum-chamberlain, yes, I use group_size in more or less all the bigger problem sets, as I'd otherwise run out of memory. For a set of 15k templates I was using group_size=1570 (i.e., 10 groups) with fast_matched_filter on 4x A100 (4x16 GB GPU memory, 384 GB system memory), while with fmf2 on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory) I was using group_size=5500 (3 groups; caveat: fmf and fmf2 may duplicate some memory unnecessarily across multiple GPUs, but that's a minor thing). Those are just some examples (I don't remember the biggest fftw examples right now). The preparation of the data before starting the correlations for a group does take quite a bit of time, so it's good to hear you're thinking about this.
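
For reference, an illustrative group_size call (filenames and threshold settings here are placeholders, not the values from these runs):

```python
from obspy import read
from eqcorrscan.core.match_filter import Tribe

tribe = Tribe().read("tribe_15500_templates.tgz")  # placeholder filename
st = read("one_day_of_data.mseed")                 # placeholder filename

party = tribe.detect(
    stream=st,
    threshold=8.0,            # placeholder detection threshold
    threshold_type="MAD",
    trig_int=6.0,             # placeholder trigger interval (s)
    group_size=1570,          # split ~15500 templates into 10 groups
)
```
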

Just some quick thoughts (may add more tomorrow when I'm fully awake ;-)):

flixha commented 1 year ago

  * On Slurm clusters, I run array jobs to chunk the problem at the highest level (across time periods), so that I distribute the serial parts of the code across as many array tasks as possible. Usually I can fit at most 3 array tasks into one node, and mostly just one, due to the memory requirements.

To be specific: for the 15k templates I could only fit one array task per node, while for picking with the same tribe I could fit 2 or 3, depending on the node configuration.
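
A minimal sketch of that per-task chunking (dates and chunk length are illustrative):

```python
import os
from obspy import UTCDateTime

DAYS_PER_TASK = 30  # e.g. total days / number of array tasks
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))

# Each Slurm array task works on its own contiguous window of days.
start = UTCDateTime(2022, 1, 1) + task_id * DAYS_PER_TASK * 86400
end = start + DAYS_PER_TASK * 86400
# Read the tribe once per task, then detect over [start, end) on this node.
```
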

calum-chamberlain commented 1 year ago

Thanks for those points.

flixha commented 1 year ago
* I thought we were grouping somewhat sensibly, but it might just be grouping based on processing rather than the most consistent channel sets - this should definitely be done.

I think it's only processing parameters right now, so indeed it would be worth changing that.

* I do similar with splitting chunks of time across nodes. However I have been chunking my datasets into `ndays` where `ndays` is typically `total days / nodes` rather than splitting day by day. In that regard, eagerly processing the next timestep of data would be fine because it would stay on the same node.

If I understand correctly, each node gets a task like "work on these 30 days", is that correct? Indeed, that's how I do it; restarting the process for each day would be too costly because of reading the tribe and some other metadata (but maybe I misunderstood).

* I have been wondering for a while about using [dask's](https://docs.dask.org/en/stable/) schedulers for working across clusters. At some point I would like to explore this more, but I think it might be worth making a new project that uses EQcorrscan. I haven't managed to get the schedulers to work how I expect them to work either.

That's probably the more sensible way to go for now. As an additional point to this: right now it's easy for the user to do their own preparations on all the streams, according to their processing parameters, before calling EQcorrscan's main detection functions; with dask, I imagine the order of things and how to do them would need some thought (set up / distribute multi-node jobs; let the user do some early processing; start preprocessing / detection functions).
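
For what it's worth, a bare-bones sketch of what such a dask-based driver around EQcorrscan could look like (the per-day function and scheduler address are hypothetical):

```python
from dask.distributed import Client

def detect_day(day):
    # stand-in: load one day of data and run tribe.detect on it
    return f"party for {day}"

client = Client("tcp://scheduler:8786")  # or Client() for a local cluster
futures = client.map(detect_day, ["2023-01-01", "2023-01-02"])
parties = client.gather(futures)
```
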

* Agree on duplicated work. When EQcorrscan started I did not anticipate either the scales of template sets that we would end up using it for (I thought a couple of hundred templates was big back in the day!), nor that we might use different parameters within a Tribe - the current wasteful logic is a result of that. I would really like to redesign EQcorrscan (less duplication, more time directly working on numpy arrays rather than obspy traces, making use of multithreading on GIL-releasing funcs, ..., the list goes on) but I don't know if I will ever have time.

I totally understand, and I think it wasn't so easy to anticipate all that. And now I'm happy that EQcorrscan has gotten more and more mature, especially in handling all kinds of edge cases that could mess things up earlier. Using pure numpy arrays would be very nice, but it would also have made debugging all the edge cases harder.

Just some more material for thought: I'm attaching two cProfile output files from the 1-day-per-node runs described above, for a set of 15k templates and 1 day of data:

  1. with `fast_matched_filter` on 4x A100 (4x16 GB GPU memory, 384 GB system memory), group_size=1570 (i.e., 10 groups): detect_15500templates_1_fmf.profile.log
  2. with `fmf2` on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory): detect_15500templates_2_fmf2.profile.log

[Screenshot 2023-03-17 at 09:19:40: snakeviz view of the first profile]
With `snakeviz` this looks like the above and gives some ideas of where a lot of time could be gained (let's ignore tribe reading because it only needs to happen once for multiple days).
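
For completeness, a minimal sketch of how such a profile can be generated and inspected (assuming a main() entry point wrapping the detection run):

```python
import cProfile
import pstats

def main():
    pass  # stand-in for the tribe.detect run being profiled

cProfile.run("main()", "detect_15500templates.profile")
stats = pstats.Stats("detect_15500templates.profile")
stats.sort_stats("cumulative").print_stats(20)
# Visualize interactively with:  snakeviz detect_15500templates.profile
```
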

calum-chamberlain commented 1 year ago

As you have already identified, it looks like a lot of time is spent copying nan-channels. Hopefully more efficient grouping in #541 will help reduce this time-sink.

You did understand me correctly with the grouping of days into chunks. Using this approach we could, for example, have a workflow where each step puts its result in a multiprocessing Queue that the next step queries as its input. This could increase parallelism by letting the necessarily serial components (getting the data, doing some of the prep work, ...) run concurrently with other tasks, with those other tasks ideally implementing parallelism using GIL-releasing threading (e.g. #540), openmp parallelism (e.g. our fftw correlations, and fmf cpu correlations) or gpu parallelism (e.g. fmf or your fmf2).
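
A bare-bones sketch of that queue-chained workflow, with stand-in stage functions rather than EQcorrscan internals:

```python
import multiprocessing as mp

SENTINEL = "STOP"

def downloader(days, outq):
    for day in days:
        outq.put(f"stream for {day}")   # stand-in: fetch a day of waveforms
    outq.put(SENTINEL)

def preparer(inq, outq):
    while (stream := inq.get()) != SENTINEL:
        outq.put(f"prepared {stream}")  # stand-in: serial prep work
    outq.put(SENTINEL)

if __name__ == "__main__":
    days = ["2023-01-01", "2023-01-02", "2023-01-03"]
    q_raw, q_ready = mp.Queue(maxsize=1), mp.Queue(maxsize=1)
    stages = [
        mp.Process(target=downloader, args=(days, q_raw)),
        mp.Process(target=preparer, args=(q_raw, q_ready)),
    ]
    for p in stages:
        p.start()
    # The main process runs the compute-heavy correlation stage as chunks
    # arrive, overlapping with download and prep of the next chunks.
    while (prepared := q_ready.get()) != SENTINEL:
        print("correlating", prepared)  # stand-in: fftw/fmf/gpu correlations
    for p in stages:
        p.join()
```
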