flixha opened 1 year ago
I have now added 5 PRs for the upper list that you can have a look at - the links to the PRs are added to the list.
I'll prepare the PRs for the lower list (points 8-10) in a bit; these could maybe benefit from some extra ideas.
Awesome @flixha - thanks for this! Sorry for no response - hopefully you received at least one out of office email from me - I have been in the field for about the last month and it is now our shutdown period so I likely won't get to this until January. Sorry again, and thanks for all of this - this looks like it is going to be very valuable!
Hey @flixha - Happy new year! I'm not back at work yet, but I wanted to get the CI tests running again to try and help you out with those PRs - it looks like a few PRs are either failing or need tests. Happy to have more of a look in the next week or so.
Happy New Year @calum-chamberlain ! Yes I did see that you were in the field, and since I posted the PRs just before the holiday break I certainly did not expect a very quick reply - was just a good way for me to finally cut these changes into compact PRs. Thanks a lot for already fixing the tests and going through all the PRs!
I'm looking into two more points right now which can slow down things for larger datasets:

1. `matched_filter` loops over all templates to make families with the related detections. When adding each family to the party (see this line), there is another loop in which the template of the family being added is compared against each template that already forms a family in the party. That comparison checks all properties, including the trace data and event properties, so it is not cheap, and it scales as O(n²) with the number of templates/families: in an example with 15500 templates there were 544,561,637 calls to `template.__eq__`, costing 250 seconds in total. The full comparison shouldn't be needed in `matched_filter` as long as the tribe doesn't contain duplicated templates - and we have other checks for that already - so a simple concatenation without the checks should suffice: `party.families.append(family)` instead of `party += family` (see the sketch below).
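Roughly what I mean (a simplified sketch, assuming the usual `Party` / `Family` classes and that the templates in the tribe are already unique; not the actual PR code):

```python
from eqcorrscan.core.match_filter import Family, Party

def fast_build_party(templates, detections_per_template):
    """Build a Party by appending families directly, skipping the per-template
    equality checks that Party.__iadd__ runs against every existing family."""
    party = Party()
    for template, detections in zip(templates, detections_per_template):
        family = Family(template=template, detections=detections)
        party.families.append(family)  # plain list append, no template.__eq__ calls
        # party += family              # would compare against every family already in the party
    return party
```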
2. `utils.correlate._get_array_dicts` takes 400 seconds to prepare the data, which is in part due to handling a lot of `UTCDateTime` objects, sorting the traces in each template, and other list / numpy operations. I'm looking into how to speed that up... (PR here: https://github.com/eqcorrscan/EQcorrscan/pull/536)

@flixha - do you make use of the `group_size` argument to split your large template sets into groups for matched-filtering? I am starting to work on changing how the different steps of the process are executed to try and make better use of cpu time.
As you well know, a lot of the operations before getting to the correlation stage are done in serial, and a lot of them do not make sense to port to parallel processing. What I am thinking about is implementing something similar to how seisbench uses either asyncio or multiprocessing to process the next chunk of data while other operations are taking place. See their baseclass for an example of how they implement this. I don't think using asyncio would make sense for EQcorrscan because most of the slow points are not io bound, so using multiprocessing makes more sense. By doing this we should spend less time in purely serial code, and should see speed-ups (I think). In part this is motivated by me getting told off by HPC maintainers for not hammering their machines enough!
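Very roughly, I am imagining something like this (placeholder function names, just to illustrate the pattern, not a proposed API):

```python
from concurrent.futures import ProcessPoolExecutor

def get_and_process_chunk(times):
    """Placeholder: download and pre-process one chunk of waveform data."""
    ...

def detect_on_chunk(stream):
    """Placeholder: the CPU/GPU-heavy correlation step on a prepared chunk."""
    ...

def run(chunk_times):
    # Prefetch the next chunk in a worker process while the current chunk
    # is being correlated in the main process.
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(get_and_process_chunk, chunk_times[0])
        for next_times in list(chunk_times[1:]) + [None]:
            stream = future.result()
            if next_times is not None:
                future = pool.submit(get_and_process_chunk, next_times)
            detect_on_chunk(stream)
```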
I'm keen to do this and there are two options I see:

* In `Tribe.client_detect`, such that the next chunk of data is downloaded and processed in advance.
I'm not sure that these two options are mutually exclusive, but it would take some thinking to make sure that I'm not trying to create child processes from child processes.

Keen to hear your thoughts. I think this would represent a fairly significant refactor of the code in `Tribe` and `matched_filter`, so I was hoping to work on this once we have merged all of your speed-ups to avoid conflicts.
Hi @calum-chamberlain ,
yes, I do use `group_size` in more or less all the bigger problem sets, as I'd otherwise run out of memory. For a set of 15k templates I was using `group_size=1570` (i.e., 10 groups) with `fast_matched_filter` on 4x A100 (4x16 GB GPU memory, 384 GB system memory), while with `fmf2` on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory) I was using `group_size=5500` (3 groups; caveat: fmf and fmf2 may duplicate some memory unnecessarily across multiple GPUs, but that's a minor thing) - just as some examples (I don't remember the biggest fftw examples right now). So the preparation of the data before starting the correlations for a group does take quite a bit of time, and it's good to hear you're thinking about this.
Just some quick thoughts (may add more tomorrow when I'm fully awake ;-)):
* On Slurm clusters, I run array jobs to chunk the problem at the highest level (across time periods), so that I distribute the serial parts of the code across as many array tasks as possible. Usually I can fit at most 3 array tasks into one node, and mostly just one, due to the memory requirements.
To specify: for the 15k templates I could only fit one array task per node; while for picking with the same tribe I could fit 2 or 3, depending on the node configuration.
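The chunking itself is simple on the Python side - roughly like this (illustrative sketch, not my actual job script):

```python
import os
from obspy import UTCDateTime

# Each Slurm array task works on its own block of days, so the tribe and other
# metadata are read once per task rather than once per day.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
days_per_task = 30                      # e.g. total days / number of array tasks
t0 = UTCDateTime(2019, 1, 1) + task_id * days_per_task * 86400  # placeholder start date

for day in range(days_per_task):
    starttime = t0 + day * 86400
    endtime = starttime + 86400
    # ... read / download data for [starttime, endtime) and run tribe.detect ...
```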
Thanks for those points.
* I do similar with splitting chunks of time across nodes. However I have been chunking my datasets into `ndays` where `ndays` is typically `total days / nodes` rather than splitting day by day. In that regard, eagerly processing the next timestep of data would be fine because it would stay on the same node.

If I understand correctly, each node gets as a task e.g. "work on these 30 days", is that correct? Indeed that's how I do it; restarting the process for each day would be too costly for reading the tribe and some other metadata (but maybe I misunderstood).

* I thought we were grouping somewhat sensibly, but it might just be grouping based on processing rather than the most consistent channel sets - this should definitely be done.

I think it's only processing parameters right now, so indeed it should be worth changing that.
* I have been wondering for a while about using [dask's](https://docs.dask.org/en/stable/) schedulers for working across clusters. At some point I would like to explore this more, but I think it might be worth making a new project that uses EQcorrscan. I haven't managed to get the schedulers to work how I expect them to work either.
That's probably the more sensible way to go for now. As an additional point to this: right now it's easy for the user to do their own preparations on all the streams according to their processing before calling EQcorrscan's main detection functions; with dask I imagine the order of things and how to do them would need some thought (set up / distribute multi-node jobs; let the user do some early processing; start preprocessing / detection functions).
* Agree on duplicated work. When EQcorrscan started I did not anticipate either the scales of template sets that we would end up using it for (I thought a couple of hundred templates was big back in the day!), nor that we might use different parameters within a Tribe - the current wasteful logic is a result of that. I would really like to redesign EQcorrscan (less duplication, more time directly working on numpy arrays rather than obspy traces, making use of multithreading on GIL-releasing funcs, ..., the list goes on) but I don't know if I will ever have time.
I totally understand, and I think it wasn't so easy to anticipate all that. And now I'm really happy that EQcorrscan has gotten more and more mature, especially in handling all kinds of edge cases that could mess things up earlier. Using pure numpy arrays would be very nice, but it would also have made debugging all the edge cases harder.
Just some more material for thought: I'm attaching two cProfile output files from a 1-day-per-node run as described above, for a set of 15k templates and 1 day of data.

With `snakeviz` this looks like the above and gives some ideas of where a lot of time could be gained (let's ignore tribe reading because it only needs to happen once for multiple days).
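(For completeness: the profiles are produced with something like the call below and then opened with snakeviz - an illustrative call with placeholder arguments, not the exact run.)

```python
import cProfile

# tribe and st prepared beforehand (placeholders); the resulting .prof file can
# then be opened with `snakeviz detect_1day.prof`.
cProfile.run(
    "tribe.detect(stream=st, threshold=8.0, threshold_type='MAD', trig_int=2.0)",
    filename="detect_1day.prof",
)
```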
As you have already identified, it looks like a lot of time is spent copying nan-channels. Hopefully more efficient grouping in #541 will help reduce this time-sink.
You did understand me correctly with the grouping of days into chunks. Using this approach we could, for example, have a workflow in which each step puts its result in a multiprocessing Queue that the next step queries as its input. This could increase parallelism by letting the necessarily serial components (getting the data, doing some of the prep work, ...) run concurrently with other tasks, with those other tasks ideally implementing parallelism using GIL-releasing threading (e.g. #540), openmp parallelism (e.g. our fftw correlations, and fmf cpu correlations) or gpu parallelism (e.g. fmf or your fmf2).
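As a very rough sketch of the kind of queued workflow I mean (placeholder functions, not a design proposal):

```python
from multiprocessing import Process, Queue

def prepare_chunk(times):
    """Placeholder serial stage: get the data for one time chunk and do the prep work."""
    return times  # stand-in for a prepared data structure

def correlate_chunk(chunk):
    """Placeholder heavy stage: threaded / openmp / GPU correlations."""
    pass

def producer(chunk_times, out_queue):
    for times in chunk_times:
        out_queue.put(prepare_chunk(times))
    out_queue.put(None)  # sentinel: no more chunks

def run(chunk_times):
    queue = Queue(maxsize=2)  # keep at most two prepared chunks in memory
    prep = Process(target=producer, args=(chunk_times, queue))
    prep.start()
    while (chunk := queue.get()) is not None:
        correlate_chunk(chunk)
    prep.join()

if __name__ == "__main__":
    run([("2019-01-01", "2019-01-02"), ("2019-01-02", "2019-01-03")])
```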
This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because it's rather the sum of things together that take extra time, none of these slowdowns changes EQcorrscan fundamentally.
I will create some PRs for each point for which I have a suggested solution, so that we can systematically merge, improve, or reject the suggested solutions. Will add the links to the PRs here, but I'll need to organize a bit for that.
Here are some slowdowns (tests with python 3.11, all in serial code):

* `tribe._group_templates`: 50x speed up
* `preprocessing._prep_data_for_correlation`: 3x speed up for the function
* `matched_filter.match_filter` with `copy_data=True`: 3x speedup for the copy
* `detection`, `lag_calc`, `pre_processing`: 4x / 100x speedup for trace selection
* `core.match_filter.family._uniq`: 1.9x speedup using `list(set())` (1.9x speedup for 43000 detections, fastest: 3.1 s), but 1.2x slower for small sets (e.g., 430 detections; 50 ms --> 27 ms)
* `core.match_filter.detect`: 1000x speed up for many calls to `family._uniq`. Calling `family._uniq` in a loop over all families is still rather slow with `_uniq`; checking tuples of `(detection.id, detection.detect_time, detection.detect_val)` with `numpy.unique` and avoiding the loop is 1000x faster: from 752 s to <1 s for 82000 detections (see the sketch after this list).
* `matched_filter._group_detect`: 30x speedup in handling detections
* Adding `prepick` to many picks (e.g., 400k) is somewhat slow because of `UTCDateTime` overhead: ~4x speedup by adding to `pick.time.ns` directly (see the sketch after this list).
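To make the `numpy.unique` and `pick.time.ns` points concrete (simplified sketches of the idea, assuming a recent obspy where `UTCDateTime` exposes integer nanoseconds via `.ns`; not the actual PR code):

```python
import numpy as np
from obspy import UTCDateTime

def unique_detections(detections):
    """Deduplicate detections via string keys and numpy.unique instead of
    comparing Detection objects pairwise in a loop."""
    keys = np.array([f"{d.id}_{d.detect_time}_{d.detect_val}" for d in detections])
    _, first_idx = np.unique(keys, return_index=True)
    return [detections[i] for i in sorted(first_idx)]

def add_prepick(picks, prepick):
    """Shift many pick times by `prepick` seconds via integer nanoseconds,
    avoiding repeated UTCDateTime arithmetic."""
    shift_ns = int(prepick * 1e9)
    for pick in picks:
        pick.time = UTCDateTime(ns=pick.time.ns + shift_ns)
    return picks
```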
**Is your feature request related to a problem? Please describe.**
All the points in the upper list occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it's wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template matching correlations etc.
Three more slowdowns where parallelization can help:
* `core.match_filter`: 2.5x speedup for the MAD threshold calc. `np.median(np.abs(cccsum))` for each cccsum takes a lot of time when there are many cccsum in cccsums. The only quicker solution I found was to parallelize the operation, which surprisingly could speed up problems bigger than ~15 cccsum already. The speedup is only ~2.5x, so even though that matters a lot for many cccsum (e.g., 2000: 20 s vs 50 s), it feels like this has potential for even more speedup (see the sketch below).
* `detection._calculate_event`: 35% speedup in parallel
* `utils.catalog_to_dd.write_correlations`: 20% speedup using some shared memory
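For the MAD point, the parallel version is essentially just this (sketch; the actual PR may differ):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _mad(cccsum):
    # Median absolute value of one cross-correlation sum.
    return np.median(np.abs(cccsum))

def mad_thresholds(cccsums, threshold_input=8.0, cores=4):
    """Compute the MAD-based threshold for each cccsum in parallel.
    Run under an `if __name__ == "__main__":` guard when using processes."""
    with ProcessPoolExecutor(max_workers=cores) as pool:
        mads = np.array(list(pool.map(_mad, cccsums, chunksize=16)))
    return threshold_input * mads
```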