eqcorrscan / EQcorrscan

Earthquake detection and analysis in Python.
https://eqcorrscan.readthedocs.io/en/latest/

What fundamental restrictions of templates, detection & lag-calc do we want to solve in the future? #445

Open flixha opened 3 years ago

flixha commented 3 years ago

I'm opening this issue with the intent to collect information and discuss a suitable implementation strategy. This is very much a work in progress, so I would appreciate any input to add to the current restrictions, related feature wishes, and proposed implementation strategies.

Current restrictions

The current implementation of templates, detection, cross-correlation, and lag-calc has some restrictions, including:

  1. all template channels need the same length
  2. P-picks are expected on the Z-channel and S-picks on horizontal channels (some methods only care a little about this, while for other methods it is fundamental)
  3. only one P- and one S-phase can be picked right now
  4. all channels are weighted equally in cross-correlation
  5. computing relative_magnitudes doesn't work right now because templates don't contain enough information to compute pre-event noise (see #385)
  6. match_filter and lag_calc always use all available channels of a template, with no option to select a subset
  7. once a template is created, match_filter and lag_calc use the full length of the traces in the template for correlation

Is your feature request related to a problem? Please describe.

It would be nice if we could support the following functionality in the future:

  1. define templates with channels of "arbitrary" length
  2. allow templates where P-picks and S-picks can appear on any channel (especially important when using single-component data)
  3. support handling of additional phases like Pg, Pn, etc. in template creation, correlation, and lag-calc picking (we may want support for any kind of phase, e.g., teleseismic phases?)
  4. allow arbitrary weighting of the template channels, and provide some standard ways to weight (e.g., equal weights; weighted by SNR)
  5. store metadata like SNR in the template class
  6. providing templates to match_filter and lag_calc should come with options to run e.g. match_filter on only a subset of the channels, while keeping all channels in the template (e.g., to detect with a subset for computational efficiency, but pick with all available channels)
  7. providing templates to match_filter and lag_calc should come with an option to use only a fraction of the trace length for correlation (e.g., to detect with the full seismogram, but pick with only the phase onset)
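
To make points 1-7 a bit more concrete, here is a rough sketch of what per-channel template metadata could look like. None of this is existing EQcorrscan API; the class and field names (`TemplateChannel`, `weight`, `use_for_detection`, `pick_window`, etc.) are hypothetical and only meant to illustrate the kind of information the template class might need to carry:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class TemplateChannel:
    """Hypothetical per-channel template metadata (not an existing EQcorrscan class)."""
    seed_id: str                    # e.g. "NO.ASK.00.HHZ"
    data: np.ndarray                # waveform samples; length may differ per channel (point 1)
    samp_rate: float                # sampling rate in Hz
    phase_hint: str                 # "P", "S", "Pn", "Pg", ... on any channel (points 2 and 3)
    weight: float = 1.0             # correlation weight, e.g. equal or SNR-based (point 4)
    snr: Optional[float] = None     # SNR stored at template creation (point 5)
    use_for_detection: bool = True  # include this channel in match_filter correlations (point 6)
    pick_window: Optional[Tuple[float, float]] = None  # (start, end) seconds used for picking (point 7)


@dataclass
class SketchTemplate:
    """Container holding the per-channel metadata above."""
    name: str
    channels: List[TemplateChannel] = field(default_factory=list)

    def detection_channels(self) -> List[TemplateChannel]:
        """Channels flagged for use in detection (point 6)."""
        return [chan for chan in self.channels if chan.use_for_detection]
```

With something like this, equal channel lengths, the P-on-Z/S-on-horizontal convention, and equal weighting would stop being structural assumptions and become per-channel attributes that match_filter and lag_calc could interpret.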

Describe the solution you'd like

For now, I'm opening this issue to collect a list of challenges, missing features, and requirements that would all be related, and hopefully addressed, in an update to the template class. I suspect that templates will need to contain some more metadata for us to allow the features described above. I think it would be desirable if we could get away with only one major upgrade to the template class, so that we break compatibility with old templates only during one version update.

Describe alternatives you've considered

None yet - but this issue will likely evolve and I'm planning to fill these in here!

Updates:
2021-03-15: first version
2021-03-17: added points 6 & 7

Relevant previous issues: #169

calum-chamberlain commented 3 years ago

One thing that I would like to be able to do is have a template with all the information in it, but only use certain channels for detection, and then all of them for the subsequent lag-calc. I want to do this to reduce computational requirements for long-running detection runs, while still maintaining the metadata for picking at the end.
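
For what it's worth, a minimal sketch of the user-side selection this implies, assuming the template holds its waveforms as an ordinary obspy Stream (the helper name and the `detection_ids` argument are made up for illustration):

```python
from obspy import Stream


def detection_stream(template_st, detection_ids):
    """Return a reduced Stream with only the channels selected for detection.

    template_st   : the full template Stream (all channels are kept for lag-calc)
    detection_ids : iterable of SEED ids, e.g. {"NO.ASK.00.HHZ", "NO.BER.00.HHZ"}
    """
    reduced = Stream()
    for tr in template_st:
        if tr.id in detection_ids:
            reduced += tr.copy()  # copy so the full template stays untouched
    return reduced

# Detection could run on the reduced Stream, while the original template (with all
# channels and their picks) remains available for lag_calc afterwards.
```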

flixha commented 3 years ago

That's a good point! I added it to the list above, and I also added a point that it would be nice to later be able to ask lag_calc (and maybe match_filter) to use only a fraction or a specific length of the traces in the template. This could make sense if one wants to correlate long traces to achieve efficient MAD-based discrimination of real events vs. misdetections, but then prefers to do the picking with shorter traces to achieve higher single-trace correlation coefficients. This would also make it easier to handle phases that arrive close in time during picking (e.g., Pn and Pg).
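
As a rough illustration of the "long traces for detection, short traces for picking" idea: the helper name, the pick_times mapping, and the window defaults below are all hypothetical, and the trimming just uses standard obspy Trace.trim.

```python
from obspy import Stream


def picking_stream(template_st, pick_times, pre=0.2, post=1.0):
    """Trim copies of the template traces to short windows around the phase onset.

    template_st : obspy Stream with the full-length template traces
    pick_times  : mapping of trace id -> obspy UTCDateTime of the pick on that trace
    pre, post   : seconds to keep before/after the pick (made-up defaults)
    """
    short = Stream()
    for tr in template_st:
        pick = pick_times.get(tr.id)
        if pick is None:
            continue  # no pick on this channel, so skip it for picking
        short += tr.copy().trim(starttime=pick - pre, endtime=pick + post)
    return short
```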

calum-chamberlain commented 3 years ago

While this is open - other things that I want to do include:

  1. Moving away from using obspy Event objects for Detections - there should be a way to get events from detections (e.g. a gettable property on a Detection), but they are quite heavy in memory and have more flexibility and possibilities than we need internally;
  2. Develop faster IO for Tribe and Party objects - a major part of this would be moving away from full obspy Event objects - currently QuakeML reading is really quite slow for large catalogs (as discussed in this issue back in 2017).
    • It might be worth writing a maximum of N events to each file, writing multiple files, and then reading them back via threading (a rough sketch of this is given after this list).
  3. Move away from obspy Trace and Stream objects internally - again, Template objects should have a stream gettable property, but we can just use numpy arrays internally and maybe eventually move to using numba to jit some of the slow code. If we work with numpy arrays directly we can also drop in cupy when appropriate/available to get GPU speed-ups.
    • Side-note: I'm slowly working on ObsPy accelerations using cupy; if anyone is keen to see/test/help then let me know.
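
To make the "N events per file, read back via threading" idea from point 2 concrete, here is a rough sketch built on plain obspy QuakeML IO and concurrent.futures; the file naming and chunk size are arbitrary choices, not a proposed EQcorrscan format:

```python
from concurrent.futures import ThreadPoolExecutor

from obspy import Catalog, read_events


def write_sharded(catalog, basename, events_per_file=200):
    """Write a catalog as multiple QuakeML files with at most events_per_file events each."""
    files = []
    for i in range(0, len(catalog), events_per_file):
        chunk = Catalog(events=catalog.events[i:i + events_per_file])
        fname = f"{basename}_{i // events_per_file:04d}.xml"
        chunk.write(fname, format="QUAKEML")
        files.append(fname)
    return files


def read_sharded(files, max_workers=4):
    """Read the shards concurrently and merge them back into one Catalog."""
    catalog = Catalog()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for chunk in executor.map(read_events, files):
            catalog += chunk
    return catalog
```

Whether threads actually pay off here depends on how much of the QuakeML parsing releases the GIL; a ProcessPoolExecutor would be a drop-in alternative if it does not.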

I think we probably only need a small set of Event attributes, something like a SparseEvent class that has:

  1. Event ID
  2. Picks (time, seed id, phase hint, id)
  3. Origin (latitude, longitude, depth, time, id)
  4. Magnitude (magnitude, type, id)
  5. Maybe other things...?

The SparseEvent should be supplemented by SparsePick, SparseMagnitude and SparseOrigin classes, and should effectively look like an obspy Event (so have things like .preferred_origin(), ...) to reduce the need to rewrite large chunks of code - writing could be to json, with dict deserialisation via a fixed set of values.
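
A bare-bones sketch of what such classes could look like, using dataclasses and a json round-trip via dataclasses.asdict as one possible serialisation (field names follow the list above; nothing here is settled):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SparsePick:
    time: float                 # e.g. a POSIX timestamp (UTCDateTime.timestamp)
    seed_id: str
    phase_hint: str
    id: Optional[str] = None


@dataclass
class SparseOrigin:
    latitude: float
    longitude: float
    depth: float
    time: float
    id: Optional[str] = None


@dataclass
class SparseMagnitude:
    magnitude: float
    type: str
    id: Optional[str] = None


@dataclass
class SparseEvent:
    resource_id: str
    picks: List[SparsePick] = field(default_factory=list)
    origins: List[SparseOrigin] = field(default_factory=list)
    magnitudes: List[SparseMagnitude] = field(default_factory=list)
    preferred_origin_id: Optional[str] = None

    def preferred_origin(self):
        """Look like obspy's Event.preferred_origin() so downstream code keeps working."""
        for origin in self.origins:
            if origin.id == self.preferred_origin_id:
                return origin
        return None

    def to_json(self):
        # asdict recurses into the nested dataclasses, giving plain dicts and lists.
        return json.dumps(asdict(self))


def sparse_event_from_json(blob):
    """Deserialise the fixed set of keys back into the sparse classes."""
    d = json.loads(blob)
    return SparseEvent(
        resource_id=d["resource_id"],
        picks=[SparsePick(**p) for p in d["picks"]],
        origins=[SparseOrigin(**o) for o in d["origins"]],
        magnitudes=[SparseMagnitude(**m) for m in d["magnitudes"]],
        preferred_origin_id=d.get("preferred_origin_id"),
    )
```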

flixha commented 3 years ago
  1. Sounds like a great idea to have `SparseEvent` etc. for internal usage. I could see that it may be useful for some cases if it also contained `SparseAmplitude` and `SparseStationMagnitude`.
  2. I guess there should be a way to put the files written in parallel into one single .tgz-archive in the end, which would keep things really simple for the user.
  3. Probably a good idea, especially considering how much time is spent in serial Python code when preparing traces and streams for further processing. I'm wondering whether the numpy representations would be suitable for some more vectorized operations.
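
As a toy example of the kind of vectorised representation point 3 hints at, assuming the selected traces share a sampling rate and have already been trimmed or padded to a common length (which ties back to restriction 1 in the original post); the helper names are made up:

```python
import numpy as np
from obspy import Stream, Trace


def stream_to_array(st):
    """Stack a Stream into a (n_channels, n_samples) float32 array plus the trace ids.

    Assumes the traces have already been trimmed/padded to equal length.
    """
    ids = [tr.id for tr in st]
    data = np.stack([tr.data for tr in st]).astype(np.float32)
    return data, ids


def array_to_stream(data, ids, samp_rate):
    """Rebuild an obspy Stream from the array representation."""
    traces = []
    for row, seed_id in zip(data, ids):
        net, sta, loc, cha = seed_id.split(".")
        traces.append(Trace(data=row, header={
            "network": net, "station": sta, "location": loc,
            "channel": cha, "sampling_rate": samp_rate}))
    return Stream(traces=traces)


# Per-channel operations then become single vectorised numpy calls, e.g. demeaning:
# data -= data.mean(axis=1, keepdims=True)
```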