eqcorrscan / EQcorrscan

Earthquake detection and analysis in Python.
https://eqcorrscan.readthedocs.io/en/latest/

custom process-function options #273

Open calum-chamberlain opened 6 years ago

calum-chamberlain commented 6 years ago

Is your feature request related to a problem? Please describe. Gaps in seismic data cause most of the issues with normalisation of correlations. EQcorrscan's pre_processing functions take care of gaps pretty well now, but it would be good to expose users to how gaps are handled so that they can easily write their own custom process functions (e.g. adding processing steps, or using a different type of filtering, decimating rather than resampling...) that also handle gaps in the way the correlation functions expect.

It would also be useful if the match_filter objects allowed a custom process-function to be specified (in a similar way that users can specify any correlation function they want). This would allow more people to use those objects.
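A hedged sketch of how such a keyword could look, using a hypothetical `run_detection` entry point (the names `run_detection`, `process_func`, and the default fallback are illustrative assumptions, not EQcorrscan's actual API; traces are modelled as plain lists with `None` marking gap samples):

```python
def run_detection(stream, process_func=None):
    """Hypothetical detection entry point: if the caller supplies
    process_func, use it for pre-processing; otherwise fall back to a
    built-in default (both names are illustrative, not EQcorrscan API)."""
    def default_process(st):
        # Stand-in for EQcorrscan's inbuilt gap-aware pre-processing.
        return [0.0 if v is None else v for v in st]

    processor = process_func if process_func is not None else default_process
    return processor(stream)

# A user-supplied processor, e.g. decimating instead of resampling:
decimate_by_two = lambda st: st[::2]

run_detection([1.0, None, 3.0])                                # default path
run_detection([1.0, 2.0, 3.0, 4.0], process_func=decimate_by_two)
```

This mirrors how the correlation function is already pluggable: the object checks for a user-supplied callable and only falls back to the built-in when none is given.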

Describe the solution you'd like

  1. Refactor gap handling as a context manager;
  2. Provide a new keyword argument, process_func, for match-filter objects.

Both would require docs and tutorials to make it clear how they should be used - in general the docs are in real need of a tidy.

  1. pre_processing._fill_gaps and pre_processing._zero_pad_gaps would be repurposed as __enter__ and __exit__ functions on a HandleGaps context manager. The API would end up looking something like this:

from eqcorrscan.pre_processing import HandleGaps

with HandleGaps(tr):
    custom_processing(tr)

  2. Would be fairly simple: add an extra argument, and check when calling the processing functions whether it has been set; otherwise use the inbuilt processing functions.

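A minimal sketch of what that context manager could look like. The class body is an assumption, not EQcorrscan code: a trace is modelled here as a plain list of samples with None marking gap samples, whereas the real _fill_gaps/_zero_pad_gaps operate on ObsPy Trace objects.

```python
class HandleGaps:
    """Sketch of the proposed context manager (illustrative only).
    A trace is modelled as a plain list, with None marking gap samples."""

    def __init__(self, data, fill_value=0.0):
        self.data = data
        self.fill_value = fill_value
        self._gap_idx = []

    def __enter__(self):
        # Equivalent of _fill_gaps: remember where the gaps are, then
        # make the data continuous so filtering/resampling can run.
        self._gap_idx = [i for i, v in enumerate(self.data) if v is None]
        for i in self._gap_idx:
            self.data[i] = self.fill_value
        return self.data

    def __exit__(self, exc_type, exc, tb):
        # Equivalent of _zero_pad_gaps: put zeros back over the gap
        # samples, even if the user's processing raised.
        for i in self._gap_idx:
            self.data[i] = 0.0
        return False  # do not swallow exceptions

tr = [1.0, None, 3.0]
with HandleGaps(tr) as data:
    data[:] = [v * 2 for v in data]  # any custom in-place processing
# tr is now [2.0, 0.0, 6.0]: processed everywhere, zeroed over the gap.
```

Because the zero-padding lives in __exit__, it runs even when the user's processing raises, which is the main guarantee a context manager buys here.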
d-chambers commented 6 years ago

Hey @calum-chamberlain,

This looks interesting. I love the idea of simply doing the thing most people will want by default, but allowing users to modify the default behaviour when needed. A few thoughts/questions:

  1. It's probably better to have the __enter__ method return a trace/stream rather than assuming it will operate in place. This will allow the logic of the pre-processor to operate in place or not, and then just return the resulting object. So the API could look something like this:

from eqcorrscan.pre_processing import HandleGaps

with HandleGaps(tr) as trr:
    custom_processing(trr)

  2. What would the clean-up of the context manager do? The main strength of the context manager is ensuring the __exit__ method gets called regardless of unhandled exceptions. Is there something you had in mind that needs to happen once the HandleGaps scope exits, or could a function call suffice to save a level of indentation?

  3. Users may want to have several pre-processing methods in a particular order. It may be useful to provide a way to chain them together. Something similar to scikit-learn's Pipeline maybe?
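The chaining idea could be sketched along the lines of scikit-learn's Pipeline. The make_pipeline helper and the toy steps below are hypothetical, operating on trace-like lists of samples:

```python
from functools import reduce

def make_pipeline(*steps):
    """Hypothetical chaining helper: each step takes and returns a
    trace-like object, so steps compose left to right."""
    def run(tr):
        return reduce(lambda data, step: step(data), steps, tr)
    return run

# Toy steps standing in for filter/resample/etc.:
demean = lambda d: [v - sum(d) / len(d) for v in d]
double = lambda d: [v * 2 for v in d]

process = make_pipeline(demean, double)
process([1.0, 2.0, 3.0])  # -> [-2.0, 0.0, 2.0]
```

The resulting callable is a single process function, so it would slot straight into a process_func-style keyword argument.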

calum-chamberlain commented 6 years ago

Thanks for those @d-chambers (also thanks for the book recommendation, I'm chewing my way through it, and almost every page has something of great interest!).

At the moment, the function _fill_gaps is run before filtering and resampling, and _zero_pad_gaps is used after processing to fill the gaps found by _fill_gaps with zeros. I was thinking that _fill_gaps would be the equivalent of an __enter__ and _zero_pad_gaps would be used as __exit__.

  1. That pipeline idea looks interesting, not sure how I would implement it, but could be something fun in the future.
d-chambers commented 6 years ago

No problem, that book is incredible, I learned a ton from it. There are still parts of it, especially the async stuff, that I am struggling to wrap my head around.

Ok so _zero_pad_gaps actually acts on the resulting correlogram correct? Ya, that makes sense to me.

calum-chamberlain commented 6 years ago

Ah, no, _zero_pad_gaps just works on the trace data... this would just encompass a pre-processing (filter and resample, not correlate) process... Does that make sense? The flow is something like:

  1. Read in data that has some gaps into a Stream with multiple segments;
  2. Call _fill_gaps to make the data continuous;
  3. Filter, resample and anything else;
  4. Call _zero_pad_gaps to cut out data from the gap positions determined by _fill_gaps, and replace with zeros.
  5. Call match-filter, the correlation function returns zeros when there are fewer than two non-zero samples in the correlation window.

I was imagining having step 2 as __enter__ and step 4 as __exit__. It's not easy to edit the correlogram because the stacked correlogram is returned for memory efficiency.
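The flow above can be sketched end to end. The fill_gaps and zero_pad_gaps functions below are illustrative stand-ins (gaps modelled as None samples in a plain list), not the private pre_processing functions themselves:

```python
def fill_gaps(data):
    """Step 2 (sketch): make the data continuous, remembering where
    the gaps were so they can be zeroed again later."""
    gaps = [i for i, v in enumerate(data) if v is None]
    filled = [0.0 if v is None else v for v in data]
    return filled, gaps

def zero_pad_gaps(data, gaps):
    """Step 4 (sketch): overwrite the (now processed) gap samples
    with zeros before correlation."""
    return [0.0 if i in gaps else v for i, v in enumerate(data)]

data = [1.0, None, None, 4.0]          # step 1: data read in with gaps
filled, gaps = fill_gaps(data)         # step 2: make continuous
processed = [v + 1.0 for v in filled]  # step 3: filter/resample stand-in
ready = zero_pad_gaps(processed, gaps) # step 4: gaps back to zero
# ready == [2.0, 0.0, 0.0, 5.0]; step 5 would correlate on `ready`.
```

Zeroing after processing matters because filtering smears energy into the gap; the zeros are what let the correlation function recognise and skip those windows.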

d-chambers commented 6 years ago

So the context manager is specifically to enforce _fill_gaps being called first and _zero_pad_gaps being called last in the preprocessing correct? I can see why that would be useful and it does seem like a good fit for a context manager to me.

calum-chamberlain commented 6 years ago

Yup, that's it - I'm hoping that it would be simple for people to use as well - it's the only bit of the processing functions that I would say is really required for the correlation functions. Everything else could/should be personal preference.

calum-chamberlain commented 5 years ago

Playing around more with this, I don't think the context manager fits and I'm just going to expose the (previously "private") gap handling functions.

calum-chamberlain commented 5 years ago

Working on adding custom processing functions here