eqcorrscan / EQcorrscan

Earthquake detection and analysis in Python.
https://eqcorrscan.readthedocs.io/en/latest/

Template and self-detection are different #573

Open HGeoAle opened 4 months ago

HGeoAle commented 4 months ago

obsplus version 0.3.0, eqcorrscan version 0.5.0, obspy version 1.4.1, numpy version 1.26.4

After noticing that the summed correlation detection value for a self-detection was not exactly the number of channels (each channel's correlation should be 1), I experimented to find out why. I took one event, built my tribe, ran detections with return_stream=True, processed the returned stream, and used it to extract the self-detection stream. I found the following: of 61 traces, 58 have the same values (trace.data), but they are consistently different in length. The template traces start one time delta earlier and the self-detection traces are one sample longer.

To Reproduce: once I have the catalog and bank loaded, I do the following:

from obspy import UTCDateTime
from eqcorrscan.core.match_filter import Tribe
from eqcorrscan.utils.pre_processing import multi_process

starttime = UTCDateTime("2020-01-08 00:00:00.00")
endtime = starttime + 86383
tribe = Tribe().construct(
    method="from_client",
    client_id=bank,
    catalog=cat,
    lowcut=2.0,
    highcut=16.0,
    samp_rate=50.0,
    filt_order=4,
    length=10.0,
    prepick=3,
    swin="all",
    all_horiz=False,
    min_snr=4.0,
    parallel=True,
    daylong=True)
party, st = tribe.client_detect(
    client=bank,
    starttime=starttime,
    endtime=endtime,
    threshold=20,
    threshold_type="absolute",
    trig_int=1.0,
    return_stream=True,
    daylong=True,
    )
processed_st = multi_process(st.copy(), lowcut=2, highcut=16, filt_order=4, samp_rate=50, daylong=True) # process stream with the same parameters
family = party[0]
detections = family.detections
highest_index = max(range(len(family.detections)), key=lambda i: family.detections[i].detect_val)  # Find the self-detection
template_stream = tribe[0].st.copy()
self_detection_stream = family.detections[highest_index].extract_stream(processed_st, length=10, prepick=3)

for tr1, tr2 in zip(template_stream, self_detection_stream):
    print(len(tr1.data))
    print(len(tr2.data))
    print(tr1.data[:5])
    print(tr2.data[:5])
    print("-------------")

Expected behavior: I expected the self-detection and the template to be identical, and the correlation coefficient sum to be exactly the number of channels.


Additional context: I also noticed that the multi_process step must use daylong=True for this to work at all; otherwise the traces are totally different.

calum-chamberlain commented 3 months ago

The correlation sum will frequently not be exactly equal to the number of channels for a self-detection due to small rounding errors; you should expect the correlation sum to be close to the number of channels, though. However, from your description it sounds like you think there is an issue in how the data are processed.
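As a rough illustration of the scale of those rounding errors (a standalone numpy sketch, not EQcorrscan's correlation code), even a normalised correlation of a float32 trace with itself is typically close to, but not exactly, 1.0:

import numpy as np

# Normalised correlation of a float32 trace with itself: the two routes
# to the denominator round differently, so the result is near 1.0 but
# often not exact. Summed over N channels this gives a value near N
# rather than exactly N.
rng = np.random.default_rng(42)
x = rng.normal(size=500).astype(np.float32)
x -= x.mean()
cc = np.dot(x, x) / (np.linalg.norm(x) ** 2)
print(cc)  # e.g. 0.99999994 rather than exactly 1.0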

Without having access to your bank and catalog it is impossible to verify this. Can you please provide an example that I can test and debug, either using an open online catalog or by providing the necessary data and a complete working example?

The difference may stem from how the start and end-times are handled given that your end time is < 86400 seconds after your start time (and hence not a full day).

HGeoAle commented 2 months ago

Sorry for the long wait. I needed to check the specific conditions of my data that were producing the errors in order to reproduce them and explain myself better.

Here you could find a data set with one event, few stations and waveforms to use to reproduce the problems I had: https://www.dropbox.com/scl/fi/hoc0n9sclyja9trmgi3tl/waveform_bugreport.zip?rlkey=x3ks0p93vmtl5vomgz1sbylx4&st=yzgzrr7j&dl=0

This includes a script that reproduces the problems, a catalog with one event, the picks for that event, and two days of waveforms from one station.

I found three problems; the first two are related to the issue described above:

1) Trying to process chunks of exactly 86400 seconds of data with daylong=True. When using daylong=True for detection and using a client with exactly one day of data or more, an error arises:

File "(..)/bug_report/bug_report.py", line 95, in <module>
    party, st = tribe.client_detect(
                ^^^^^^^^^^^^^^^^^^^^
  File "(...)/eqcorrscan/core/match_filter/tribe.py", line 1448, in client_detect
    party += self.detect(stream=st, pre_processed=False,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "(...)/eqcorrscan/core/match_filter/tribe.py", line 879, in detect
    party = self._detect_serial(*args, **inner_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "(...)/eqcorrscan/core/match_filter/tribe.py", line 914, in _detect_serial
    st_chunks = _pre_process(
                ^^^^^^^^^^^^^
  File "(...)/eqcorrscan/core/match_filter/helpers/tribe.py", line 241, in _pre_process
    st_chunks = _group_process(
                ^^^^^^^^^^^^^^^
  File "(...)/eqcorrscan/utils/pre_processing.py", line 921, in _group_process
    if _endtime < stream[0].stats.endtime:
                  ~~~~~~^^^
  File "/obspy/core/stream.py", line 645, in __getitem__
    return self.traces.__getitem__(index)

In this instance, the error happens because the data are exactly 86400 seconds long (from 2020-01-08 00:00:00 to 2020-01-09 00:00:00). When processing chunks of data, the code evaluates the portion of stream left, in eqcorrscan/utils/pre_processing.py, line 921:

if _endtime < stream[0].stats.endtime:
    Logger.warning(
        "Last bit of data between {0} and {1} will go unused "
        "because it is shorter than a chunk of {2} s".format(
            _endtime, stream[0].stats.endtime, process_length))

This is after trimming the already-processed chunks from the stream. If the last chunk happens to finish exactly where the stream finishes, the final trim removes the last portion of data and the stream becomes empty. An empty stream has no traces, so accessing stream[0].stats.endtime raises a list index out of range error.
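A minimal guard for this (a sketch of one possible fix, not the actual patch) would be to check that the stream still holds traces before inspecting it:

# Hypothetical guard: only warn about unused data if traces remain
# after trimming off the processed chunks.
if len(stream) and _endtime < stream[0].stats.endtime:
    Logger.warning(
        "Last bit of data between {0} and {1} will go unused "
        "because it is shorter than a chunk of {2} s".format(
            _endtime, stream[0].stats.endtime, process_length))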

I found several ways around this error: 1) use a number of seconds of data different from 86400 (as in the example in the issue description above); 2) use a bank with only one day of data that doesn't have samples at or beyond midnight of the next day; 3) don't use a bank and simply load the traces from files, with different files for different days; 4) don't use daylong=True when detecting (see the sketch below).
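For reference, workaround 4 is just the detection call from the first post without the daylong flag (other arguments unchanged):

# Workaround 4: let client_detect use its default chunking rather than
# forcing daylong processing.
party, st = tribe.client_detect(
    client=bank,
    starttime=starttime,
    endtime=endtime,
    threshold=20,
    threshold_type="absolute",
    trig_int=1.0,
    return_stream=True,
)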

2) Fixing lengths shifts the templates by one sample. The processing code does a "length sanitation" several times in different parts of the code, for example at line 130 of the pre_processing.py module:

if len(tr.data) == ((endtime - starttime) *
                    tr.stats.sampling_rate) + 1:
    Logger.info(f"{tr.id} is overlength dropping first sample")
    tr.data = tr.data[1:len(tr.data)]
    # TODO: this should adjust the start-time
    # tr.stats.starttime += tr.stats.delta

This length sanitation drops the first sample but doesn't modify the starttime of the trace, so the second sample moves to the starttime, effectively shifting the trace one sample back in time. This happens when using daylong=True and the starttime and endtime cover exactly 86400 seconds, with one sample at the beginning and one at the end: for instance, from 2020-01-08 00:00:00 (first sample) to 2020-01-09 00:00:00 (last sample).
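Applying the commented-out TODO line would keep the remaining samples at their original absolute times (a sketch of what that comment suggests, not a released fix):

# Drop the extra first sample AND advance the start-time by one delta,
# so the second sample stays at its original absolute time.
tr.data = tr.data[1:]
tr.stats.starttime += tr.stats.delta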

For now I accept the one-sample shift, so my self-detections usually occur 0.02 seconds away from the template. I accept it because, under these conditions, the template and the self-detection traces are exactly the same except for the one-sample shift.

3) Lag-calc with a specific shift_len leads to dropped channels. I found this when applying lag-calc after getting the detections. If the shift_len exactly divides the length of the template, some channels are dropped and not used. This leads to many detections ending up with very few or no picks in the new catalog. I could not find exactly where this error occurs, but I suspect it is somewhere in the data preparation, which apparently discards some channels.

In the given example with one station, the channels for that station are discarded, leading to a catalog with no picks whatsoever. This message is given: 2024-07-10 11:20:44,869 eqcorrscan.core.lag_calc WARNING No appropriate data found, check your family and detections - make sure seed ids match. The workaround I found for this problem is simply to use a different shift_len (0.25 triggers the problem, 0.26 doesn't), as in the sketch below.
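In code, the workaround is simply a different shift_len in the lag-calc call (a sketch; min_cc is an illustrative value, not from my actual run):

# shift_len=0.25 exactly divides the 10 s template and drops channels;
# 0.26 does not trigger the problem.
repicked_catalog = party.lag_calc(
    stream=st,
    pre_processed=False,
    shift_len=0.26,  # 0.25 triggers the dropped-channel problem
    min_cc=0.4,      # illustrative threshold
)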

calum-chamberlain commented 2 months ago

Thanks for all your work documenting those issues and sorry once again that you ran into them!

1) The solution of not using daylong is the right one - daylong is a leftover from earlier versions of EQcorrscan and I should have deprecated it. In general I would recommend that you do not use daylong at any point, but set process_len=86400 when creating the tribe; this will then be carried through. See #580 for the PR that now removes the daylong option.
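Concretely, that means building the tribe like this (a sketch reusing the parameters from the original report, with process_len in place of daylong):

# Set process_len when constructing the tribe; the processing length is
# then carried through to detection.
tribe = Tribe().construct(
    method="from_client",
    client_id=bank,
    catalog=cat,
    lowcut=2.0,
    highcut=16.0,
    samp_rate=50.0,
    filt_order=4,
    length=10.0,
    prepick=3.0,
    swin="all",
    all_horiz=False,
    min_snr=4.0,
    parallel=True,
    process_len=86400,  # instead of daylong=True
)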

For context, the daylong argument was included to expedite chunking of data into the commonly stored day-long chunks. This is no longer needed now that clients manage data access. Furthermore, splitting data at day boundaries with no overlap can result in missed detections close to day boundaries. The current implementation enforces overlaps at the start and end of chunks to avoid missing detections between chunks.

2) I'm not sure why I left that bit undone at line 130 - that should be corrected. However, when I run your code without daylong, then I get the self detection at the correct time (albeit with a slightly suppressed correlation which is likely due to the precision of the internal correlation routines).

3) This is really annoying! Sorry - I will do some more digging into this. I have noticed that I am starting to see this in a few cases as well. I will update you with a fix.

Thanks again for your help with this!

calum-chamberlain commented 2 months ago

It looks like point 3 stems from the way obspy trims/slices data when the selected times do not align with a sample. There isn't an obvious way to handle this in obspy the way I want it handled, so I am working on a patch to EQcorrscan that selects the nearest sample before the selected start-time as the start of the cut, then trims to length from there. I'm working on this here: #582 - feel free to pull that branch and test with your data.
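The underlying behaviour can be seen with obspy alone (an illustrative sketch, not the EQcorrscan patch): slicing at a time that falls between samples snaps to a nearby sample, so the cut does not start exactly where requested.

import numpy as np
from obspy import Trace, UTCDateTime

# At 50 Hz (delta = 0.02 s) a requested start of 0.25 s falls between
# the samples at 0.24 s and 0.26 s, so the slice snaps to one of them.
tr = Trace(data=np.arange(100, dtype=np.float32))
tr.stats.sampling_rate = 50.0
tr.stats.starttime = UTCDateTime(0)
cut = tr.slice(tr.stats.starttime + 0.25, tr.stats.starttime + 0.75)
print(cut.stats.starttime)  # 0.24 or 0.26, not the requested 0.25
print(len(cut.data))        # may differ from the naively expected 26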

calum-chamberlain commented 2 months ago

I think #582 should fix your issue with lag-calc. It seems to work for me with your dataset and I have added a test to confirm that these part-sample trims work as expected. I will leave this open for a couple of weeks then close this issue if there isn't a response (unless I forget to close it!).