WIP: Support more metadata for templates and each template trace

flixha commented 2 years ago

What does this PR do?

Implement first ideas / attempts for adding more metadata to templates, including trace-specific metadata.

Main points:

trace.stats.extra is populated with trace metadata during template generation / template reading
to read / write the trace metadata in a tribe tar-archive, the values are collected from the traces and stored in an AttribDict in event.extra.
trace metadata added for now: weight, peak_snr, rms_snr, starttime, endtime, length_npts (i.e., the actual starttime / endtime / npts of each trace BEFORE writing, to allow checks when reading)

Why was it initiated? Any relevant Issues?

We've been mentioning for a while that we'd like to save more metadata for each template, and also each trace in the templates. We didn't yet have a way of handling trace-specific metadata that do not fit into miniseed-supported header fields.

Saving metadata for each trace should allow some more tasks:

get ready to support weights for each trace in each template in matched_filter
easy fix for https://github.com/eqcorrscan/EQcorrscan/issues/497
get ready to extend functionality for some of the wishes listed here: https://github.com/eqcorrscan/EQcorrscan/issues/445, e.g.:
get ready to support multiple lag-calc picks on same trace-id
this could also simplify matching picks to their corresponding traces

Possible downsides:

saving trace metadata in the event works well with QuakeML files, but it would not work so easily if we'd like to support saving tribes with Nordic format event files. I don't know how it would be for sc3ml files yet.

@calum-chamberlain , @d-chambers: What do you think about saving trace metadata like this? I went through a few other ideas, but they all seemed to have bigger downsides.

Example:

from eqcorrscan.core.match_filter import Tribe
from obspy.clients.fdsn import Client

client = Client('NCEDC')
catalog = client.get_events(eventid='72572665', includearrivals=True)
catalog[0].picks = catalog[0].picks[0:5]
tribe = Tribe().construct(
    method='from_client', catalog=catalog, client_id='NCEDC', lowcut=2.0,
    highcut=8.0,  samp_rate=20.0, filt_order=4, length=6.0, prepick=0.1,
    swin='all', process_len=3600, all_horiz=True)

tribe.write('test_tribe.tgz')
tribe_in = Tribe().read('test_tribe.tgz')
tribe[0].st[0].stats.extra
tribe_in[0].st[0].stats.extra

PR Checklist

[X] develop base branch selected?
[ ] This PR is not directly related to an existing issue (which has no PR yet).
[X] All tests still pass.
[ ] Any new features or fixed regressions are be covered via new tests.
[ ] Any new or changed features have are fully documented.
[ ] Significant changes have been added to CHANGES.md.
[ ] First time contributors have added your name to CONTRIBUTORS.md.

flixha commented 2 years ago

tests fail due to Scipy 1.9.0; I was hoping that scipy>=0.18,<1.9.0 would fix it for now but doesn't seem to work...

calum-chamberlain commented 2 years ago

event.extra seems like a good place for this - but if you are concerned about supporting other formats for event files then would a separate file that contains the extra Template-specific metadata be worthwhile. We already write a csv with some barebones info with Tribe._par_write - we could change this to writing a json file that has more meta-info and would allow for expansion in the future. I imagine that you might have thought of this and the potential downsides already?

@cjhopp might have opinions on what he would like included in template metadata as well.

flixha commented 2 years ago

Well, I have to admit that I didn't consider Tribe._par_write as much; and indeed from what I can think of now it should work just as well, especially if we write it to a json.

I guess I was focusing more on where to keep the trace-stats - and came to the conclusion that they have to be added to the traces themselves because the two other options I saw, [1: referencing by resource ID (not thread safe), 2: referencing by waveform ID or ID + pick (not unique / cumbersome)] seemed less optimal.

Even though it works ok now in the QuakeML file, one thing I don't like perfectly there is that the list of parameters for each trace is read in as a string, and then needs to be evaluated to a list again: https://github.com/eqcorrscan/EQcorrscan/blob/ee8a94c4b9ed4bbc7dc28c3c7fca8f08c33487f4/eqcorrscan/core/match_filter/tribe.py#L444-L446

So any opinions on the choice:

write to QuakeML?
write to csv?
write to new json file, instead of the csv?

calum-chamberlain commented 2 years ago

I'm fairly keen to write to a json, because I think it will allow us more expansion later, without having to do strange string mangling out of extras. Internally though, I think you are right to have the metadata associated with the trace itself, that definitely makes sense. It may make sense to have a phase-hint associated with each trace at some point as well when we end up having multiple traces from a single channel?

flixha commented 2 years ago

Thanks for the feedback; i'll move the trace-metadata to a json together with the previously existing template metadata. Will need to wait until late next week when I'm back from the field though

flixha commented 1 year ago

Just a note that this PR is not forgotten; I just haven't gotten around to writing it such that the metadata are written to a json file. Note to self:

the extended metadata-file should include more information on the trace processing, either the whole processing history from tr.stats.processing, or a user-provided dict with any processing steps that are not covered by the current processing-metadata.
in particular, it's important to save whether a trace had its response removed in case the traces are used for relative amplitude / magnitude calculation

calum-chamberlain commented 1 month ago

Hey @flixha - I'm keen to pick this back up because I have a need to get weights going for a project. Do you have any more you want to do to this, or would you like me to pick this up and run with it?

eqcorrscan / EQcorrscan