WFCatalog metadata dependency

jbienkowski commented 3 years ago

Currently, WFCatalog does not depend on station metadata: it calculates metrics for acquired data even if some channels are not defined in StationXML. In those cases, users can retrieve the metrics but cannot download the data itself via the FDSNWS-Dataselect web service, which strongly depends on metadata.

Possible solutions:

Exclude channels that are not defined in StationXML from WFCatalog collector processing
Calculate the metrics anyways, but apply filtering on the web service side
Cross-check on metadata in the downstream product (apparently the original approach)
Add a metadata query parameter with default value true in the WFCatalog implementation which would still allow retrieval of all available metrics

damb commented 3 years ago

@jbienkowski, StationXML metadata validation can optionally be enabled or disabled. So it depends on the users of the software whether their waveform catalog metadata is served or not. If the validation configuration changes, data needs to be reprocessed.

Question: Is reprocessing a crucial issue, performance-wise? The workflow could be something like:

check if waveform metadata is available
if not, then generate the waveform metadata
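A minimal sketch of that two-step workflow, assuming a hypothetical in-memory `store` keyed by file name and a hypothetical `compute_metrics` callable. Neither name comes from the WFCatalog collector; they only illustrate the check-then-generate idea:

```python
from typing import Callable, Dict


def ensure_waveform_metadata(
    filename: str,
    store: Dict[str, dict],
    compute_metrics: Callable[[str], dict],
) -> dict:
    """Return stored waveform metadata, generating it only when missing."""
    # Step 1: check if waveform metadata is available.
    if filename in store:
        return store[filename]
    # Step 2: if not, generate the waveform metadata and keep it.
    document = compute_metrics(filename)
    store[filename] = document
    return document
```

With this shape, a change of the validation configuration only triggers processing for files that have no stored document yet.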

Note that this change requires adjusting the collector's delete facilities, too.

Calculate the metrics anyways, but apply filtering on the web service side

I'm not sure if I would implement this approach. Imagine a request which queries the entire waveform metadata inventory. Then, filtering becomes costly. I'm aware that there are service level configuration parameters such as https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/service/configuration.json#L12-L14 and https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/service/configuration.json#L19 available.

damb commented 3 years ago

@jbienkowski, have you already seen this https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/collector/config.json#L20-L22 configuration option, which enables file-based filtering while collecting?

Jollyfant commented 3 years ago

We originally decided not to limit processing to just the channels in the metadata because a lot of nodes wanted to have their full archive processed, and not just what is exposed through FDSNWS. I guess updating the white list is too much manual labor. It is probably better to add another option that loads the FDSNWS response [net.sta.loc.cha] into a hashmap and does a lookup on whether to skip a channel or not.
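A rough sketch of such a lookup, assuming the channel list arrives in the pipe-separated text layout that fdsnws-station returns for `level=channel&format=text`. The sample rows, `parse_channels` and `should_skip` are made up for illustration and are not part of the collector:

```python
from typing import Set, Tuple

NSLC = Tuple[str, str, str, str]

# Made-up sample rows in the fdsnws-station channel-level text layout.
SAMPLE = """#Network | Station | Location | Channel | Latitude | Longitude
XX|AAA||HHZ|50.0|5.0
XX|AAA|00|BHZ|50.0|5.0
"""


def parse_channels(text: str) -> Set[NSLC]:
    """Collect (net, sta, loc, cha) tuples from a channel-level text response."""
    channels: Set[NSLC] = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header line.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        net, sta, loc, cha = fields[:4]
        channels.add((net, sta, loc, cha))
    return channels


def should_skip(nslc: NSLC, known: Set[NSLC]) -> bool:
    """Skip a channel while collecting when it is not defined in the metadata."""
    return nslc not in known
```

The set is built once per collector run, so the per-file decision is a constant-time membership test rather than a web service call.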

jschaeff commented 3 years ago

I would be in favor of processing everything, and filtering the output. This is the same strategy as for data management.

This way, as soon as the metadata is available, wfcatalog can spit all the information out and there is no need to start looking for all the data to index each time some metadata is submitted. Of course, it should be implemented without making a fdsnws-station call for every wfcatalog request.

damb commented 3 years ago

Of course, it should be implemented without making a fdsnws-station call for every wfcatalog request.

@jschaeff, I get your points. However, this approach implies:

Modification of the currently used DB schema, i.e. introducing a field to store the restrictedStatus StationXML attribute. As a consequence, this property is stored redundantly.
Keeping track of changes of the introduced restrictedStatus property. Note that this property might change later, basically at any point in time.

jschaeff commented 3 years ago

I often hit the wall of not knowing when the metadata changes. We miss a datestamp on the StationXML format, which would be useful in a lot of use cases. Or there could be an RSS feed service provided by all EIDA nodes that publishes metadata changes. Or a websocket system. But this is a bit off topic, although it would help wfcatalog keep track of metadata changes.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache can be refreshed at an arbitrary frequency or manually.
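To make the idea concrete, here is a small sketch of a per-network cache with a maximum age and a manual refresh. `MetadataCache`, the `fetch` callable and the injectable `clock` are all hypothetical; nothing like this exists in wfcatalog as far as this thread shows:

```python
import time
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Per-network cache that refetches entries older than max_age_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        # network code -> (time of last fetch, cached payload)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, network: str) -> str:
        """Return cached metadata, refetching when missing or too old."""
        entry = self._entries.get(network)
        if entry is None or self._clock() - entry[0] >= self._max_age:
            return self.refresh(network)
        return entry[1]

    def refresh(self, network: str) -> str:
        """Manual refresh: always refetch and replace the cached entry."""
        payload = self._fetch(network)
        self._entries[network] = (self._clock(), payload)
        return payload
```

This keeps fdsnws-station calls out of the request path, but it does not solve the problem of knowing when the metadata actually changed; stale entries live until the next refresh.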

damb commented 3 years ago

Basically, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, so the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.
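For illustration, a sketch of what an epoch-aware key could look like. The `EpochKey` layout and `restricted_status_at` are assumptions made for this example, the entries are invented, and the half-open epoch interval is a choice of the sketch. It deliberately ignores the inheritance question and later epoch changes:

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

NSLC = Tuple[str, str, str, str]
# (net, sta, loc, cha, startDate, endDate); endDate is None for an open epoch.
EpochKey = Tuple[str, str, str, str, datetime, Optional[datetime]]


def restricted_status_at(
    table: Dict[EpochKey, str], nslc: NSLC, when: datetime
) -> Optional[str]:
    """Return the restrictedStatus of the epoch covering `when`, if any."""
    for (net, sta, loc, cha, start, end), status in table.items():
        if (net, sta, loc, cha) != nslc:
            continue
        # Epochs are treated as half-open intervals [start, end).
        if start <= when and (end is None or when < end):
            return status
    return None


# Invented example: one channel that was closed until 2015 and open afterwards.
TABLE: Dict[EpochKey, str] = {
    ("XX", "AAA", "", "HHZ", datetime(2010, 1, 1), datetime(2015, 1, 1)): "closed",
    ("XX", "AAA", "", "HHZ", datetime(2015, 1, 1), None): "open",
}
```

A plain NSLC:boolean dictionary would collapse both entries above into one key, which is why the epoch has to be part of it.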

We miss a datestamp on the StationXML format, which would be useful in a lot of use cases.

Versioning most probably requires more than just a simple time stamp.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache can be refreshed at an arbitrary frequency or manually.

OT: Interestingly, not caching StationXML metadata was a requirement when designing eidaws-federator. So, why should it be possible when implementing fdsnws-availability based on the eidaws-wfcatalog backend?

jschaeff commented 3 years ago

Basically, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, so the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.

Sorry, GitHub sent my comment because of some keyboard shortcut I hit... Yes, the restriction is valid for a time period. Good point.

EIDA / wfcatalog

WFCatalog metadata dependency #23