EIDA / wfcatalog

EIDA NG WFCatalog implementation

WFCatalog metadata dependency #23

Open jbienkowski opened 3 years ago

jbienkowski commented 3 years ago

Currently, WFCatalog does not depend on station metadata - it calculates metrics for acquired data even if some channels are not defined in StationXML. In those cases users can retrieve the metrics, but are not able to download the data itself via FDSNWS-Dataselect web service which strongly depends on metadata.

Possible solutions:

  1. Exclude channels that are not defined in StationXML from WFCatalog collector processing
  2. Calculate the metrics anyway, but apply filtering on the web service side
  3. Cross-check on metadata in the downstream product (apparently the original approach)
  4. Add a metadata query parameter with default value true in the WFCatalog implementation which would still allow retrieval of all available metrics

damb commented 3 years ago

@jbienkowski, StationXML metadata validation may optionally be enabled or disabled. So it is up to the user of the software whether their waveform catalog metadata is served or not. If the validation configuration changes, data needs to be reprocessed.

Question: Is reprocessing a crucial issue, performance-wise? The workflow could be something like:

Note that this change requires adjusting the collector's delete facilities, too.

Calculate the metrics anyway, but apply filtering on the web service side

I'm not sure if I would implement this approach. Imagine a request which queries the entire waveform metadata inventory. Then, filtering becomes costly. I'm aware that there are service level configuration parameters such as https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/service/configuration.json#L12-L14 and https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/service/configuration.json#L19 available.

damb commented 3 years ago

@jbienkowski, have you already seen this https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/collector/config.json#L20-L22 configuration option, which enables file-based filtering while collecting?

See also https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/collector/WFCatalogCollector.py#L475-L512 and https://github.com/EIDA/wfcatalog/blob/93f3d7bc4a1219f12da0bb4ae1ad45920a99fbcf/collector/WFCatalogCollector.py#L279-L299

Jollyfant commented 3 years ago

We originally decided not to restrict processing to the channels in the metadata because a lot of nodes wanted to have their full archive processed, and not just what is exposed through FDSNWS. I guess updating the white list is too much manual labor. It is probably better to add another option, load the FDSNWS response [net.sta.loc.cha] into a hashmap, and do a lookup to decide whether to skip a channel or not.
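A minimal sketch of the lookup idea, assuming the channel list comes from the pipe-separated text output of fdsnws-station at channel level. The function names and the sample rows are made up for illustration and are not part of the WFCatalog collector:

```python
def parse_channel_text(text):
    """Collect NET.STA.LOC.CHA identifiers from fdsnws-station text output."""
    known = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#'-prefixed header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) < 4:
            continue
        net, sta, loc, cha = (f.strip() for f in fields[:4])
        known.add(f"{net}.{sta}.{loc}.{cha}")
    return known


def should_skip(nslc, known_channels):
    """Return True if the channel is absent from the station metadata."""
    return nslc not in known_channels


# Invented sample rows; a real response would come from the node's fdsnws-station.
SAMPLE = """#Network|Station|Location|Channel|Latitude|Longitude|Elevation|Depth|Azimuth|Dip|SensorDescription|Scale|ScaleFreq|ScaleUnits|SampleRate|StartTime|EndTime
XX|AAA||HHZ|0.0|0.0|0.0|0.0|0.0|-90.0|sensor|1.0|1.0|M/S|100.0|2010-01-01T00:00:00|
XX|AAA|00|BHZ|0.0|0.0|0.0|0.0|0.0|-90.0|sensor|1.0|1.0|M/S|20.0|2010-01-01T00:00:00|
"""

known = parse_channel_text(SAMPLE)
print(should_skip("XX.AAA..HHE", known))
```

The set is built once per collector run, so each file costs only one constant-time membership test.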

jschaeff commented 3 years ago

I would be in favor of processing everything, and filtering the output. This is the same strategy as for data management.

This way, as soon as the metadata is available, wfcatalog can output all the information, and there is no need to go looking for all the data to index each time some metadata is submitted. Of course, it should be implemented without making an fdsnws-station call for every wfcatalog request.

damb commented 3 years ago

Of course, it should be implemented without making an fdsnws-station call for every wfcatalog request.

@jschaeff, I get your points. However, this approach implies:

jschaeff commented 3 years ago

I often run into the problem of not knowing when the metadata changes. We lack a datestamp in the StationXML format, which would be useful in a lot of use cases. Or there could be an RSS feed service provided by all EIDA nodes that publishes metadata changes. Or a websocket system. But this is a bit off topic, although it would help wfcatalog keep track of metadata changes.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache could be refreshed at an arbitrary frequency or manually.
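To make the idea concrete, here is a rough sketch of such a cache. `ChannelCache`, its `loader` callable and the injectable `clock` are hypothetical names, not existing wfcatalog code; the loader stands in for whatever fetches the channel list from fdsnws-station:

```python
import time


class ChannelCache:
    """Keep a set of known channels and reload it when it gets too old."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # callable returning an iterable of NSLC strings
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._channels = None
        self._loaded_at = None

    def refresh(self):
        """Manual refresh: reload the channel list immediately."""
        self._channels = frozenset(self._loader())
        self._loaded_at = self._clock()

    def channels(self):
        """Return the cached channels, reloading first if the cache is stale."""
        stale = (
            self._loaded_at is None
            or self._clock() - self._loaded_at >= self._max_age
        )
        if stale:
            self.refresh()
        return self._channels


# Usage with a stand-in loader; a real one would query the node's metadata.
cache = ChannelCache(lambda: ["XX.AAA..HHZ"], max_age_seconds=3600)
print("XX.AAA..HHZ" in cache.channels())
```

With this shape, a wfcatalog request only triggers a metadata fetch when the cache has expired, and an operator can still call `refresh()` by hand after submitting new metadata.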

damb commented 3 years ago

Basically, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, so the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.
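A sketch of what an epoch-aware lookup could look like. Instead of putting the dates into the dict key directly, this variant groups the epochs per NSLC, which makes a lookup by time simpler. The structure, the function name and the sample epochs are assumptions for illustration only, and the sketch does not address the inheritance question:

```python
from datetime import datetime


def is_restricted(epochs_by_nslc, nslc, when):
    """Return the restriction flag of the epoch covering `when`.

    Returns None if no epoch covers `when`, i.e. the channel is not
    defined in the metadata at that time.
    """
    for start, end, restricted in epochs_by_nslc.get(nslc, []):
        # An end of None marks an open (still running) epoch.
        if start <= when and (end is None or when < end):
            return restricted
    return None


# Invented sample: one closed, restricted epoch followed by an open one.
EPOCHS = {
    "XX.AAA..HHZ": [
        (datetime(2010, 1, 1), datetime(2015, 1, 1), True),
        (datetime(2015, 1, 1), None, False),
    ],
}

print(is_restricted(EPOCHS, "XX.AAA..HHZ", datetime(2012, 6, 1)))
```

Any change to an epoch's start or end date still invalidates the stored entries, so this structure would have to be rebuilt whenever the metadata is updated.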

We lack a datestamp in the StationXML format, which would be useful in a lot of use cases.

Versioning most probably requires more than just a simple time stamp.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache could be refreshed at an arbitrary frequency or manually.

OT: Interestingly, not caching StationXML metadata was a requirement when designing eidaws-federator. So, why should it be possible when implementing fdsnws-availability based on the eidaws-wfcatalog backend?

jschaeff commented 3 years ago

Basically, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, so the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.

Sorry, GitHub sent my comment when I hit some keyboard shortcut ... Yes, the restriction is valid for a time period. Good point.