Improving federated /fdsnws/station performance

@javiquinte,

I'd like to raise once more the question of which strategy should be implemented to increase the overall performance of /fdsnws/station/1/query?format=xml&level=channel|response queries at eida-federator. Let me first describe the strategy that is currently implemented:

In order to provide explicit routing, eida-federator harvests the /eidaws/routing/1/localconfig routing configurations. The routes defined there are then resolved by means of /fdsnws/station/1/query?format=xml&level=channel. This information is used exclusively for explicit routing purposes. It is therefore not used to serve client requests for station metadata sent to eida-federator. Instead, this (cached) routing information is used to
- resolve wildcard routes, in order to then
- send granular requests to the /fdsnws/station webservices of the EIDA DCs.
Both to improve eida-federator's performance and to prevent the corresponding /fdsnws/station webservices of the EIDA DCs from being overloaded, the current version uses an HTTP caching reverse proxy approach (this backend cache is currently implemented by means of Apache2 + mod_cache_disk). Note also that this requires fdsnws-station metadata to be requested from the EIDA DCs by means of the HTTP GET method.
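
The granular-request step described above could be sketched as follows. This is only an illustration, not the actual eida-federator code: the function name `granular_urls`, the tuple layout of a resolved route and the endpoint URL are assumptions made for the example. It also assumes the backend cache keys on the full request URL, which is why the query parameters are emitted in a fixed order.

```python
from urllib.parse import urlencode


def granular_urls(routes, level="response"):
    """Yield one fdsnws-station GET URL per resolved channel epoch.

    `routes` is a hypothetical iterable of tuples
    (net, sta, loc, cha, start, end, endpoint) as they might come out of
    the cached routing information after wildcard resolution.
    """
    for net, sta, loc, cha, start, end, endpoint in routes:
        params = {
            "net": net,
            "sta": sta,
            "loc": loc or "--",  # FDSN convention for an empty location code
            "cha": cha,
            "start": start,
            "end": end,
            "format": "xml",
            "level": level,
        }
        # Sort the parameters so that identical requests always produce the
        # same URL and can therefore hit the same cache entry.
        yield endpoint + "?" + urlencode(sorted(params.items()))


# Example with a made-up endpoint and a single resolved channel epoch.
example_routes = [
    ("CH", "HASLI", "", "HHZ",
     "2020-01-01T00:00:00", "2020-01-02T00:00:00",
     "http://eida.example.org/fdsnws/station/1/query"),
]
example_urls = list(granular_urls(example_routes))
```

Each of these URLs would be requested through the caching reverse proxy, so only URLs not yet present in the backend cache reach the EIDA DC.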

As a consequence, fdsnws-station metadata is only requested from EIDA DCs if the granular request could not be served from the backend cache. @kaestli mentioned the impact of this approach on logging during the EIDA meeting in De Bilt (2020-Q1).

So the question arises: why does eida-federator not use the station metadata it already gathers while harvesting its routing information? Instead of using level=channel, one could simply request station metadata with level=response right away and demultiplex the StationXML data in order to fill a cache for fdsnws-station requests. That would be right if it weren't for the following issues:

Ideally, cached data should be accessible in constant time, i.e. O(1). How should the data be stored to achieve O(1) access? Something like a hashmap immediately comes to mind. But which attributes should be used to compute the keys if requests with the query filter parameters specified by the FDSNWS Specification 1.2 must be supported? It should be clear that constant-time cache access is very difficult to achieve, if feasible at all. Instead, one could store the inventory data in a DB. If indices are set properly, the data is accessible in O(log N) (neglecting DB caching).
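
To illustrate the key problem, here is a minimal sketch with made-up data. The cache layout, the stream identifiers and the function names are assumptions for the example only. A hashmap keyed by the full stream identifier gives O(1) access for exact requests, but as soon as a request contains wildcards there is no single key to hash, and the lookup degrades to a scan over all entries.

```python
from fnmatch import fnmatchcase

# Hypothetical cache: (net, sta, loc, cha) -> StationXML fragment (placeholder strings).
cache = {
    ("CH", "HASLI", "", "HHZ"): "<Channel code='HHZ'/>",
    ("CH", "HASLI", "", "HHN"): "<Channel code='HHN'/>",
    ("CH", "DAVOX", "", "HHZ"): "<Channel code='HHZ'/>",
    ("GR", "BFO", "", "BHZ"): "<Channel code='BHZ'/>",
}


def lookup_exact(net, sta, loc, cha):
    """O(1): the request maps to exactly one hash key."""
    return cache.get((net, sta, loc, cha))


def lookup_wildcard(net, sta, loc, cha):
    """O(N): every cached key has to be tested against the pattern."""
    pattern = (net, sta, loc, cha)
    return [
        fragment
        for key, fragment in cache.items()
        if all(fnmatchcase(field, pat) for field, pat in zip(key, pattern))
    ]
```

Time-window filters and the other optional filter parameters make this worse, because they cannot be turned into a hash key either.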
Caching StationXML metadata in a DB requires eida-federator to a) fully parse the StationXML metadata (which could be done by means of e.g. obspy, since filling the cache is not time critical) and b) maintain the complete EIDA inventory. However, eida-federator was designed and developed to provide a gateway for fdsnws and eidaws webservices. Maintaining a centralized inventory at eida-federator level conflicts with the fact that every single EIDA DC is responsible for its own inventory.
Also, (optional) query filter parameters such as matchtimeseries would add further complexity when maintaining a central inventory without the corresponding miniSEED data being available.
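
For the sake of discussion, the DB variant could look roughly like the following sketch. It uses SQLite only because it ships with Python. The table name, the column layout and the `query` helper are assumptions for the example, not an existing eida-federator schema. The composite index is what would give the O(log N) access mentioned above for requests that constrain network and station.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE channel_epoch (
    network    TEXT NOT NULL,
    station    TEXT NOT NULL,
    location   TEXT NOT NULL,
    channel    TEXT NOT NULL,
    starttime  TEXT NOT NULL,   -- ISO 8601, compares correctly as text
    endtime    TEXT,            -- NULL for an open epoch
    stationxml TEXT NOT NULL    -- demultiplexed StationXML fragment
);
CREATE INDEX idx_stream
    ON channel_epoch (network, station, location, channel);
""")

# Made-up rows standing in for parsed, demultiplexed StationXML.
conn.executemany(
    "INSERT INTO channel_epoch VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("CH", "HASLI", "", "HHZ", "2010-01-01T00:00:00", None,
         "<Channel code='HHZ'/>"),
        ("CH", "HASLI", "", "HHN", "2010-01-01T00:00:00",
         "2015-01-01T00:00:00", "<Channel code='HHN'/>"),
        ("GR", "BFO", "", "BHZ", "2000-01-01T00:00:00", None,
         "<Channel code='BHZ'/>"),
    ],
)


def query(conn, net, sta, start, end):
    """Return (channel, fragment) rows whose epoch overlaps [start, end]."""
    return conn.execute(
        "SELECT channel, stationxml FROM channel_epoch "
        "WHERE network = ? AND station = ? "
        "AND starttime < ? AND (endtime IS NULL OR endtime > ?) "
        "ORDER BY channel",
        (net, sta, end, start),
    ).fetchall()
```

Note that nothing in such a table can answer matchtimeseries, since that depends on the waveform data held at the individual DCs.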

From my point of view, maintaining a central inventory at eida-federator level is very difficult to achieve and contradicts the original design goals. However, I'm open to discussion. If there is a simple way around these issues, I'm happy to implement the suggestions.

Thanks in advance for any kind of comments.
