Cache /fdsnws/station requests

damb commented 4 years ago

Features and Changes:

Use Apache2's mod_cache_disk in order to cache /fdsnws/station HTTP GET requests
Cache for up to 24 hours (12 hours + 12 hours)
Cache control headers are configured by the eida-federator WSGI application
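
For illustration, a minimal sketch of how a WSGI application could attach such a header. This is not the actual eida-federator code; the name add_cache_control and the max_age parameter are hypothetical, and the value 43200 is simply the max-age visible in the responses further down this thread.

```python
def add_cache_control(app, max_age=43200):
    """Wrap a WSGI app so that successful GET responses carry Cache-Control.

    Hypothetical helper for illustration only; 43200 s = 12 hours.
    """
    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            is_get = environ.get("REQUEST_METHOD") == "GET"
            if is_get and status.startswith("200"):
                # Drop any existing Cache-Control header, then set ours.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "cache-control"]
                headers.append(
                    ("Cache-Control", f"public, max-age={max_age}"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

Only HTTP 200 responses to GET requests are marked as cacheable here, so error responses and POST requests pass through unchanged.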

Disadvantages of this approach:

Works for HTTP GET requests only, i.e. HTTP POST requests are not cached at all.
Proxies may force circumventing the cache

The FDSNWS specs 1.2 allow for query parameters to be unordered. Clients requesting data once with e.g.

curl -v "http://localhost:8080/fdsnws/station/1/query?net=CH&sta=*&level=station&format=xml&start=2019-01-01"

and

curl -v "http://localhost:8080/fdsnws/station/1/query?net=CH&sta=*&level=station&start=2019-01-01&format=xml"

or even

curl -v "http://localhost:8080/fdsnws/station/1/query?net=CH,GR&sta=*&level=station&start=2019-01-01&format=xml"

and

curl -v "http://localhost:8080/fdsnws/station/1/query?net=GR,CH&sta=*&level=station&start=2019-01-01&format=xml"

can force cache misses, since the cache treats each distinct URL as a separate entry. Also, aliases are not taken into consideration (e.g. queries using net and network are treated as different requests).

The approach does not implement a distributed cache that can be shared by several Apache2 instances. Apache2 does, however, provide mod_cache_socache for distributed caching, e.g. with a Memcached backend.

As a consequence of the disadvantages listed above, the application should IMO handle the cache internally. Note that the current Docker production setup already ships with a Redis server, which could be used for this purpose.
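
To make the idea concrete, here is a rough sketch of how an application-level cache could derive one key for all equivalent queries, addressing the parameter-order, value-order and alias issues listed above. It is an assumption-laden illustration, not existing eida-federator code: cache_key, ALIASES and LIST_PARAMS are hypothetical names, and the alias table covers only a subset of the FDSNWS parameters.

```python
import hashlib
from urllib.parse import parse_qsl

# Illustrative subset of FDSNWS short/long parameter aliases.
ALIASES = {
    "net": "network", "sta": "station", "loc": "location",
    "cha": "channel", "start": "starttime", "end": "endtime",
}
# Parameters holding comma-separated lists whose order is irrelevant.
LIST_PARAMS = {"network", "station", "location", "channel"}


def cache_key(path, query_string):
    """Return a key that is identical for equivalent station queries."""
    params = {}
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        # Map aliases (e.g. net -> network) to one canonical name.
        name = ALIASES.get(name.lower(), name.lower())
        if name in LIST_PARAMS:
            # CH,GR and GR,CH select the same networks.
            value = ",".join(sorted(set(value.split(","))))
        params[name] = value
    # Sorting the parameters removes any dependency on their order.
    canonical = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    digest = hashlib.sha256(f"{path}?{canonical}".encode("utf-8"))
    return digest.hexdigest()
```

With such a key, the four curl requests shown above would map to two cache entries (one for net=CH, one for net=CH,GR) instead of four. The same key could be computed from the body of a POST request, which the mod_cache approach cannot cover.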

damb commented 4 years ago

CC @kaestli

damb commented 4 years ago

@kaestli, I deployed the feature at mediator-devel.ethz.ch. If you'd like you can give it a try.

damb commented 4 years ago

References: #50

kaestli commented 4 years ago

observation:

nonix:~$ date; curl 'http://mediator-devel.ethz.ch/fdsnws/station/1/query?network=*&station=*&location=*&channel=HHZ,HHE&start=2019-03-01&end=2019-03-03&level=response&format=xml' > /tmp/bla.xml; date
Thu Dec 12 14:07:48 CET 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 58.9M    0 58.9M    0     0   177k      0 --:--:--  0:05:40 --:--:-- 1984k
Thu Dec 12 14:13:28 CET 2019
nonix:~$ date; curl 'http://mediator-devel.ethz.ch/fdsnws/station/1/query?network=*&station=*&location=*&channel=HHZ,HHE&start=2019-03-01&end=2019-03-03&level=response&format=xml' > /tmp/bla.xml; date
Thu Dec 12 14:14:50 CET 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 58.9M    0 58.9M    0     0   487k      0 --:--:--  0:02:03 --:--:-- 2824k
Thu Dec 12 14:16:54 CET 2019
nonix:

From the response times of an identically repeated request, I guess that backend caching is working but the frontend cache is not.

Note: special care is required to avoid multiple cache versions (in the frontend cache) for different sets of request headers; ask @cbonjour for details. In this case, I would recommend disregarding even Accept-Encoding and returning all data uncompressed (no mod_deflate), as station information is small, wfcatalog is rare, and dataselect is precompressed.

damb commented 4 years ago

Hmm. I restarted the frontend Apache on mediator-devel.ethz.ch and now it's working again. This is weird. Apparently, the configuration is not stable yet.

First request:

$ time curl -v -o /dev/null 'http://mediator-devel.ethz.ch/fdsnws/station/1/query?network=*&station=*&location=*&channel=HHZ,HHE&start=2019-03-01&end=2019-03-03&level=station&format=text'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 129.132.144.211...
* TCP_NODELAY set
* Connected to mediator-devel.ethz.ch (129.132.144.211) port 80 (#0)
> GET /fdsnws/station/1/query?network=*&station=*&location=*&channel=HHZ,HHE&start=2019-03-01&end=2019-03-03&level=station&format=text HTTP/1.1
> Host: mediator-devel.ethz.ch
> User-Agent: curl/7.58.0
> Accept: */*
> 
  0     0    0     0    0     0      0      0 --:--:--  0:00:24 --:--:--     0< HTTP/1.1 200 OK
< Date: Thu, 12 Dec 2019 15:33:17 GMT
< Server: Apache/2.4.18 (Ubuntu)
< Cache-Control: public, max-age=43200
< Access-Control-Allow-Origin: *
< Vary: Accept-Encoding
< X-Cache: MISS from localhost
< X-Cache-Detail: "cache miss: attempting entity save" from localhost
< Transfer-Encoding: chunked
< Content-Type: text/plain; charset=utf-8
< 
{ [342 bytes data]
100  125k    0  125k    0     0   2393      0 --:--:--  0:00:53 --:--:--  5032
* Connection #0 to host mediator-devel.ethz.ch left intact

real    0m53.650s
user    0m0.044s
sys     0m0.043s

Second request:

$ time curl -v -o /dev/null 'http://mediator-devel.ethz.ch/fdsnws/station/1/query?network=*&station=*&location=*&channel=HHZ,HHE&start=2019-03-01&end=2019-03-03&level=station&format=text'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 129.132.144.211...
* TCP_NODELAY set
* Connected to mediator-devel.ethz.ch (129.132.144.211) port 80 (#0)
> GET /fdsnws/station/1/query?network=*&station=*&location=*&channel=HHZ,HHE&start=2019-03-01&end=2019-03-03&level=station&format=text HTTP/1.1
> Host: mediator-devel.ethz.ch
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Thu, 12 Dec 2019 15:34:51 GMT
< Server: Apache/2.4.18 (Ubuntu)
< Vary: Accept-Encoding
< Cache-Control: public, max-age=43200
< Access-Control-Allow-Origin: *
< Age: 93
< X-Cache: HIT from localhost
< X-Cache-Detail: "cache hit" from localhost
< Content-Length: 128332
< Content-Type: text/plain; charset=utf-8
< 
{ [14152 bytes data]
100  125k  100  125k    0     0  9640k      0 --:--:-- --:--:-- --:--:-- 9640k
* Connection #0 to host mediator-devel.ethz.ch left intact

real    0m0.040s
user    0m0.015s
sys     0m0.018s

However, due to the disadvantages mentioned above I still favor a distributed cache handled by the WSGI application itself. @cbonjour shares the same view after discussing the issue.

kaestli commented 4 years ago

> However, due to the disadvantages mentioned above I still favor a distributed cache handled by the WSGI application itself. @cbonjour shares the same view after discussing the issue.

i disagree on this. we can discuss tomorrow...

damb commented 4 years ago

eida-federator is implemented such that endpoint requests to DCs are no longer executed once a client terminates the connection while the response is being streamed. This leads to an interesting behaviour when trying to cache by means of Apache2's mod_cache.

Assuming a client issues the request:

$ curl -v -o - "http://mediator-devel.ethz.ch/fdsnws/station/1/query?net=CH,GR,AW&format=xml"

but terminates the connection right after net=GR has been served (the <Network></Network> elements for net=CH and net=AW are still missing). The headers (HTTP status 200) have already been sent, since the service is able to serve a valid response; the content, however, has not been served completely yet. mod_cache is not aware of this. When the same request is executed a second time, it results in a cache hit, and the truncated data served during the first attempt is returned again. In the case of format=xml, the cached content therefore does not conform to StationXML 1.0.
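
A cache handled inside the application could avoid this by committing an entry only after the response has been generated completely. The following is a minimal sketch under stated assumptions: cache_when_complete is a hypothetical name, and store stands for any dict-like backend (for example a thin wrapper around Redis), not an existing eida-federator interface.

```python
def cache_when_complete(chunks, store, key):
    """Stream chunks to the client; cache them only if the stream finished.

    `chunks` is an iterable of bytes, `store` any dict-like object.
    """
    buffered = []
    for chunk in chunks:
        buffered.append(chunk)
        yield chunk
    # Reached only when the upstream iterable is exhausted. If the client
    # disconnects, the server closes this generator, GeneratorExit is
    # raised at the yield above, and nothing is written to the cache.
    store[key] = b"".join(buffered)
```

Note that this buffers the whole response in memory, which may be a concern for large level=response queries; writing to a temporary file and renaming it on completion would be one alternative.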

damb commented 4 years ago

Closed due to the unpredictable behaviour mentioned before.

EIDA / mediatorws

Cache /fdsnws/station requests #92