Maybe I'm not understanding the problem. In the query you specified "level=channel", so you requested it down to the stream level, didn't you?
Hi @Jollyfant,
The approach we have currently chosen allows fully granular logging, i.e. transparency and traceability. AFAIK the SC3 fdsnws-station implementation (Twisted) should be able to handle multiple concurrent requests easily. However, I don't know how this scales with queries at the DB (inventory) level. Maybe SC3 experts can provide input here.
cheers
Yeah, but it looks like my endpoint is hit once per channel, while you could probably batch the request by querying per station, or even per network.
Proposal on bulk requests for fdsnws-station metadata requests: group requests by network code, such that we are able to bulk request station metadata from endpoints with network code granularity. At the endpoints this approach should dramatically reduce the number of concurrent station metadata requests arriving. For level=channel|response, explicit stream epochs will be sent to the endpoints. Distributed physical networks still make use of granular requests.
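A minimal sketch of the intended grouping, assuming the stream epochs are already available as plain (net, sta, loc, cha, start, end) tuples; the helper name is made up and this is not the actual eida-federator code:

from collections import defaultdict

def build_bulk_bodies(level, stream_epochs):
    """Group stream epochs by network code and build one fdsnws-station
    POST body per network, so that an endpoint receives a single bulk
    request per network instead of one request per channel epoch."""
    groups = defaultdict(list)
    for se in stream_epochs:
        groups[se[0]].append(se)
    bodies = {}
    for net, epochs in groups.items():
        lines = ["level={}".format(level)]
        # for level=channel|response the explicit stream epochs are sent
        lines += ["{} {} {} {} {} {}".format(*se) for se in epochs]
        bodies[net] = "\n".join(lines) + "\n"
    return bodies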
On the channel or station level? Channels may be distributed but in practice are never.
On the channel or station level?
For all levels, i.e. level=network|station|channel|response. Since distributed physical networks are in general sparse, IMO this approach should be feasible. Also, this approach guarantees that we will be able to properly merge StationXML one day (at least as soon as EIDA defines rules on how to synchronize fdsnws-station response output formats).
Channels may be distributed but in practice are never.
Still EIDA does not define standards regarding the practice. As long as there are no standards I have to expect and properly handle this case. I opt for the most general approach.
BTW: Currently this proceeding only affects fdsnws-station-xml since for fdsnws-station-text we do not implement synchronization facilities at the time being. AFAIK merging fdsnws-station-text will be only relevant for level=network.
Still EIDA does not define standards regarding the practice. As long as there are no standards I have to expect and properly handle this case. I opt for the most general approach.
I know and have to agree with you. But we should probably define this more clearly since it makes absolutely no sense to have channels that come from the same instrument at other data centers.
BTW: Currently this proceeding only affects fdsnws-station-xml since for fdsnws-station-text we do not implement synchronization facilities at the time being. AFAIK merging fdsnws-station-text will be only relevant for level=network.
I don't fully understand this yet but maybe I will after you implement this.
Hi @javiquinte,
with 4ca57e74f6aa9af73012260c57122e7177c4af72 I implemented the proposal from above, i.e. sending station metadata bulk requests to endpoints with network code granularity. However, the fdsnws-station service at GFZ, for example, does not cope with such bulk requests. Using this bulk approach always seems to lead to either

504 Server Error: Gateway Time-out for url: http://geofon.gfz-potsdam.de/fdsnws/station/1/query

or even HTTP status code 502 (Bad Gateway).
Imagine a request as follows:
fdsnws/station/1/query?net=ZS&level=channel&minlat=45&maxlat=47&minlon=10&maxlon=14
This will match almost every station from network net=ZS. From the eida-federator perspective this leads to the following endpoint request (see attachment). Unfortunately, wildcard-based routing is not an option, since not all stations from net=ZS match the geographic rectangle I've chosen.
Attachments: ZS.txt
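Just to make the shape of such an endpoint request concrete (the station codes and epochs below are made up; the real body is in the attached ZS.txt), the bulk POST could be issued roughly like this:

import requests

# hypothetical stream epochs; the real request contains one line
# per matching ZS channel epoch (see ZS.txt)
body = "\n".join([
    "level=channel",
    "ZS A001A * HHZ 2015-01-01T00:00:00 2016-01-01T00:00:00",
    "ZS A002A * HHZ 2015-01-01T00:00:00 2016-01-01T00:00:00",
])

r = requests.post("http://geofon.gfz-potsdam.de/fdsnws/station/1/query",
                  data=body, timeout=60)
r.raise_for_status()
print(r.text[:200])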
Still EIDA does not define standards regarding the practice. As long as there are no standards I have to expect and properly handle this case. I opt for the most general approach.
I know and have to agree with you. But we should probably define this more clearly since it makes absolutely no sense to have channels that come from the same instrument at other data centers.
Well, until not so long ago we had the case of ETH and ODC (what a coincidence!) exposing BHZ channels from some stations in one data centre and the other two components in the other one.
Well, until not so long ago we had the case of ETH and ODC (what a coincidence!) exposing BHZ channels from some stations in one data centre and the other two components in the other one.
So let's be clever boys and agree not to do things like this... unless someone can come up with a real argument as to why this is required.
The ETH/ODC example is a legacy issue. Actually, ETH collected and archived HH? channels from the Swiss networks and did no decimation, so there were no alternative sampling rates. At the time, the HH data was exchanged with ODC, and as ODC did not archive the highest sample rates then, they decimated to BH? channels and archived these only. Thus we ended up with different sampling rates of the same data being authoritative at different data centres. This has since been cleaned up - the VEBSN is gone, and ETH now builds and archives HH?, BH? and LH?.
Unfortunately, though, we still do not have agreed sets of channels across the EIDA archives. I can imagine if other networks join EIDA, these issues could arise again in differences between ODC and these archives, unless ODC agrees immediately to let the other network become authoritative over all time.
Using this bulk approach always seems to lead to either 504 Server Error: Gateway Time-out or even HTTP status code 502 (Bad Gateway).
WS is just slow. Error 504 (Gateway Time-out) means that WS did not answer within 60 s, which is the default proxy timeout. When testing with a variable number of request lines, it can be seen that it works as long as the response is delivered within those 60 seconds. The timeout can be increased, but not indefinitely... This is an easy way to do a DOS attack :(
Error 502 (Bad Gateway) typically means that there is no free connection slot at the moment (we allow 2 connections per IP and 20 connections total). Client should pause and try again.
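From the client side, a pause-and-retry along these lines would be one way to honour those limits; the backoff values are arbitrary and this is only a sketch:

import time
import requests

def post_with_retry(url, body, max_retries=5, backoff=10):
    """POST a bulk request, pausing and retrying while the endpoint
    answers with 502 (no free connection slot)."""
    for attempt in range(max_retries):
        r = requests.post(url, data=body, timeout=120)
        if r.status_code == 502:
            # all slots busy: back off and try again
            time.sleep(backoff * (attempt + 1))
            continue
        r.raise_for_status()
        return r.text
    raise RuntimeError("no free connection slot after %d retries" % max_retries)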
Hi @andres-h,
thanks for your comments.
WS is just slow. Error 504 (Gateway Time-out) means that WS did not answer within 60 s, which is the default proxy timeout. When testing with a variable number of request lines, it can be seen that it works as long as the response is delivered within those 60 seconds. The timeout can be increased, but not indefinitely... This is an easy way to do a DOS attack :(
Is it a WS issue or rather a backend (SC3 backend) issue? Twisted should be able to handle hundreds of requests easily.
How exactly are you accessing the data in the service? I suppose you use the Datamodel from the SC3 core (the Python interface was generated by means of SWIG). How often are you actually accessing the DB when processing an incoming stream epoch?
Error 502 (Bad Gateway) typically means that there is no free connection slot at the moment (we allow 2 connections per IP and 20 connections total). Client should pause and try again.
@andres-h, do you think this is appropriate if a highly performant system is required? Think about the client's perspective using eida-federator. We discussed this issue with @javiquinte. IMO we should find a solution for this issue.
Twisted should be able to handle hundreds of requests easily.
Yes, if you just send data and don't do anything CPU-intensive to generate the data in the first place...
How exactly are you accessing the data in the service? I suppose you use the Datamodel from the SC3 core (the Python interface was generated by means of SWIG). How often are you actually accessing the DB when processing an incoming stream epoch?
DB is not accessed at all. The code iterates over Datamodel inventory (in RAM) and tries to match any elements against request lines. Matched elements are copied to new inventory, which is converted to FDSNXML and sent to client at the end.
Since there is only one network in the request, other networks are immediately skipped and not inspected, but it is still slow. In principle some further optimizations are possible.
@andres-h, do you think this is appropriate if a highly performant system is required? Think about the client's perspective using eida-federator. We discussed this issue with @javiquinte. IMO we should find a solution for this issue.
I don't see a real solution. We can increase the limits, but not infinitely.
I think it is better to wait for a slot and then download data fast, rather than allowing a huge number of slow connections and overloading the storage system. We must also run the wfmetadata collector and other things. If we overload the storage, we risk that the wfmetadata collector cannot do its job anymore, for example.
Moreover, faster downloads are less likely to break. As you know, the FDSN protocol does not have functionality to resume broken downloads.
DB is not accessed at all. The code iterates over Datamodel inventory (in RAM) and tries to match any elements against request lines. Matched elements are copied to new inventory, which is converted to FDSNXML and sent to client at the end.
Since there is only one network in the request, other networks are immediately skipped and not inspected, but it is still slow. In principle some further optimizations are possible.
We're talking here about this generator? I.e.
def networkIter(self, inv, matchTime=False):
    for i in xrange(inv.networkCount()):
        net = inv.network(i)
        for ro in self.streams:
            # network code
            if ro.channel and not ro.channel.matchNet(net.code()):
                continue
            # start and end time
            if matchTime and ro.time:
                try: end = net.end()
                except ValueError: end = None
                if not ro.time.match(net.start(), end):
                    continue
            yield net
            break
Some points I'm struggling with:
Since in this generator function you're simply looping over your networks, I assume your in-memory inventory has no tree-like data structure; hence your search is linear. In the second loop you iterate over all requested streams, i.e. once again linear, so your network lookup is O(n^2).
Once you have a network you do the same for stations (see the stationIter generator), locations and channels/streams. For level=channel|response this approach finally leads to a quadruple-loop lookup.
Even if you stored your inventory in a tree-like data structure, you would basically be reimplementing DB algorithms. So, for what reason are you loading your entire inventory into RAM? Speed? DBs usually use B-tree-like data structures optimized for lookups; a lookup is O(log n), hence much faster (see the sketch after these points).
The RESIF implementation seems to be able to respond within ~50 s when requesting their entire inventory with explicit stream epochs (~16000 stream epochs) (see attachment). I don't think that RESIF caches this type of request (they don't, because you can delete any line within this file and perform another request; the response time is still about the same). Instead they seem to access their DB directly. In the GFZ example above the request contained just a single network and resulted in a single request to GFZ.
The issue was originally reported against eida-federator. I admit that there may be lots of room for improvement. However, a single network code (assuming the network is located at one datacenter) leads to a single endpoint request (no concurrent requests). In such a case eida-federator never serves data faster than the endpoint itself. Hence, it is an endpoint issue rather than an eida-federator issue.
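To illustrate the lookup point above: indexing the in-memory inventory once by network code would already replace the per-request linear scan with a hash lookup. This is only a sketch against the inventory accessors visible in the generator above, not a patch for the SC3 code, and it ignores wildcards:

from collections import defaultdict

def build_network_index(inv):
    """Index networks by code once; subsequent lookups for exact codes
    cost O(1) instead of a scan over all networks per requested stream."""
    index = defaultdict(list)
    for i in range(inv.networkCount()):
        net = inv.network(i)
        index[net.code()].append(net)
    return index

def networks_for_codes(index, codes):
    # wildcarded codes would still require a scan; exact codes do not
    for code in codes:
        for net in index.get(code, []):
            yield net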
I don't see a real solution. We can increase the limits, but not infinitely.
@andres-h, there is a real solution. But it's more fundamental: access inventory data from the DB directly.
Attachment: resif-all-explicit.txt
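And a very rough sketch of what accessing inventory data from the DB directly could mean; the schema (a channel_epoch table and its columns) is entirely hypothetical and only stands in for whatever the actual inventory database looks like:

import sqlite3

# hypothetical inventory table: channel_epoch(net, sta, loc, cha, start, end)
QUERY = """
SELECT net, sta, loc, cha, start, end
FROM channel_epoch
WHERE net = ? AND sta = ? AND cha LIKE ?
"""

def lookup_channel_epochs(db_path, net, sta, cha_pattern):
    """Let the database (with an index on net, sta, cha) do the matching
    instead of scanning an in-memory inventory in Python."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(QUERY, (net, sta, cha_pattern)).fetchall()
    finally:
        con.close()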
So, for what reason are you loading your entire inventory into RAM? Speed?
First of all, the original author is @gempa-stephan, not me. I guess we don't have efficient functions to convert an FDSNWS request to SQL and then convert the result to the SC3 Datamodel.
DBs usually use B-tree-like data structures optimized for lookups; a lookup is O(log n).
Assuming you don't use wildcards. But a DB is faster anyway, because it is not written in Python. Another way to speed things up would be to implement the search in C++ and use SWIG.
The RESIF implementation seems to be able to respond within ~50 s when requesting their entire inventory with explicit stream epochs (~16000 stream epochs) (see attachment). I don't think that RESIF caches this type of request (they don't, because you can delete any line within this file and perform another request; the response time is still about the same). Instead they seem to access their DB directly.
RESIF doesn't use SC3 implementation. Their database is specifically optimized for FDSNWS. The code is AFAIK written in Java, not Python. And you don't use wildcards.
@andres-h, there is a real solution.
I meant there is no solution for avoiding error 502 completely (e.g. allowing an unlimited number of connections). For sure there is a lot of room for optimizations.
First of all, the original author is @gempa-stephan, not me.
Sorry, @andres-h, I didn't want to blame you. Let's say the author - whoever s/he was ;).
RESIF doesn't use SC3 implementation. Their database is specifically optimized for FDSNWS. The code is AFAIK written in Java, not Python.
If I remember right, they have a Python implementation. During the last ETC meeting at Grenoble somebody from RESIF mentioned Flask.
I used the RESIF example since they are the only ones within EIDA not using an SC3 webservice implementation. BTW: I'm not using wildcards in any of the examples.
I meant there is no solution for avoiding error 502 completely (e.g. allowing an unlimited number of connections).
Got it. :+1:
Thanks @Jollyfant,
BTW, do you still have the same email-address?
Closed with #83. Configurable request strategies are available now. Additional strategies might be implemented if required.
Thanks @Jollyfant,
BTW, do you still have the same email-address?
Yep!
Station requests are resolved to the stream level, and perhaps they should not be.
e.g.
will take a long time. Imagine doing this for full EIDA.