My implementation of dataselect is currently in a branch called s3_dataselect in my fdsn fork. I've only tested it against a relatively small S3 bucket, but I suggest investigating the following queries. I'm using curl to specifically test the GET and POST downloads; ObsPy uses GET and a restricted set of features.
This is a specific GET query (net/sta/loc/cha) but with a large time range. It involves reading a large number of files from S3 but only a relatively small file listing. It should produce a large miniSEED output file, but the server should handle it without any issues.
curl "<server-host>/fdsnws/dataselect/1/query?net=NZ&sta=CHST&loc=01&cha=LOG&start=2001-01-01T00:00:00&end=2017-01-09T23:00:00" -o test.mseed
This is a GET query that spans many different networks or stations (multiple comma-delimited stations). This is probably the worst-case scenario, due to the way the files are named in S3 and the 'prefix' required to list them: it requires listing and filtering all filenames in the bucket regardless of location/channel/time range. I would expect most of the time to be spent in the ListObjectsV2() call, which can only return 1000 keys at a time, so listing thousands or millions of files will be time-consuming (see the pagination sketch after the example below). Solutions and alternatives are mentioned below.
curl "<server-host>/fdsnws/dataselect/1/query?net=NZ&sta=ALRZ,CHST&loc=01&cha=LOG&start=2017-01-01T00:00:00&end=2017-01-09T23:00:00" -o test.mseed
This is a large POST request containing many queries. Each query is evaluated independently on the server and the outputs are merged. A large number of specific queries may be quicker than the comma-delimited query mentioned above. This type of query may produce a very large output file, but the dataselect service should be able to handle many of these requests concurrently:
curl -v --data-binary @post_input.txt http://<server-host>/fdsnws/dataselect/1/query -o test_post.mseed
The contents of the input file post_input.txt would be in the following format (according to the FDSN spec):
NZ ALRZ 10 EHN 2017-01-09T00:00:00 2017-01-09T02:00:00
NZ ALRZ * EHN 2017-01-09T00:00:00 2017-05-28T02:00:00
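For reference, a minimal sketch of parsing that line format into independent queries (the query struct and function names are illustrative, not the actual server types; real POST bodies may also carry key=value parameter lines, which this skips):

```go
// Parse FDSN-style POST body lines: NET STA LOC CHA START END.
// Each line becomes one independent query; the server evaluates them
// separately and merges the miniSEED output.
package main

import (
	"bufio"
	"fmt"
	"strings"
	"time"
)

type query struct {
	Net, Sta, Loc, Cha string
	Start, End         time.Time
}

func parsePostBody(body string) ([]query, error) {
	var queries []query
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.Contains(line, "=") {
			continue // skip blank lines and key=value parameter lines
		}
		f := strings.Fields(line)
		if len(f) != 6 {
			return nil, fmt.Errorf("expected 6 fields, got %d: %q", len(f), line)
		}
		start, err := time.Parse("2006-01-02T15:04:05", f[4])
		if err != nil {
			return nil, err
		}
		end, err := time.Parse("2006-01-02T15:04:05", f[5])
		if err != nil {
			return nil, err
		}
		queries = append(queries, query{f[0], f[1], f[2], f[3], start, end})
	}
	return queries, sc.Err()
}

func main() {
	body := "NZ ALRZ 10 EHN 2017-01-09T00:00:00 2017-01-09T02:00:00\n" +
		"NZ ALRZ * EHN 2017-01-09T00:00:00 2017-05-28T02:00:00\n"
	qs, err := parsePostBody(body)
	if err != nil {
		panic(err)
	}
	for _, q := range qs {
		fmt.Printf("%+v\n", q)
	}
}
```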
These queries should be a good starting point for finding any weak or strong points in the implementation.
Debugging these downloads:
If an error occurs, the error message is returned in plain text and the output file will contain it. The message should make it obvious what happened (e.g. "Too many files in request: 11190, max: 3000").
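To tell such a plain-text error apart from real data, you can sniff the first record header. A minimal sketch, assuming the download was saved as test.mseed as in the curl examples above; it relies on the miniSEED fixed header starting with an ASCII sequence number followed by a quality code (D, R, Q or M):

```go
// Sniff the first record header of a dataselect download: a miniSEED
// record begins with a 6-character ASCII sequence number (digits,
// possibly space-padded) and a data quality code; a plain-text server
// error will not.
package main

import (
	"fmt"
	"log"
	"os"
)

func looksLikeMiniSEED(header []byte) bool {
	if len(header) < 7 {
		return false
	}
	for _, c := range header[:6] {
		if (c < '0' || c > '9') && c != ' ' {
			return false
		}
	}
	q := header[6]
	return q == 'D' || q == 'R' || q == 'Q' || q == 'M'
}

func main() {
	buf, err := os.ReadFile("test.mseed")
	if err != nil {
		log.Fatal(err)
	}
	if looksLikeMiniSEED(buf) {
		fmt.Println("looks like miniSEED")
		return
	}
	// Most likely the plain-text error from the service; show the start.
	n := len(buf)
	if n > 200 {
		n = 200
	}
	fmt.Printf("server error? first bytes:\n%s\n", buf[:n])
}
```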
Workarounds:
The large comma-delimited GET request for multiple stations could be implemented as several queries using the POST approach. These use a more specific 'prefix', so they shouldn't require such a large file listing.
It is common practice to keep a key index for S3 (e.g. SimpleDB populated by a Lambda function). Instead of listing keys, the service could query this index for all matching keys. If listing is identified as a bottleneck and a higher-performance query is required, we could implement this; a possible shape is sketched below.
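One possible shape for such an index: a Lambda function subscribed to the bucket's object-created events writes each new key into SimpleDB, and dataselect queries the index instead of calling ListObjectsV2. This is a sketch only; the domain name, attribute layout, and trigger wiring are assumptions:

```go
// Sketch of a Lambda handler that keeps a SimpleDB key index in sync
// with the bucket: every S3 object-created event writes the new object
// key into the index so dataselect can query it instead of listing.
package main

import (
	"context"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/simpledb"
)

var sdb = simpledb.New(session.Must(session.NewSession()))

func handler(ctx context.Context, e events.S3Event) error {
	for _, rec := range e.Records {
		// One item per object key; net/sta/loc/cha could be split out of
		// the key into separate attributes for finer-grained queries.
		_, err := sdb.PutAttributesWithContext(ctx, &simpledb.PutAttributesInput{
			DomainName: aws.String("miniseed-key-index"), // hypothetical domain
			ItemName:   aws.String(rec.S3.Object.Key),
			Attributes: []*simpledb.ReplaceableAttribute{
				{Name: aws.String("bucket"), Value: aws.String(rec.S3.Bucket.Name)},
			},
		})
		if err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```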
@nbalfour do you know if there are any outstanding issues regarding the FDSN service, or can we close this old issue?
@sue-h-gns I think this is being worked on elsewhere and is captured in this ticket: https://github.com/GeoNet/tickets/issues/5138. So, in my opinion, we should close this ticket.
As a tester
I want to run appropriate tests of the dataselect service
So that I can make sure the service is fulfilling the requirements of our end users
Acceptance criteria: