EarthScope / rover

ROVER: robust data access tool for FDSN data centers
https://earthscope.github.io/rover/

Timeout for large request #39

Closed timronan closed 6 years ago

timronan commented 6 years ago

Rover retrieve/subscribe should likely force every entry within a request file to query the data availability service, based on N_S_L_C and start time, before attempting to download data. If the service returns a 404 error, at minimum a warning should be given to the user. Ideally, the unavailable entry should be flagged, reported, and excluded from the rover retrieve/subscribe operations. Requests for unavailable files make large requests time out.

rover retrieve request.txt
retrieve  DEFAULT: Rover version 0.0.0 - starting retrieve
retrieve  DEFAULT: Status available at http://127.0.0.1:8000
retrieve  DEFAULT: Trying new retrieval (attempt 1 of 3)
retrieve CRITICAL: HTTPConnectionPool(host='service.iris.edu', port=80): Read timed out.
retrieve  DEFAULT: See "rover help retrieve"

Also, request file entries in the middle of the file that seek unavailable data likely cause Rover to stop before the request has been completed; the order of request file entries affects how Rover processes the request file and collects data. Rover should not be allowed to subscribe to unavailable streams, but it should be allowed to subscribe to future dates of preexisting streams. Subscribing to future dates seems to justify the functionality of rover subscribe.

This ticket attempts to follow pre-validation logic similar to that outlined in Issue #11.
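A minimal sketch of the proposed pre-validation, using only the Python standard library (rover itself uses the requests package). The request-file format of `NET STA LOC CHA start end` lines and the IRIS availability endpoint are assumptions here, and `prevalidate` is a hypothetical helper name, not rover code:

```python
import urllib.error
import urllib.parse
import urllib.request

AVAILABILITY_URL = "http://service.iris.edu/irisws/availability/1/query"

def parse_request_line(line):
    """Turn one 'NET STA LOC CHA start end' request line into query params."""
    net, sta, loc, cha, start, end = line.split()[:6]
    return {"net": net, "sta": sta, "loc": loc, "cha": cha,
            "starttime": start, "endtime": end,
            "format": "text", "nodata": "404"}

def prevalidate(request_file):
    """Warn about and skip entries for which the availability service
    reports no data (HTTP 204 or 404); return the remaining entries."""
    usable, skipped = [], []
    with open(request_file) as f:
        for line in (raw.strip() for raw in f):
            if not line or line.startswith("#"):
                continue
            url = AVAILABILITY_URL + "?" + urllib.parse.urlencode(
                parse_request_line(line))
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    empty = resp.status == 204
            except urllib.error.HTTPError as err:
                if err.code != 404:
                    raise
                empty = True
            if empty:
                print("WARNING: no data available, skipping:", line)
                skipped.append(line)
            else:
                usable.append(line)
    return usable, skipped
```

Only the `usable` entries would then be passed on to the retrieve/subscribe machinery, so one unavailable stream cannot stall the whole request.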

chad-earthscope commented 6 years ago

I'm pretty sure rover is requesting availability data before ever doing a data retrieval.

Is it possible that the timeout above is for the availability request itself?

What is the request.txt file that caused this result?

FWIW, a data request (fdsnws-dataselect) will not generally time out for a request that does not return any data. The case of no matching data is quickly resolved, and a 204 (or 404) is returned to the client. The cases where a max-24-hour data request to fdsnws-dataselect will time out before any data is returned are few and far between.

andrewcooke-isti commented 6 years ago

Can I have more info on the error here? There were some fixes last week that addressed non-200 return codes for web services.

timronan commented 6 years ago

The request.txt file is included in this comment. It is a massive request, and I do think that the timeout is occurring in the availability service: I executed an availability service query with the first two rows of the request.txt file, and the availability service timed out. So how do we make requests for very large data sets? It seems conceivable that a client would want to generate a very large local repository of all stations from one or multiple networks. Would it be better to run multiple rover retrieve/subscribe request.txt instances simultaneously? If so, see issue #30. request.txt
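One hedged workaround until very large requests are handled gracefully would be to split the request file into smaller batches and run rover retrieve on each in turn; the helper name and chunk size below are illustrative, not part of rover:

```python
def split_request_file(path, lines_per_chunk=50):
    """Split a rover request file into smaller files named
    path.part000, path.part001, ... and return their names."""
    with open(path) as f:
        lines = [l for l in f if l.strip() and not l.startswith("#")]
    chunks = []
    for i in range(0, len(lines), lines_per_chunk):
        name = "%s.part%03d" % (path, i // lines_per_chunk)
        with open(name, "w") as out:
            out.writelines(lines[i:i + lines_per_chunk])
        chunks.append(name)
    return chunks
```

Each resulting file (e.g. request.txt.part000) could then be passed to rover retrieve separately, at the cost of more round trips to the availability service.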

timronan commented 6 years ago

After retesting the case:

Middle request file entries seeking unavailable data likely cause Rover to stop before the request has been completed. The order of request file entries affects how Rover processes the request file and collects data.

The hypothesis above is likely untrue. Requests that trigger 404 errors seem to be transparent to Rover and are skipped. The earlier comment is likely associated with data availability timeouts for very large requests. The attached file, request_order.txt, was used to test and disprove the hypothesis quoted above.

request_order.txt

chad-earthscope commented 6 years ago

I've changed the title of this issue to be more appropriate.

The request.txt file is included in this comment. It is a massive request and I do think that the timeout is occurring in the availability service.

It appears something inconsistent is occurring. Testing the request like this:

curl --verbose -o result.txt --data-binary @request.txt 'http://service.iris.edu/irisws/availability/1/query?mergequality=true&mergesamplerate=true'

shows that the service begins to return data almost immediately, i.e. no timeout should be triggered. The result is truly large in volume (at least hundreds of megabytes; I haven't let it finish). More diagnostics are needed.

@timronan how did you test to get this timeout?

timronan commented 6 years ago

I received the timeout message

retrieve  DEFAULT: Rover version 0.0.0 - starting retrieve
retrieve  DEFAULT: Status available at http://127.0.0.1:8000
retrieve  DEFAULT: Trying new retrieval (attempt 1 of 3)
retrieve CRITICAL: HTTPConnectionPool(host='service.iris.edu', port=80): Read timed out.

by running rover retrieve request.txt.

After duplicating the test:

curl --verbose -o result.txt --data-binary @request.txt 'http://service.iris.edu/irisws/availability/1/query?mergequality=true&mergesamplerate=true'

I received the error message.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 128.95.166.47...
* TCP_NODELAY set
* Connected to service.iris.edu (128.95.166.47) port 80 (#0)
> POST /irisws/availability/1/query?mergequality=true&mergesamplerate=true HTTP/1.1
> Host: service.iris.edu
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 392
> Content-Type: application/x-www-form-urlencoded
> 
} [392 bytes data]
* upload completely sent off: 392 out of 392 bytes
< HTTP/1.1 200 OK
< Server: Apache-Coyote/1.1
< access-control-allow-origin: *
< content-disposition: inline; filename="irisws-availability_2018-09-10T20:54:06Z.txt"
< Content-Type: text/plain
< Transfer-Encoding: chunked
< Vary: Accept-Encoding
< Date: Mon, 10 Sep 2018 20:54:05 GMT
< Connection: close
< 
{ [6 bytes data]
100 1352M    0 1352M    0   392  1876k      0 --:--:--  0:12:18 --:--:--     0* Recv failure: Operation timed out
* stopped the pause stream!
100 1352M    0 1352M    0   392  1875k      0 --:--:--  0:12:18 --:--:--     0
* Closing connection 0
curl: (56) Recv failure: Operation timed out

I received a similar result to:

The service begins to return data almost immediately.

but after allowing the operation to finish processing I received the timeout error message described above.

I attempted to verify the timeout failure message by running the request

https://service.iris.edu/irisws/availability/1/query?format=text&net=IU,II,TT,UW,RE,UU,ER,AM&sta=*&loc=*&cha=*&starttime=2015-01-01T00:00:00&endtime=2020-01-01T00:00:00&mergequality=false&mergesamplerate=false&mergeoverlap=false&showlastupdate=false&excludetoolarge=true&nodata=404

over IRIS's availability web services.

The availability web service immediately returns information to the browser, but the browser slows down and seems to quit processing the request. I am not sure if a timeout message will eventually be generated from the web services. My browser always freezes in the middle of the request, so it is difficult to allow the process to complete.

chad-earthscope commented 6 years ago

The availability web service immediately returns information to the browser, but the browser slows down and seems to quit processing the request. I am not sure if a timeout message will eventually be generated from the web services. My browser always freezes in the middle of the request, so it is difficult to allow the process to complete.

If you mean a desktop web browser, that's because it is receiving a huge amount of data that it will try to display. This is normal, and it is why we use curl/wget for such testing.

Pedantically: the timeout message is not generated by the web services themselves. What you show above is a message from the TCP/IP stack and related system functions; it's a networking error.

At the moment, based on your curl test above, this is looking like an error in the servicing of the request, independent of rover.

chad-earthscope commented 6 years ago

I just tried the curl and got the whole thing, 2.8 GB.

$ curl --verbose -o result.txt --data-binary @request.txt 'http://service.iris.edu/irisws/availability/1/query?mergequality=true&mergesamplerate=true'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 128.95.166.47...
* TCP_NODELAY set
* Connected to service.iris.edu (128.95.166.47) port 80 (#0)
> POST /irisws/availability/1/query?mergequality=true&mergesamplerate=true HTTP/1.1
> Host: service.iris.edu
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 392
> Content-Type: application/x-www-form-urlencoded
> 
} [392 bytes data]
* upload completely sent off: 392 out of 392 bytes
< HTTP/1.1 200 OK
< Server: Apache-Coyote/1.1
< access-control-allow-origin: *
< content-disposition: inline; filename="irisws-availability_2018-09-12T16:38:26Z.txt"
< Content-Type: text/plain
< Transfer-Encoding: chunked
< Vary: Accept-Encoding
< Date: Wed, 12 Sep 2018 16:38:26 GMT
< Connection: close
< 
{ [4344 bytes data]
100 2881M    0 2881M  100   392  7880k      1  0:06:32  0:06:14  0:00:18 8512k

Hhhmm, requires more testing.

timronan commented 6 years ago

I retested and also received the entire curl response. Maybe the DMC internet connection went down during the first test.

tronan:datarepo_largesubscribe tronan$ curl --verbose -o result.txt --data-binary @request.txt 'http://service.iris.edu/irisws/availability/1/query?mergequality=true&mergesamplerate=true'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 128.95.166.47...
* TCP_NODELAY set
* Connected to service.iris.edu (128.95.166.47) port 80 (#0)
> POST /irisws/availability/1/query?mergequality=true&mergesamplerate=true HTTP/1.1
> Host: service.iris.edu
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 392
> Content-Type: application/x-www-form-urlencoded
> 
} [392 bytes data]
* upload completely sent off: 392 out of 392 bytes
< HTTP/1.1 200 OK
< Server: Apache-Coyote/1.1
< access-control-allow-origin: *
< content-disposition: inline; filename="irisws-availability_2018-09-13T22:58:02Z.txt"
< Content-Type: text/plain
< Transfer-Encoding: chunked
< Vary: Accept-Encoding
< Date: Thu, 13 Sep 2018 22:58:02 GMT
< Connection: close
< 
{ [1448 bytes data]
100 2884M    0 2884M    0   392  7475k      0 --:--:--  0:06:35 --:--:-- 7852k

After receiving the entire curl response, I ran rover retrieve request.txt and was returned the read timeout error.

tronan:Rover tronan$ rover retrieve request.txt 
retrieve  DEFAULT: Rover version 0.0.0 - starting retrieve
retrieve  DEFAULT: Status available at http://127.0.0.1:8000
retrieve  DEFAULT: Trying new retrieval (attempt 1 of 3)
retrieve CRITICAL: HTTPConnectionPool(host='service.iris.edu', port=80): Read timed out.
retrieve  DEFAULT: See "rover help retrieve"

This result has been verified multiple times.

timronan commented 6 years ago

A ReadTimeoutError exception is being thrown by requests.models.iter_content, leading to the timeout. I suspect this timeout is occurring because of a socket read timeout exception in the connection between Rover and the IRIS availability service. Rover successfully connects to the availability service and begins to pull chunks of data specified by rover retrieve request.txt; the data is then stored in a .txt file in the tmp directory. If a chunk of data is not pulled from the availability service within the number of seconds defined by the read timeout, then the connection between Rover and the availability service closes [following the urllib3 response._original_response.close() protocol]. After this connection closes, the http-timeout defined in rover.config begins its count. The request to the availability service must complete for rover to work.
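The per-read semantics described above can be demonstrated with a plain socket: urllib3 applies a settimeout-style bound to every recv of the chunked stream, so the clock restarts with each chunk rather than covering the whole download. The helper below is an illustration of that behavior, not rover code:

```python
import socket

def read_with_timeout(sock, nbytes, timeout):
    """Read up to nbytes, raising socket.timeout if a single recv()
    stalls longer than `timeout` seconds -- the same per-read bound
    that urllib3 applies to each chunk of a streamed response."""
    sock.settimeout(timeout)
    return sock.recv(nbytes)
```

If the server pauses between chunks for longer than the configured read timeout, one such recv() times out and the whole transfer is aborted, which matches the failure mode seen here.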

I am struggling to find the code that assigns the socket read timeout value (in seconds, or None). It seems like this value could be changed to prevent the connection between Rover and the availability service from being ended prematurely. I have tried changing the socket timeout values to None or a very large number in future.backports.socket, urllib3.response, and urllib3.connection, and am still receiving the timeout error.

I have also tried changing the size of each data chunk that is pulled over the HTTP request, setting very large chunk sizes in requests.models.iter_content, rover.utils._stream_output, and urllib3.response.read_chunked.

Are there any ideas on how to turn the read timeout exception off, or to change the socket read timeout value to a large number/None so the timeout exception will not be triggered? Or could we increase the size of each data chunk being pulled across the connection? It seems like the longer the connection between Rover and the IRIS availability service is open, the less likely the request is to succeed, especially during times of heavy server usage. [The request does randomly succeed sometimes, and it always stops at random lines within the request.]

I added a stack trace into the code at the read timeout error exception. It is included in this issue.

*** print_tb:
  File "/Users/tronan/tmp_rover/rover/env3/lib/python3.7/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/tronan/tmp_rover/rover/env3/lib/python3.7/site-packages/urllib3/response.py", line 461, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/Users/tronan/tmp_rover/rover/env3/lib/python3.7/site-packages/urllib3/response.py", line 665, in read_chunked
    self._original_response.close()
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/tronan/tmp_rover/rover/env3/lib/python3.7/site-packages/urllib3/response.py", line 336, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
*** print_exception:
Traceback (most recent call last):
  File "/Users/tronan/tmp_rover/rover/env3/lib/python3.7/site-packages/urllib3/response.py", line 331, in _error_catcher
    yield
  File "/Users/tronan/tmp_rover/rover/env3/lib/python3.7/site-packages/urllib3/response.py", line 640, in read_chunked
    chunk = self._handle_chunk(amt)
socket.timeout: timed out

andrewcooke-isti commented 6 years ago

The download is handled by the requests library. I'm pretty sure I exposed the timeout and the retries parameters: http-timeout and http-retries in settings. Those are passed to the library in get_to_file in utils.py.
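A sketch of how settings like http-retries typically map onto the requests API (this is illustrative; rover's get_to_file may wire them differently, and `make_session` is a hypothetical name):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(http_retries):
    """Build a session that retries failed connections, mirroring
    how an http-retries setting could be handed to requests."""
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=Retry(total=http_retries,
                                            backoff_factor=1))
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# The timeout setting is then passed per request, e.g.:
# response = make_session(3).get(url, stream=True, timeout=http_timeout)
```

Note that urllib3's Retry only re-issues the request on connection-level failures; a read timeout in the middle of a streamed body still aborts the transfer, which is consistent with the behavior reported above.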

(The reason I used requests is that it worked with Python 2 and 3, while the "raw" interface had changed and, IIRC, the backports solution gave errors (I can't remember the details, but there was some issue with using just the base Python libs). It's also a pretty common library with a good reputation...)

timronan commented 6 years ago

It seems that requests' _session().post() does not robustly interface with the IRIS availability service. We have altered the settings in rover.utils.post_to_file [Line 258] and its lower-level functions and still receive a timeout error for availability requests larger than approximately 1 GB. We have also changed the heap space on the availability service to allow large requests to be held in server memory, without a satisfactory result. We know the availability service is not the failure point, because the same requests are successfully returned by curl and wget.

Would it make sense to replace the lines that query the availability service in rover.utils.post_to_file [Line 257-259]

     with open(up, 'rb') as input:
         request = _session(retries).post(url, stream=True, data=input, timeout=timeout)
     return _stream_output(request, down, unique=unique)

with a wget or curl command?

Similar to

     wget_input = "wget --post-file=%s -O %s %s" % (up, down, url)
     request = os.system(wget_input)

The rover/tmp/rover_availability_response_XXX and rover/tmp/wget_file seem to return the same information formatted in the same way.

head rover/tmp/rover_availability_response_XXX
#n s     l  c                      earliest                      latest
PA BRU2  -- HHE 2017-01-01T00:00:00.000000Z 2017-01-01T00:02:01.990000Z
PA BRU2  -- HHE 2017-01-01T00:03:36.140000Z 2017-01-06T05:15:27.990000Z
PA BRU2  -- HHE 2017-01-06T05:17:19.000000Z 2017-01-08T20:25:15.200000Z
PA BRU2  -- HHE 2017-01-08T21:55:45.880000Z 2017-01-12T15:53:46.990000Z
PA BRU2  -- HHE 2017-01-12T15:55:38.000000Z 2017-01-13T14:00:42.990000Z
PA BRU2  -- HHE 2017-01-13T14:02:34.000000Z 2017-01-21T18:27:21.990000Z
PA BRU2  -- HHE 2017-01-21T18:29:13.000000Z 2017-01-22T12:46:05.990000Z
PA BRU2  -- HHE 2017-01-22T12:47:57.000000Z 2017-01-22T23:16:32.990000Z
PA BRU2  -- HHE 2017-01-22T23:18:24.000000Z 2017-01-28T23:52:32.990000Z
head rover/tmp/rover_availability_response_58c216_6839 
#n s     l  c                      earliest                      latest
PA BRU2  -- HHE 2017-01-01T00:00:00.000000Z 2017-01-01T00:02:01.990000Z
PA BRU2  -- HHE 2017-01-01T00:03:36.140000Z 2017-01-06T05:15:27.990000Z
PA BRU2  -- HHE 2017-01-06T05:17:19.000000Z 2017-01-08T20:25:15.200000Z
PA BRU2  -- HHE 2017-01-08T21:55:45.880000Z 2017-01-12T15:53:46.990000Z
PA BRU2  -- HHE 2017-01-12T15:55:38.000000Z 2017-01-13T14:00:42.990000Z
PA BRU2  -- HHE 2017-01-13T14:02:34.000000Z 2017-01-21T18:27:21.990000Z
PA BRU2  -- HHE 2017-01-21T18:29:13.000000Z 2017-01-22T12:46:05.990000Z
PA BRU2  -- HHE 2017-01-22T12:47:57.000000Z 2017-01-22T23:16:32.990000Z
PA BRU2  -- HHE 2017-01-22T23:18:24.000000Z 2017-01-28T23:52:32.990000Z

The downside to this solution is that rover could become less portable; I'm not positive this will work on a Windows machine. It might be possible to get a wget.exe file (OS dependent) and include it in the Rover package.

How is the dataselect service called? I cannot find it in rover/manager.py or rover/utils.py. Does it use a POST file?

chad-earthscope commented 6 years ago

Would it make sense to replace the lines that query the availability service in rover.utils.post_to_file ... with a wget or curl command?

Let's not add those as dependencies. If those tools can retrieve the information robustly and repeatedly then so can Python.
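A hedged sketch of such a pure-Python retry loop, using only the standard library (the function name, chunk size, and backoff are illustrative, not rover's actual post_to_file):

```python
import time
import urllib.request

def robust_post_to_file(url, post_body, path, attempts=3, timeout=300):
    """POST post_body (bytes) to url and stream the response to path,
    retrying the whole request if a read stalls. The per-read `timeout`
    keeps a dead connection from hanging forever."""
    for attempt in range(1, attempts + 1):
        try:
            req = urllib.request.Request(url, data=post_body)
            with urllib.request.urlopen(req, timeout=timeout) as resp, \
                 open(path, "wb") as out:
                while True:
                    chunk = resp.read(8192)
                    if not chunk:
                        return path
                    out.write(chunk)
        except OSError:  # socket.timeout and URLError are OSError subclasses
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```

One caveat: because the availability response is not resumable, each retry restarts the download from the beginning, so this only helps with transient stalls, not with a connection that reliably fails partway through.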

timronan commented 6 years ago

The timeout only occurs over wireless internet connections; requests will return a response of any size if the connection is stable and fast enough. It would be an interesting experiment to see if the availability service can finish returning data over various connections, including a wired connection with less bandwidth and wireless connections from various wifi sources.

The documentation could be updated to inform users of the importance of using a wired internet connection if they plan on downloading a large request. Although this seems obvious, it is easily overlooked.

timronan commented 6 years ago

This was associated with the internet connection and is not associated with Rover. Furthermore, Rover has a configurable timeout, through which both connection and read timeouts can be configured.