biolab / orange3-imageanalytics

🍊 Orange3 add-on for dealing with image-related tasks

Resending requests in batches >= 1000 #148

Closed by matjazp 5 years ago

matjazp commented 5 years ago

I've been testing a new version of the backend. There seems to be a problem with sending more than 999 images.

Sending 1000 images throws an exception, and the last 100+ requests in the final "minibatch" (up to the HTTP/2 max_concurrent_streams limit, usually 128) are sent again (seen in the logs as retry=1).

$ python3 -u embedders_test_single.py --url $URL:443 --embedder inception-v3 --time 1 --batch 1000
Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:38:51.545519] Start testing inception-v3 with 1000 images
Maximum number of http2 requests through a single connection exceeded
Traceback (most recent call last):
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/h2/connection.py", line 241, in process_input
    func, target_state = self._transitions[(self.state, input_)]
KeyError: (<ConnectionState.CLOSED: 3>, <ConnectionInputs.RECV_WINDOW_UPDATE: 13>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/orangecontrib/imageanalytics/http2_client.py", line 138, in _get_json_response_or_none
    response_raw = self._server_connection.get_response(stream_id)
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/hypertemp/http20/connection.py", line 305, in get_response
    return HTTP20Response(stream.getheaders(), stream)
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/hypertemp/http20/stream.py", line 240, in getheaders
    self._recv_cb(stream_id=self.stream_id)
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/hypertemp/http20/connection.py", line 795, in _recv_cb
    self._single_read()
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/hypertemp/http20/connection.py", line 685, in _single_read
    events = conn.receive_data(data)
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/h2/connection.py", line 1531, in receive_data
    events.extend(self._receive_frame(frame))
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/h2/connection.py", line 1554, in _receive_frame
    frames, events = self._frame_dispatch_table[frame.__class__](frame)
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/h2/connection.py", line 1764, in _receive_window_update_frame
    ConnectionInputs.RECV_WINDOW_UPDATE
  File "/home/primoz/miniconda3/envs/test/lib/python3.7/site-packages/h2/connection.py", line 246, in process_input
    "Invalid input %s in state %s" % (input_, old_state)
h2.exceptions.ProtocolError: Invalid input ConnectionInputs.RECV_WINDOW_UPDATE in state ConnectionState.CLOSED
[2019-09-14 22:39:29.376631] 1000 images processed in 37.83101487159729 seconds.

Everything is OK when the batch size is 999:

$ python3 -u embedders_test_single.py --url $URL:443 --embedder inception-v3 --time 1 --batch 999
Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:53:29.781097] Start testing inception-v3 with 999 images
[2019-09-14 22:54:07.626622] 999 images processed in 37.8454225063324 seconds.

Everything is also OK even if I open 2 or even 4 parallel connections with 999 images each.

$ time parallel -j2 -N0  python3 -u embedders_test_single.py --url $URL:443 --embedder inception-v3 --time 1 --batch 999 ::: {1..2}
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:59:43.619251] Start testing inception-v3 with 999 images
[2019-09-14 23:00:35.665483] 999 images processed in 52.04614186286926 seconds.
Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:59:43.621728] Start testing inception-v3 with 999 images
[2019-09-14 23:00:35.641074] 999 images processed in 52.019267320632935 seconds.

real    1m1.927s
user    0m51.652s
sys 0m5.364s
$ time parallel -j4 -N0  python3 -u embedders_test_single.py --url $URL:443 --embedder inception-v3 --time 1 --batch 999 ::: {1..4}

Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:57:06.532381] Start testing inception-v3 with 999 images
[2019-09-14 22:58:22.888127] 999 images processed in 76.35568571090698 seconds.
Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:57:06.588048] Start testing inception-v3 with 999 images
[2019-09-14 22:58:23.005568] 999 images processed in 76.41748428344727 seconds.
Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:57:06.368641] Start testing inception-v3 with 999 images
[2019-09-14 22:58:26.293262] 999 images processed in 79.9246015548706 seconds.
Test will run for: 1 seconds
Script will test inception-v3
Script will use generated images
[2019-09-14 22:57:06.431489] Start testing inception-v3 with 999 images
[2019-09-14 22:58:26.296569] 999 images processed in 79.86498737335205 seconds.

real    1m29.739s
user    1m45.811s
sys 0m9.065s

A quick fix appears to be to send images to the server in batches of no more than 999 per connection and/or to open parallel connections. We should also replace hyper (see #146).
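
For illustration, a minimal sketch of that quick fix, assuming a hypothetical client with reconnect() and send_batch() methods (the real http2_client API may differ):

# Sketch of the workaround: never push more than 999 requests through a
# single HTTP/2 connection; reconnect before starting the next chunk.
MAX_REQUESTS_PER_CONNECTION = 999

def embed_in_chunks(client, images):
    embeddings = []
    for start in range(0, len(images), MAX_REQUESTS_PER_CONNECTION):
        chunk = images[start:start + MAX_REQUESTS_PER_CONNECTION]
        client.reconnect()                        # fresh connection per chunk
        embeddings.extend(client.send_batch(chunk))
    return embeddings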

PrimozGodec commented 5 years ago

@matjazp good find. I knew this was happening, but I didn't know it was so predictable (it fails at every 1000th request). I am still waiting for parallel requests to be implemented in httpx (https://github.com/encode/httpx/pull/52) before switching to httpx. If that doesn't happen soon, I will implement an automatic reconnect after 999 images.
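
For reference, a rough sketch of how parallel requests could look with the asyncio-based API that recent httpx versions provide (the exact interface was still being discussed in the linked PR at the time; the URL and payloads below are placeholders):

import asyncio
import httpx

async def embed_all(url, payloads, concurrency=128):
    # Cap in-flight requests roughly at the server's max_concurrent_streams.
    sem = asyncio.Semaphore(concurrency)

    async def post_one(client, data):
        async with sem:
            resp = await client.post(url, content=data)
            resp.raise_for_status()
            return resp.json()

    # http2=True requires the h2 package to be installed alongside httpx.
    async with httpx.AsyncClient(http2=True) as client:
        return await asyncio.gather(*(post_one(client, p) for p in payloads))

# embeddings = asyncio.run(embed_all("https://example.invalid/embed", payloads))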

matjazp commented 5 years ago

If this is a new feature for httpx, we should test it thoroughly before using it in production. The low-hanging fruit is to just send images in batches of 999. That should be a fairly simple change, yes? I vote for implementing it as soon as possible, since the current behaviour always sends extra load over the network...
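
One shape that fairly simple change could take is a request counter inside the client that reconnects transparently before the server's per-connection limit (hit at 1000 requests in the tests above) is reached. This is only a sketch; the class and method names are hypothetical, not the actual http2_client interface:

class ReconnectingClient:
    # Stay below the observed limit of 1000 requests per HTTP/2 connection.
    MAX_REQUESTS_PER_CONNECTION = 999

    def __init__(self, connect):
        self._connect = connect           # callable that opens a fresh connection
        self._conn = connect()
        self._sent = 0

    def request(self, payload):
        if self._sent >= self.MAX_REQUESTS_PER_CONNECTION:
            self._conn.close()
            self._conn = self._connect()  # automatic reconnect
            self._sent = 0
        self._sent += 1
        return self._conn.request(payload)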

matjazp commented 5 years ago

Also see the discussion at https://github.com/encode/httpx/issues/258.