Azure / azure-cosmos-python

🚨🚨🚨This SDK is now maintained at https://github.com/Azure/azure-sdk-for-python 🚨🚨🚨
MIT License

Fix Bug#131: 0 rows returns when server return 429 on first page of results. #132

Open austindonnelly opened 6 years ago

austindonnelly commented 6 years ago

For background, read bug #131

This happens because base_execution_context._fetch_items_helper_with_retries() runs the request wrapped in a retry_utility._Execute() call, which performs retries. However, much lower down the call stack, additional retries are performed, because synchronized_request.SynchronizedRequest() also uses retry_utility.Execute() to do retries.

These two nested retry policies cause the following problem: when the innermost policy decides there have been too many failures, it re-raises the HTTPFailure exception (retry_utility.py:84). However, the outermost retry policy catches that (base_execution_context.py:132) and attempts another round of innermost retries. The first of these sees misleading request-processing state: it has received 0 rows of results and does not have a continuation token:

```python
# base_execution_context.py:116 (simplified)
fetched_items = []
while self._continuation or not self._has_started:
    # do the work
return fetched_items
```

In this case, self._continuation is None because this is the first page of the query and the server has not returned a continuation token. And self._has_started is True because of the first round of retries. So the while loop never executes, and it returns the empty array of fetched_items.

So the client (incorrectly) decides the query is complete and returns 0 rows. Bug.
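The failure mode described above can be sketched with a toy model (the class and method names here are illustrative stand-ins, not the SDK's actual code): once an inner round of retries has marked the query as started but produced no rows and no continuation token, a subsequent call to the fetch helper skips its loop entirely and returns an empty page.

```python
# Toy reproduction of the early-exit bug (hypothetical names, not the SDK).

class FetchState:
    def __init__(self):
        self._has_started = False   # set once the first page attempt is made
        self._continuation = None   # server-issued continuation token, if any

    def fetch_items_helper(self, fetch_page):
        """Mirrors the while-loop shape of base_execution_context.py:116."""
        fetched_items = []
        while self._continuation or not self._has_started:
            self._has_started = True
            page, self._continuation = fetch_page()
            fetched_items.extend(page)
            if fetched_items:
                break
        return fetched_items

state = FetchState()
# Simulate the first (inner-retry) round: it raised HTTPFailure after
# marking the query as started, without rows or a continuation token.
state._has_started = True

# The outer retry policy now re-invokes the helper: the loop condition is
# immediately False, so it returns [] and the client reports 0 rows,
# even though the server would have returned data on this attempt.
rows = state.fetch_items_helper(lambda: ([{"id": 1}], None))
print(rows)  # []
```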

This pull request:

It is safe to remove the outermost layer of retries, because the library code seems to have evolved as follows. Initially, only the outermost layer of retries existed. Then support for failing queries over to another geo-location was added by putting retries at the bottom level (in synchronized_request; see commit 9d1bac3), but the uppermost layer of retries was not removed at that point. So the code now does nested retries, and only the innermost layer is actually needed.

Also, notice that nested retries are themselves another bug: when user code specifies retry options, they apply to both the inner and outer retry policies and get compounded. We have tests which cover this (test/retry_policy_tests.py), but they mock out the retry utility's _Execute function and so bypass the nesting, which makes the tests pass despite the fact that, when run for real, the code won't do what the user expects. E.g., a retry limit of 8 calls turns into 8 * 8 = 64!
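The compounding effect is easy to demonstrate with a toy sketch (again, not the SDK's code): two nested wrappers that each honour the same user-specified budget of 8 multiply into 64 attempts against the server.

```python
# Toy sketch of nested retry policies compounding a budget of 8 into 64.

class AlwaysThrottled(Exception):
    """Stand-in for an HTTPFailure carrying a 429 status."""

attempts = 0

def raw_request():
    # Stand-in for the actual HTTP call; the server throttles every time.
    global attempts
    attempts += 1
    raise AlwaysThrottled("HTTP 429")

def with_retries(fn, max_attempts):
    """Stand-in for retry_utility._Execute: try up to max_attempts times."""
    def wrapped():
        for i in range(max_attempts):
            try:
                return fn()
            except AlwaysThrottled:
                if i == max_attempts - 1:
                    raise  # budget exhausted: re-raise to the caller
    return wrapped

# Inner policy (as in synchronized_request) and outer policy
# (as in base_execution_context) each apply the same budget of 8.
inner = with_retries(raw_request, 8)
outer = with_retries(inner, 8)

try:
    outer()
except AlwaysThrottled:
    pass

print(attempts)  # 64: the two budgets multiply instead of applying once
```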

Users may have inadvertently been relying on this very large number of retries, which is why the default retry count has been increased: not to 64, but doubled (from 8 to 16).

Testing: all existing tests pass, and the newly added test passes with the fix and fails without it. Additionally, we've been running this against a live Cosmos DB server and it's working nicely for us.

christopheranderson commented 5 years ago

Manually started CI on this PR: https://cosmos-db-sdk-public.visualstudio.com/cosmos-db-sdk-public/_build/results?buildId=1282

christopheranderson commented 5 years ago

Failed with this error:

```
================================== FAILURES ===================================
_______________ Test_retry_policy_tests.test_429_on_first_page ________________

self = <test.retry_policy_tests.Test_retry_policy_tests testMethod=test_429_on_first_page>

    def test_429_on_first_page(self):
>       client = document_client.DocumentClient(Test_retry_policy_tests.host, {'masterKey': Test_retry_policy_tests.masterKey})
E       NameError: global name 'document_client' is not defined

D:\a\1\s\test\retry_policy_tests.py:254: NameError
```

HyeonGook commented 4 years ago

Any updates?