Potential threading bug around use of the "patch" method

ethagnawl commented 1 week ago

I have read the docs thoroughly before making this bug report.

[X] Yes

I have read through other open issues, and my issue is not a duplicate.

[X] Yes

What version of the library you are using?

0.10.3

Which python version are you using?

3.12

What operating system are you on?

Ubuntu

Description

I have run into an edge case which may result from this project's use of patch in a way that's not threadsafe. It's also totally possible that this error masking or the result of an underlying application error.

This occurs with some amount of predictability when using the Django development web server and spamming the server with many requests in parallel. It varies by environment but when running in Docker on MacOS, it's ~3 and in Docker on Debian/Ubuntu, it's ~9. I realize this will vary widely depending on the project, its resource usage, etc. but I share these details in order to provide as much useful context as possible.

I am not able to reproduce this error when using the Django development web server with the --nothreading flag or when using one or more Gunicorn workers running synchronously.

Here's the unhandled exception:

api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/query_optimizer/compiler.py", line 62, in optimize_single
api_1  |     fetch_in_context(queryset)
api_1  |     │                └ <QuerySet [...]
api_1  |     └ <function fetch_in_context at 0x7fde16708040>
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/query_optimizer/prefetch_hack.py", line 94, in fetch_in_context
api_1  |     with fetch_context():
api_1  |          └ <function fetch_context at 0x7fde16707f60>
api_1  |   File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
api_1  |     self.gen.throw(value)
api_1  |     │    │   │     └ StopIteration()
api_1  |     │    │   └ <method 'throw' of 'generator' objects>
api_1  |     │    └ <generator object fetch_context at 0x7fde14cf05f0>
api_1  |     └ <contextlib._GeneratorContextManager object at 0x7fde0c21d8b0>
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/query_optimizer/prefetch_hack.py", line 86, in fetch_context
api_1  |     with _HACK_CONTEXT if _PREFETCH_HACK_CACHE else nullcontext():
api_1  |          │                │                         └ <class 'contextlib.nullcontext'>
api_1  |          │                └ <WeakKeyDictionary at 0x7fde166a9610>
api_1  |          └ <unittest.mock._patch object at 0x7fde166a9640>
api_1  |   File "/usr/local/lib/python3.12/unittest/mock.py", line 1594, in __exit__
api_1  |     if self.is_local and self.temp_original is not DEFAULT:
api_1  |        │                 │                         └ sentinel.DEFAULT
api_1  |        │                 └ <unittest.mock._patch object at 0x7fde166a9640>
api_1  |        └ <unittest.mock._patch object at 0x7fde166a9640>
api_1  | 
api_1  | AttributeError: '_patch' object has no attribute 'is_local'. Did you mean: 'has_local'?

That error was surfaced in detail by removing the error massaging from here and just re-raising the exception. If that error is "handled" by the graphql library, it eventually bubbles up to the client with an inscrutable error which looks like:

{
  "errors": [
    {
      "message": "'_patch' object has no attribute 'is_local'",
      "locations": [
        {
          "line": 101,
          "column": 3
        }
      ],
      "path": [
        "some_obj"
      ]
    }
  ],
  "data": {
    "some_obj": null
  }
}

I will attempt to reproduce this using the example project in this repo later today and follow up.

Other potentially relevant environment details:

Python dependencies

Django==4.2.11
graphene-django==3.2.0
graphene-django-query-optimizer==0.10.3
graphql-core==3.2.5
graphql-relay==3.2.0

Please let me know if there's any additional information I can provide.

MrThearMan commented 1 week ago

Hi, thanks for the issue

The likely cause of this is the fact that the same unittest.mock._patch object _HACK_CONTEXT is being reused as a context manager on top of each itself. This leads to a race condition where is_local has been deleted from the patch object by the inner patch (see patch.__exit__) while the outer patch still tries to access it (also in patch.__exit__).

From unittest.mock._patch (parts omitted for brevity):

def __exit__(self, *exc_info):
    if self.is_local and self.temp_original is not DEFAULT:  # Access here
    ...
    del self.is_local  # deleted from the object here
    ...

I think this can be fixed simply by creating a new patch object for each context. I released this fix in 0.10.4.

However, I don't know if this fully fixes the issue due to how query_optimizer.prefetch_hack.fetch_context is currently implemented. It clears _PREFETCH_HACK_CACHE when exiting the context, meaning the other context that requires it might not be optimized correctly. The prefetch hack is only required when using nested connection fields for many-to-many related fields, and this issue would only happen if multiple of those are queried in the same query.

Can you verify if this issue still applies in your context after the fix in 0.10.4? I'll investigate the latter issue and post an update later.

MrThearMan commented 6 days ago

I wasn't able to replicate this in my testing. I created the scenario I described where we have a nested many-to-many connection field within another nested many-to-many connection field, but this didn't lead to any errors with the shared patch object. Added the tests to the repo still, maybe they'll catch a bug in the future.

ethagnawl commented 6 days ago

Thanks for the prompt response and release, @MrThearMan!

I'm in the process of testing this now. Interestingly, I no longer see that exact error message but I do see another and I think it's (inscrutably) referencing the same line number as before. That may be a coincidence or it may be how the graphql library is constructing the errors included in the response.

{"errors":[{"message":"An unknown error occurred.","locations":[{"line":101,"column":3}],"path":["some_obj"]}],"data":{"some_obj":null}}

I'm going to make the same change I made earlier to the graphql library and see if it reveals anything new or different. I will follow up shortly.

ethagnawl commented 6 days ago

After some more digging into graphl-core, I've determined that the following StopIteration exception is being hit here and the exception contains an empty message, which explains that inscrutable (empty) error referenced above.

api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/graphene_django/debug/middleware.py", line 65, in resolve
api_1  |     result = next(root, info, **args)
api_1  |              ^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/graphene/relay/node.py", line 106, in node_resolver
api_1  |     return cls.get_node_from_global_id(info, id, only_type=only_type)
api_1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/graphene/relay/node.py", line 131, in get_node_from_global_id
api_1  |     return get_node(info, _id)
api_1  |            ^^^^^^^^^^^^^^^^^^^
api_1  |   File "/app/my_app/schemas/foos.py", line 112, in get_node
api_1  |     return get_authorized_node(cls, info, id)
api_1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/app/my_app/schemas/__init__.py", line 809, in get_authorized_node
api_1  |     maybe_instance = optimize_single(
api_1  |                      ^^^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/query_optimizer/compiler.py", line 62, in optimize_single
api_1  |     fetch_in_context(queryset)
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/query_optimizer/prefetch_hack.py", line 96, in fetch_in_context
api_1  |     return list(queryset)  # the database query is executed here
api_1  |            ^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/django/db/models/query.py", line 398, in __iter__
api_1  |     self._fetch_all()
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/django/db/models/query.py", line 1883, in _fetch_all
api_1  |     self._prefetch_related_objects()
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/django/db/models/query.py", line 1273, in _prefetch_related_objects
api_1  |     prefetch_related_objects(self._result_cache, *self._prefetch_related_lookups)
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/django/db/models/query.py", line 2321, in prefetch_related_objects
api_1  |     obj_list, additional_lookups = prefetch_one_level(
api_1  |                                    ^^^^^^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/django/db/models/query.py", line 2463, in prefetch_one_level
api_1  |     ) = prefetcher.get_prefetch_queryset(instances, lookup.get_current_queryset(level))
api_1  |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/django/db/models/fields/related_descriptors.py", line 740, in get_prefetch_queryset
api_1  |     queryset = _filter_prefetch_queryset(queryset, self.field.name, instances)
api_1  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/usr/local/lib/python3.12/unittest/mock.py", line 1139, in __call__
api_1  |     return self._mock_call(*args, **kwargs)
api_1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/usr/local/lib/python3.12/unittest/mock.py", line 1143, in _mock_call
api_1  |     return self._execute_mock_call(*args, **kwargs)
api_1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/usr/local/lib/python3.12/unittest/mock.py", line 1204, in _execute_mock_call
api_1  |     result = effect(*args, **kwargs)
api_1  |              ^^^^^^^^^^^^^^^^^^^^^^^
api_1  |   File "/home/appuser/.local/lib/python3.12/site-packages/query_optimizer/prefetch_hack.py", line 65, in _prefetch_hack
api_1  |     cache: _PrefetchCacheType = next(iter(_PREFETCH_HACK_CACHE.values()))
api_1  |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1  | StopIteration

I don't have any thoughts as to the why of this just yet ...

MrThearMan commented 4 days ago

I made another attempt at fixing this in 0.10.5. This fix binds the required table relations for the prefetch hack in the context itself, so it shouldn't run into the same race condition where _PREFETCH_HACK_CACHE has been cleared by one context while another tries to access it.

ethagnawl commented 2 days ago

Thanks so much, @MrThearMan. At first blush, 0.10.5 does seem to solve the problem. I was just able to verify that a script spawning 24 processes which spam the server (Debian/Docker) does not result in the same behavior that I was seeing consistently prior to the latest release.

I will ask my colleagues who were seeing this issue on macOS to verify today.

MrThearMan / graphene-django-query-optimizer