michael-k opened this issue 4 years ago (status: Open)
Hi @michael-k Thanks for bringing this to our notice. Will look into the issue shortly.
Hi @michael-k Not sure if this would happen due to a race condition, since the merge operation uses a lock while setting the reservoir to None. I'm still investigating this.
How often do you see this error? Can you provide some logs as well?
Thanks
Not sure if this would happen due to a race condition, since the merge operation uses a lock while setting the reservoir to None.
I'm thinking more about this scenario, rather than something happening while the merge is in progress. But I didn't really look into the code.
How often do you see this error?
We've seen it once in the last 30 days. I cannot give you precise overall numbers due to the retention policy of our plan at Sentry, but I'd say it didn't happen more than every other month.
Can you provide some logs as well?
What kind of logs are you thinking about? I don't think we have more than the stack trace, but can try to collect some in the future if that helps.
We've seen it once in the last 30 days. I cannot give you precise overall numbers due to the retention policy of our plan at Sentry, but I'd say it didn't happen more than every other month.
I see. I'll try to simulate the scenario and see if it runs into an issue; however, a single error in a month is pretty low to be conclusive.
What kind of logs are you thinking about? I don't think we have more than the stack trace, but can try to collect some in the future if that helps.
You can configure the SDK logs to debug level (although it would increase your log size quite a bit before we see this error). Also, FWIW, please provide the stack trace.
please provide the stack trace
It's already in the initial comment. Here's all the information I have:
I've seen this twice in the last 24 hours in an aiohttp app running Python 3.6.12 and X-Ray SDK version 2.6.0, with the following debug logs including the stack trace:
```
myapp_logs i-0412be596c885e575 2021-02-23 12:43:20,512 DEBUG http://127.0.0.1:2000 "POST /GetSamplingRules HTTP/1.1" 200 1758
myapp_logs i-0412be596c885e575 2021-02-23 12:43:20,513 DEBUG Rule General is selected to make a sampling decision.
myapp_logs i-0412be596c885e575 2021-02-23 12:43:20,514 ERROR Error handling request
Traceback (most recent call last):
  File "/opt/myapp/env/local/lib64/python3.6/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/opt/myapp/env/local/lib64/python3.6/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/opt/myapp/env/local/lib64/python3.6/site-packages/aiohttp/web_middlewares.py", line 119, in impl
    return await handler(request)
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/ext/aiohttp/middleware.py", line 34, in middleware
    sampling_req=sampling_req,
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/ext/util.py", line 65, in calculate_sampling_decision
    decision = recorder.sampler.should_trace(sampling_req)
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/core/sampling/sampler.py", line 72, in should_trace
    return self._process_matched_rule(matched_rule, now)
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/core/sampling/sampler.py", line 104, in _process_matched_rule
    decision = reservoir.borrow_or_take(now, rule.can_borrow)
AttributeError: 'NoneType' object has no attribute 'borrow_or_take'
```
Hey @alanc10n
Apologies for the delayed response. I'm trying to reproduce the issue using aiohttp but haven't had any success yet. I'm not sure why and when the `_reservoir` for a sampling rule is not initialized and is None.
Are you using centralized (defined in X-Ray console) or local sampling rules in your application? Have you encountered this error since?
I'll continue to take a look into it and post an update. Thanks.
I'm using centralized sampling rules; there are several defined in the console including "General", the rule that appears in the logs immediately prior to the error. Thanks.
I see this issue in my Sentry as well. It can fault/kill an entire HTTP request or celery worker because of an unhandled AttributeError.
I am using defaults (centralized sampling rules, only the default catch-all rule for x-ray).
Hi @alanc10n @autumnjolitz, Is it possible for you to provide simple repro code that consistently reproduces this issue? I am trying to experiment with some configurations to reproduce the issue by isolating the threading logic in the SDK, but so far have been unable to. Thanks.
@srprash I'll try to get a simple case that can reproduce this, but given how infrequently it happens I'm not optimistic. I'll post an update here if it works.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.
Still seeing this issue. It happens about once a day, sometimes twice. Are you sure you've got your rule merging code down pat? I suspect you're setting the reservoir to None before setting it to something else (or a property is acting to that effect). Python can switch threads between opcodes, so the only safe options are either to protect a thread-shared variable with a lock, or to only ever do a single assignment that swaps one fully initialized reservoir for a new fully initialized one (a single operation).
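As a rough illustration of those two safe-update patterns (the class names below are stand-ins for this sketch, not the SDK's actual classes):

```python
import threading

class Reservoir:
    """Stand-in reservoir for this sketch; only borrow_or_take matters here."""
    def borrow_or_take(self, now, can_borrow):
        return "take"

class Rule:
    def __init__(self):
        self._lock = threading.Lock()
        self._reservoir = Reservoir()

    # Option 1: every read and write of the shared attribute goes through a lock.
    def replace_reservoir_locked(self, new_reservoir):
        with self._lock:
            self._reservoir = new_reservoir

    def borrow_or_take_locked(self, now, can_borrow):
        with self._lock:
            return self._reservoir.borrow_or_take(now, can_borrow)

    # Option 2: never publish an intermediate None. Build the replacement fully,
    # then swap it in with one attribute assignment (a single STORE_ATTR opcode),
    # so readers see either the old, still-valid reservoir or the new one.
    def replace_reservoir_atomic(self, new_reservoir):
        self._reservoir = new_reservoir

rule = Rule()
rule.replace_reservoir_atomic(Reservoir())   # the old object is never mutated to None
print(rule.borrow_or_take_locked(now=0, can_borrow=True))
```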
Hi @autumnjolitz
Thanks for the suggestions. I haven't been able to reproduce this issue, but looking again at the `RuleCache` class's `get_matched_rule` and `_load_rules` methods, I have a theory that points to a race condition causing this issue.

The issue may be happening when rule matching and rule refreshing coincide. Suppose a request comes in and sampling rule `foo` from the current `RuleCache` is matched to sample the request. At the same time, the `RulePoller` kicks off its periodic fetch and merge of sampling rules from the X-Ray backend, and during this merge operation it updates the `RuleCache` with new `SamplingRule` objects and sets the old rule's (`foo` in this case) `reservoir` to `None`. However, the rule matching logic has retained a reference to the old rule object, whose reservoir has now been set to `None`, and this causes the `AttributeError`.

I don't yet know the reason behind setting the old rule's `reservoir` to `None`. Maybe I can start off by removing that line. Also, I think the `self._rules` property of the `RuleCache` should be protected by a lock.
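To make the suspected interleaving concrete, here is a minimal sketch of that sequence (simplified stand-in classes, written as straight-line code to keep the interleaving deterministic; this is not the SDK's actual merge logic):

```python
class Reservoir:
    def borrow_or_take(self, now, can_borrow):
        return "take"

class SamplingRule:
    def __init__(self, name):
        self.name = name
        self.reservoir = Reservoir()
        self.can_borrow = True

# Request thread: rule "foo" is matched and a reference to it is kept.
foo = SamplingRule("foo")
matched_rule = foo

# Poller thread (interleaved here as straight-line code): the merge installs a
# fresh SamplingRule in the cache and nulls out the old rule's reservoir.
new_foo = SamplingRule("foo")
foo.reservoir = None  # matched_rule still points at this mutated object

# Request thread resumes and uses the stale reference:
try:
    matched_rule.reservoir.borrow_or_take(0, matched_rule.can_borrow)
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'borrow_or_take'
```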
To confirm this theory, do you think you'll be able to modify the X-Ray SDK code to put some debug messages in and test with your application?
To confirm this theory, do you think you'll be able to modify the X-Ray SDK code to put some debug messages in and test with your application?
We're seeing this about once a week in production. If you provide a commit with those debug messages we can run them.
↑ @srprash If you provide those debug messages we can run them.
Calling `reservoir.borrow_or_take(now, rule.can_borrow)` leads to an `AttributeError` if `reservoir is None`. See screenshot (Sentry) for details. Looks like a race condition caused by merging rules. We saw this in production, but only once.
https://github.com/aws/aws-xray-sdk-python/blob/2d12c83383ea42e955ad530a7c18832c7881fb32/aws_xray_sdk/core/sampling/rule_cache.py#L47-L50
https://github.com/aws/aws-xray-sdk-python/blob/2d12c83383ea42e955ad530a7c18832c7881fb32/aws_xray_sdk/core/sampling/sampling_rule.py#L71-L80
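For reference, the call that fails is the one shown in the traceback posted in this thread (`decision = reservoir.borrow_or_take(now, rule.can_borrow)` in `sampler.py`). A defensive check along the following lines would avoid crashing the request while the underlying race is fixed; this is only an illustrative sketch, and the fallback behavior (sampling at a plain fixed rate via a hypothetical `fallback_rate` parameter) is an assumption, not the SDK's actual logic:

```python
import random

def safe_sampling_decision(reservoir, rule, now, fallback_rate=0.05):
    """Guarded version of the call that fails in the traceback.

    fallback_rate is a hypothetical knob for this sketch: if the reservoir has
    been nulled out by a concurrent rule merge, sample at a plain fixed rate
    instead of raising AttributeError.
    """
    if reservoir is None:
        return random.random() < fallback_rate
    return reservoir.borrow_or_take(now, rule.can_borrow)
```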