aws / aws-xray-sdk-python

AWS X-Ray SDK for the Python programming language
Apache License 2.0

Unhandled AttributeError if reservoir is None #222

Open michael-k opened 4 years ago

michael-k commented 4 years ago

Calling reservoir.borrow_or_take(now, rule.can_borrow) leads to an AttributeError if reservoir is None. See screenshot (sentry) for details.

(screenshot from Sentry showing the AttributeError stack trace)

Looks like a race condition caused by merging rules. We saw this in production, but only once.

https://github.com/aws/aws-xray-sdk-python/blob/2d12c83383ea42e955ad530a7c18832c7881fb32/aws_xray_sdk/core/sampling/rule_cache.py#L47-L50

https://github.com/aws/aws-xray-sdk-python/blob/2d12c83383ea42e955ad530a7c18832c7881fb32/aws_xray_sdk/core/sampling/sampling_rule.py#L71-L80
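
For anyone without the links open: heavily simplified, the merge step roughly moves the reservoir onto the new rule object and clears it on the old one (a paraphrase, not the exact source), so any code still holding the old rule afterwards sees a reservoir of None:

# Heavily simplified paraphrase of the linked merge step (not the exact SDK source).
class Rule:
    def __init__(self, reservoir):
        self.reservoir = reservoir

def merge(new_rule, old_rule):
    new_rule.reservoir = old_rule.reservoir  # carry sampling state forward
    old_rule.reservoir = None                # old rule object is now unusable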

srprash commented 4 years ago

Hi @michael-k Thanks for bringing this to our notice. Will look into the issue shortly.

srprash commented 4 years ago

Hi @michael-k Not sure if this would happen due to a race condition, since the merge operation uses a lock while setting the reservoir to None. I'm still investigating this.

How often do you see this error? Can you provide some logs as well?

Thanks

michael-k commented 4 years ago

Not sure if this would happen due to a race condition, since the merge operation uses a lock while setting the reservoir to None.

I'm thinking more about this scenario rather than something happening while the merge is in progress:

But I didn't really look into the code.

How often do you see this error?

We've seen it once in the last 30 days. I cannot give you precise overall numbers due to the retention policy of our Sentry plan, but I'd say it didn't happen more than every other month.

Can you provide some logs as well?

What kind of logs are you thinking about? I don't think we have more than the stack trace, but we can try to collect some in the future if that helps.

srprash commented 4 years ago

We've seen it once in the last 30 days. I cannot give you precise overall numbers due to the retention policy of our Sentry plan, but I'd say it didn't happen more than every other month.

I see. I'll try to simulate the scenario and see if it runs into an issue; however, a single error in a month is too infrequent to be conclusive.

What kind of logs are you thinking about? I don't think we have more than the stack trace, but we can try to collect some in the future if that helps.

You can configure the SDK's logging at debug level (although it would increase your log size quite a bit before we see this error). Also, FWIW, please provide the stack trace.
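
For example, something like this at application startup (assuming the SDK's loggers live under the aws_xray_sdk package namespace):

import logging

# Keep the rest of the application at a quieter level...
logging.basicConfig(level=logging.WARNING)
# ...but emit debug output from the X-Ray SDK loggers.
logging.getLogger('aws_xray_sdk').setLevel(logging.DEBUG)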

michael-k commented 4 years ago

please provide the stack trace

It's already in the initial comment. Here's all the information I have:

(screenshot: xray-attribute-error, the Sentry stack trace from the initial report)

alanc10n commented 3 years ago

I've seen this twice in the last 24 hours in an aiohttp app running Python 3.6.12 and X-Ray SDK version 2.6.0, with the following debug logs including the stack trace:

myapp_logs i-0412be596c885e575 2021-02-23 12:43:20,512 DEBUG http://127.0.0.1:2000 "POST /GetSamplingRules HTTP/1.1" 200 1758
myapp_logs i-0412be596c885e575 2021-02-23 12:43:20,513 DEBUG Rule General is selected to make a sampling decision.
myapp_logs i-0412be596c885e575 2021-02-23 12:43:20,514 ERROR Error handling request
Traceback (most recent call last):
  File "/opt/myapp/env/local/lib64/python3.6/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/opt/myapp/env/local/lib64/python3.6/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/opt/myapp/env/local/lib64/python3.6/site-packages/aiohttp/web_middlewares.py", line 119, in impl
    return await handler(request)
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/ext/aiohttp/middleware.py", line 34, in middleware
    sampling_req=sampling_req,
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/ext/util.py", line 65, in calculate_sampling_decision
    decision = recorder.sampler.should_trace(sampling_req)
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/core/sampling/sampler.py", line 72, in should_trace
    return self._process_matched_rule(matched_rule, now)
  File "/opt/myapp/env/local/lib/python3.6/site-packages/aws_xray_sdk/core/sampling/sampler.py", line 104, in _process_matched_rule
    decision = reservoir.borrow_or_take(now, rule.can_borrow)
AttributeError: 'NoneType' object has no attribute 'borrow_or_take'

srprash commented 3 years ago

Hey @alanc10n Apologies for the delayed response. I'm trying to reproduce the issue using aiohttp but haven't had any success yet. I'm not sure why or when the _reservoir for a sampling rule would be uninitialized and None. Are you using centralized (defined in the X-Ray console) or local sampling rules in your application? Have you encountered this error since?

I'll continue to take a look into it and post an update. Thanks.

alanc10n commented 3 years ago

I'm using centralized sampling rules; there are several defined in the console including "General", the rule that appears in the logs immediately prior to the error. Thanks.

autumnjolitz commented 3 years ago

I see this issue in my Sentry as well. It can fault/kill an entire HTTP request or Celery worker because of an unhandled AttributeError.

I am using defaults (centralized sampling rules, only the default catch-all rule for x-ray).

srprash commented 3 years ago

Hi @alanc10n @autumnjolitz, is it possible for you to provide a simple repro that can consistently reproduce this issue? I am trying to isolate the threading logic in the SDK and experiment with some configurations to reproduce it, but so far I have been unable to. Thanks.

alanc10n commented 3 years ago

@srprash I'll try to get a simple case that can reproduce this, but given how infrequently it happens I'm not optimistic. I'll post an update here if it works.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.

autumnjolitz commented 2 years ago

Still seeing this issue. It happens about once a day, sometimes twice. Are you sure you've got the rule-merging code down pat? I suspect you're setting the reservoir to None before setting it to something (or a property is acting to that effect). Python can switch threads between bytecode instructions, so the only safe options are either to protect a thread-shared variable with a lock or to do only a single assignment that swaps a fully initialized reservoir for a new fully initialized one (one operation).
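
To illustrate what I mean, here's a toy sketch (not the SDK's actual code) of the unsafe two-step update versus the two safe alternatives:

import threading

_lock = threading.Lock()

def refresh_unsafe(rule, new_reservoir):
    # Another thread can observe rule.reservoir as None between these two statements.
    rule.reservoir = None
    rule.reservoir = new_reservoir

def refresh_with_lock(rule, new_reservoir):
    # Safe, provided every reader also takes the same lock before touching rule.reservoir.
    with _lock:
        rule.reservoir = new_reservoir

def refresh_by_swap(cache, name, new_rule):
    # Safe without a lock: build the replacement rule completely, then rebind a single
    # reference. Readers only ever see the old object or the new one, never a half state,
    # and a reader still holding the old object keeps a working reservoir.
    cache[name] = new_rule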

srprash commented 2 years ago

Hi @autumnjolitz Thanks for the suggestions. I haven't been able to reproduce this issue, but looking again at the RuleCache class's get_matched_rule and _load_rules methods, I have a theory that points to a race condition causing this issue.

The issue may be happening when rule matching and rule refreshing coincide. Suppose a request comes in and sampling rule Foo from the current RuleCache is matched to sample the request. At the same time, the RulePoller kicks off its periodic fetch-and-merge of sampling rules from the X-Ray backend; during this merge it updates the RuleCache with new SamplingRule objects and sets the old rule's (Foo in this case) reservoir to None. However, the rule-matching logic still holds a reference to the old rule object, whose reservoir has now been set to None, and this causes the AttributeError.
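
A stripped-down illustration of that interleaving, independent of the SDK code (all names here are stand-ins), would be:

import threading
import time

class Reservoir:
    def borrow_or_take(self, now, can_borrow):
        return 'SAMPLE'

class Rule:
    def __init__(self):
        self.reservoir = Reservoir()

matched_rule = Rule()  # stands in for the rule returned by rule matching

def sampler():
    rule = matched_rule          # reference taken before the refresh
    time.sleep(0.1)              # the refresh happens in this window
    # AttributeError: 'NoneType' object has no attribute 'borrow_or_take'
    rule.reservoir.borrow_or_take(0, True)

def poller():
    # Simulates the merge during rule refresh clearing the old rule's reservoir.
    matched_rule.reservoir = None

t1, t2 = threading.Thread(target=sampler), threading.Thread(target=poller)
t1.start(); t2.start()
t1.join(); t2.join()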

I don't yet know the reason behind setting the old rule's reservoir to None; maybe I can start by removing that line. Also, I think the self._rules property of the RuleCache should be protected with a lock.
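
Roughly, both ideas together would look something like this simplified stand-in (not a patch against the actual RuleCache; match() and name are placeholders):

import threading

class SimplifiedRuleCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._rules = {}

    def get_matched_rule(self, sampling_req):
        # Snapshot under the lock so a concurrent refresh cannot swap the
        # rules out from under the matching loop.
        with self._lock:
            rules = list(self._rules.values())
        for rule in rules:
            if rule.match(sampling_req):
                return rule
        return None

    def load_rules(self, new_rules):
        # Replace the mapping wholesale instead of mutating old rule objects,
        # i.e. never set an old rule's reservoir to None; a request that already
        # matched an old rule keeps using that still-valid object.
        with self._lock:
            self._rules = {rule.name: rule for rule in new_rules}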

To confirm this theory, do you think you'll be able to modify the X-Ray SDK code to add some debug messages and test with your application?

michael-k commented 2 years ago

To confirm this theory, do you think you'll be able to modify the X-Ray SDK code to add some debug messages and test with your application?

We're seeing this about once a week in production. If you provide a commit with those debug messages, we can run them.

michael-k commented 1 year ago

↑ @srprash If you provide those debug messages we can run them.