getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Sentry doesn't report errors in the traces_sampler function #2582

Open brettdh opened 10 months ago

brettdh commented 10 months ago

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.38.0

Steps to Reproduce

  1. Call sentry_sdk.init with the traces_sampler= kwarg set to a function that raises an exception.
  2. Perform a request (in my case: HTTP request to Django app running in AWS Lambda)

Expected Result

Error in sampling function is reported to Sentry

Actual Result

Unhandled error occurs (found in logs), but no Sentry issue is created.

TypeError: expected string or bytes-like object

  File "/var/task/sentry_sdk/integrations/aws_lambda.py", line 151, in sentry_handler
    with hub.start_transaction(
  File "/var/task/sentry_sdk/hub.py", line 572, in start_transaction
    transaction._set_initial_sampling_decision(sampling_context=sampling_context)
  File "/var/task/sentry_sdk/tracing.py", line 841, in _set_initial_sampling_decision
    options["traces_sampler"](sampling_context)
  File "/var/task/myapp/config/settings/base.py", line 263, in my_traces_sampler
    if metrics_api_rx.search(path):

where path was None.
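For context, here is a minimal sketch of a sampler that fails in exactly this way (this is not my actual settings code; the regex name, the sample rates, and the wsgi_environ lookup are assumptions). When the transaction is started by the AWS Lambda integration rather than the WSGI middleware, the expected key may be absent from the sampling context, so path comes back as None and re.search() raises the TypeError above.

import re

metrics_api_rx = re.compile(r"^/api/metrics")  # hypothetical pattern

def my_traces_sampler(sampling_context):
    # For plain WSGI/Django requests the path usually comes from
    # sampling_context["wsgi_environ"]; when the transaction is started by
    # the Lambda integration, that key may be missing, so path ends up None.
    environ = sampling_context.get("wsgi_environ") or {}
    path = environ.get("PATH_INFO")
    if metrics_api_rx.search(path):  # TypeError when path is None
        return 0.01  # sample the noisy metrics endpoint at a low rate
    return 0.2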

antonpirker commented 10 months ago

Hey @brettdh thanks for reporting this. We will have a look.

sl0thentr0py commented 10 months ago

by design, we do not report exceptions in SDK code. You should see it in the logs if you add debug=True.
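For reference, a minimal sketch of what that looks like (the DSN and the sampler here are placeholders): with debug=True the SDK writes its internal diagnostic output to stderr via the sentry_sdk.errors logger, which is where these errors should show up in e.g. CloudWatch, even though no Sentry issue is created for them.

import sentry_sdk

def my_traces_sampler(sampling_context):  # placeholder user sampler
    return 0.2

sentry_sdk.init(
    dsn="https://public@example.ingest.sentry.io/0",  # placeholder DSN
    traces_sampler=my_traces_sampler,
    debug=True,  # SDK diagnostics are written to stderr
)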

brettdh commented 10 months ago

@sl0thentr0py can you elaborate on "by design"? It seems like an observability gap; i.e. an exception in the sampling function can quietly cause all requests to fail.

Further, the sampling function was a bit hard to write because the context object format is largely undocumented. That makes me all the more nervous that I may have missed some corner case, which is why I wanted to make sure I still have error reporting for the sampling function (which is my code, not SDK code).

sl0thentr0py commented 10 months ago

For one, if we self-report, we can end up in a busy loop of reporting. We can make the experience a bit better here and maybe use logger.warning or logger.error so that it's clear to users.
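In the meantime, one user-side way to get that clarity without risking the loop is to wrap the sampler yourself. A rough sketch, where the inner sampler, the log message, and the fallback rate are all just illustrative:

import logging
import sentry_sdk

logger = logging.getLogger(__name__)

def my_traces_sampler(sampling_context):
    # stand-in for a real (potentially buggy) sampler
    return 0.2

def safe_traces_sampler(sampling_context):
    # Log instead of capturing to Sentry, so a bug in the sampler cannot
    # trigger a self-reporting loop; a fixed fallback keeps tracing alive.
    try:
        return my_traces_sampler(sampling_context)
    except Exception:
        logger.exception("traces_sampler raised, falling back to a 0.1 rate")
        return 0.1

sentry_sdk.init(traces_sampler=safe_traces_sampler)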

sl0thentr0py commented 10 months ago

ok actually nvm i just tried it out and I did get an issue in sentry

import sentry_sdk

def fail(_context):
    # a traces_sampler that always raises
    1 / 0

sentry_sdk.init(debug=True, traces_sampler=fail)

with sentry_sdk.start_transaction(name="traces sampler fail"):
    pass

sentry_sdk.flush()
(screenshot: the resulting issue in Sentry)
brettdh commented 10 months ago

Hmm. I got bit by this again today - due to a real ZeroDivisionError, coincidentally - and I didn't notice until I dug into logs.

Oddly, I did see a seemingly spurious ServerlessTimeoutWarning in Sentry, and following the CloudWatch logs linked from that error is how I found the ZeroDivisionError:

  | 2023-12-14T19:30:49.641Z | START RequestId: d5d126c4-8cf3-4ae3-b931-14a7167dbea4 Version: $LATEST
  | 2023-12-14T19:30:49.645Z | Exception in thread Thread-4169:
  | 2023-12-14T19:30:49.645Z | Traceback (most recent call last):
  | 2023-12-14T19:30:49.645Z |   File "/var/lang/lib/python3.9/threading.py", line 980, in _bootstrap_inner
  | 2023-12-14T19:30:49.645Z |     self.run()
  | 2023-12-14T19:30:49.645Z |   File "/var/task/sentry_sdk/integrations/threading.py", line 72, in run
  | 2023-12-14T19:30:49.645Z |     reraise(*_capture_exception())
  | 2023-12-14T19:30:49.645Z |   File "/var/task/sentry_sdk/_compat.py", line 115, in reraise
  | 2023-12-14T19:30:49.645Z |     raise value
  | 2023-12-14T19:30:49.645Z |   File "/var/task/sentry_sdk/integrations/threading.py", line 70, in run
  | 2023-12-14T19:30:49.645Z |     return old_run_func(self, *a, **kw)
  | 2023-12-14T19:30:49.645Z |   File "/var/task/sentry_sdk/utils.py", line 1392, in run
  | 2023-12-14T19:30:49.645Z |     raise ServerlessTimeoutWarning(
  | 2023-12-14T19:30:49.646Z | sentry_sdk.utils.ServerlessTimeoutWarning: WARNING : Function is expected to get timed out. Configured timeout duration = 29 seconds.
  | 2023-12-14T19:30:49.649Z | [ERROR] ZeroDivisionError: float division by zero 
  Traceback (most recent call last):
   File "/var/task/sentry_sdk/integrations/aws_lambda.py", line 151, in sentry_handler
     with hub.start_transaction(
   File "/var/task/sentry_sdk/hub.py", line 572, in start_transaction
     transaction._set_initial_sampling_decision(sampling_context=sampling_context)
   File "/var/task/sentry_sdk/tracing.py", line 841, in _set_initial_sampling_decision
     options["traces_sampler"](sampling_context)
   File "/var/task/myproj/config/settings/base.py", line 340, in traces_sampler
     rate = 1.0 / Host.objects.filter(up=True).count()
  | 2023-12-14T19:30:49.677Z | END RequestId: d5d126c4-8cf3-4ae3-b931-14a7167dbea4
  | 2023-12-14T19:30:49.677Z | REPORT RequestId: d5d126c4-8cf3-4ae3-b931-14a7167dbea4 Duration: 35.35 ms Billed Duration: 36 ms Memory Size: 1024 MB Max Memory Used: 248 MB

I say the error is spurious because, as you can see in the logs, the Lambda execution took only about 36 ms, not 29 seconds.

sentrivana commented 10 months ago

We should properly define the expected behavior when a custom traces_sampler (or before_send{_transaction}, or error_sampler, or any other hook) throws an exception. Currently we usually log the error, but apparently in some cases the exception also gets intercepted by excepthook and makes it to Sentry. If that's how it should be, then this is a bug in the AWS Lambda integration since it apparently doesn't do that.

FWIW I do think we should make this sort of error easier to discover, or at least make it clearer how to notice it. We've had some cases recently where folks were confused about why no transactions were appearing in Sentry, and it took a long time to trace it back to a faulty before_send_transaction. Internal SDK exceptions should never affect the user's app, but these custom hooks are not SDK code.
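Until we settle that behavior, one stopgap users could apply themselves is a small wrapper around every hook that logs the failure, reports the exception to Sentry once (guarding against recursion, since capturing may itself run hooks), and returns a fallback value. A rough sketch, with all names illustrative rather than SDK API:

import functools
import logging
import sentry_sdk

logger = logging.getLogger(__name__)

def report_hook_errors(fallback):
    # Decorator for user-supplied hooks (traces_sampler, before_send_transaction, ...).
    def decorator(hook):
        reporting = False

        @functools.wraps(hook)
        def wrapper(*args, **kwargs):
            nonlocal reporting
            try:
                return hook(*args, **kwargs)
            except Exception:
                logger.exception("hook %s raised", hook.__name__)
                if not reporting:  # avoid re-entering if capturing runs hooks again
                    reporting = True
                    try:
                        sentry_sdk.capture_exception()
                    finally:
                        reporting = False
                return fallback
        return wrapper
    return decorator

# Illustrative usage (returning None from before_send_transaction drops that transaction):
# sentry_sdk.init(
#     traces_sampler=report_hook_errors(fallback=0.0)(my_traces_sampler),
#     before_send_transaction=report_hook_errors(fallback=None)(my_before_send),
# )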