getsentry / relay

Sentry event forwarding and ingestion service.
https://docs.sentry.io/product/relay/

Low volume events rarely get through relay onto sentry - seeing odd errors #3873

Closed chipzzz closed 1 month ago

chipzzz commented 1 month ago

I'm trying to debug low-volume ingestion through relay > sentry, and hence missing data in Sentry. Every time I push some data in, e.g. using the Python SDK, I rarely get any events in Sentry. I'm just doing a basic sentry_sdk.init:

import time

import sentry_sdk

sentry_sdk.init(
    dsn='https://<public_key>@dev.sentry/3',
    debug=True,
    enable_tracing=True,
)

def main():
    time.sleep(1)
    raise Exception("error")

main()  # the uncaught exception is reported by the SDK's global excepthook

Sentry Version (helm chart): 24.5.1

Relay Version: getsentry/relay:24.5.1

Current config (I also tried a basic config without these changes):

                 logging:
                   level: trace
                   log_failed_payloads: true
                 metrics:
                   statsd: sentry-metrics:9125
                   prefix: relay
                   hostname_tag: true
                 http:
                   timeout: 30
                   connection_timeout: 10
                 limits:
                   max_concurrent_requests: 1000
                   max_concurrent_queries: 100
                   max_event_size: 10MB
                   max_session_count: 1000

I get the following error:

2024-07-26T20:04:00.330017Z DEBUG request{method=POST uri=/api/3/envelope/ version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2024-07-26T20:04:00.330391Z DEBUG request{method=POST uri=/api/3/envelope/ version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
2024-07-26T20:04:00.330458Z DEBUG relay_server::envelope: failed to parse sampling context error=invalid type: floating point `1`, expected a string
2024-07-26T20:04:00.330470Z DEBUG relay_server::envelope: failed to parse sampling context error=invalid type: floating point `1`, expected a string
2024-07-26T20:04:00.330540Z TRACE relay_server::services::processor: Processing error group
2024-07-26T20:04:00.330579Z TRACE relay_server::services::processor::event: processing json event
2024-07-26T20:04:00.330865Z DEBUG relay_server::envelope: failed to parse sampling context error=invalid type: floating point `1`, expected a string
2024-07-26T20:04:00.330883Z DEBUG relay_server::envelope: failed to parse sampling context error=invalid type: floating point `1`, expected a string
2024-07-26T20:04:00.331238Z DEBUG relay_server::envelope: failed to parse sampling context error=invalid type: floating point `1`, expected a string
2024-07-26T20:04:00.331636Z TRACE relay_server::services::processor: sending envelope to kafka
2024-07-26T20:04:00.331697Z TRACE relay_server::services::store: Sending kafka message of type event

When not sending any data I keep getting rate limit errors, which I originally thought were related; I'm not so sure anymore.

2024-07-26T20:09:00.127163Z DEBUG relay_server::utils::managed_envelope: dropped envelope: rate limited
2024-07-26T20:09:00.127107Z TRACE relay_server::services::processor: Processing check_in group
2024-07-26T20:09:00.127124Z TRACE relay_server::services::processor: Processing client_report group
2024-07-26T20:09:00.127163Z DEBUG relay_server::utils::managed_envelope: dropped envelope: rate limited
2024-07-26T20:09:00.127206Z TRACE relay_server::services::cogs: recording measurement: CogsMeasurement { resource: Relay, feature: ClientReports, value: Time(61.304µs) }

Also, when not sending in any data I see the following:

2024-07-26T20:11:01.618075Z DEBUG relay_server::envelope: failed to parse sampling context error=missing field `public_key`
2024-07-26T20:11:01.614641Z DEBUG request{method=POST uri=/api/1/envelope/ version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2024-07-26T20:11:01.615736Z DEBUG request{method=POST uri=/api/1/envelope/ version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=1 ms status=200
2024-07-26T20:11:01.615869Z DEBUG relay_server::envelope: failed to parse sampling context error=missing field `public_key`
2024-07-26T20:11:01.615890Z DEBUG relay_server::envelope: failed to parse sampling context error=missing field `public_key`
2024-07-26T20:11:01.616099Z TRACE relay_server::services::processor: Processing error group
2024-07-26T20:11:01.616121Z TRACE relay_server::services::processor::event: processing json event
2024-07-26T20:11:01.616116Z TRACE relay_server::services::processor: Processing client_report group
2024-07-26T20:11:01.616185Z DEBUG relay_server::utils::managed_envelope: dropped envelope: rate limited
2024-07-26T20:11:01.616248Z TRACE relay_server::services::cogs: recording measurement: CogsMeasurement { resource: Relay, feature: ClientReports, value: Time(205.815µs) }
2024-07-26T20:11:01.617350Z DEBUG relay_server::envelope: failed to parse sampling context error=missing field `public_key`
2024-07-26T20:11:01.617369Z DEBUG relay_server::envelope: failed to parse sampling context error=missing field `public_key`
2024-07-26T20:11:01.618075Z DEBUG relay_server::envelope: failed to parse sampling context error=missing field `public_key`
2024-07-26T20:11:01.623078Z TRACE relay_server::services::processor: sending envelope to kafka
2024-07-26T20:11:01.623187Z TRACE relay_server::services::store: Sending kafka message of type event

I'm trying to understand whether these errors are something I should be looking into, as I'm trying to diagnose why I can't see events come through to Sentry, or why they only come through very rarely at low volume.

I can't pinpoint whether it's rate limiting, sampling, or something else.

In a high-volume environment, events seem to come through fine; however, I'm still seeing constant dropped envelope: rate limited messages, which makes me believe rate limiting is happening. I have rate limiting set to unlimited and I am not sampling on the client side. I'm trying to understand what is going on: why I rarely see events come through Sentry at low volume, and whether I'm actually rate limiting events at high volume (in my other environment).

Also, is dynamic sampling playing a role here? Is it enabled in self-hosted Sentry? I don't see the settings for it as specified by these docs.

Is it possible that dynamic sampling is sampling or rate limiting data here? Shouldn't sample_rate: 1.0 prevent that if set on client side? It makes me believe there may be some dynamic sampling functionality happening; however, since I use self-hosted, its control features are not fully plugged in.

Is it possible to fully disable dynamic sampling for those who want all events to come through and are willing to provide the resources required for that kind of volume?

How does redis play a role in sampling?

Thank you

Sebastian

jjbayer commented 1 month ago

@chipzzz thank you for the detailed report!

Also, when not sending in any data I see the following:

From the logs, it looks like Relay does receive data. Is there another source sending events to the same Relay instance which might cause the rate limit to be hit?

I can't pinpoint whether it's rate limiting, sampling, or something else.

Given the rate limiting logs I suspect it's rate limiting. You can check this by looking at stats (this page should work in self-hosted as well).

Also, is dynamic sampling playing a role here? Is it enabled in self-hosted Sentry? I don't see the settings for it as specified by these docs.

No, dynamic sampling applies to transaction events, not to error events.

Shouldn't sample_rate: 1.0 prevent that if set on client side?

No, the client-side sample rate only determines client side sampling, not server-side sampling.
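
For illustration, the client-side knobs in the Python SDK look roughly like this (a minimal sketch; sample_rate applies to error events, traces_sample_rate to transactions, and both only control what the SDK sends, not what the server keeps):

import sentry_sdk

sentry_sdk.init(
    dsn='https://<public_key>@dev.sentry/3',
    # Client-side only: 1.0 means the SDK sends every error event it captures.
    sample_rate=1.0,
    # Client-side transaction sampling; server-side dynamic sampling is separate.
    traces_sample_rate=1.0,
)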

Is it possible to fully disable dynamic sampling for those who want all events to come through and are willing to provide the resources required for that kind of volume?

Yes, that should be possible by disabling organizations:dynamic-sampling. But my hunch is that DS is not the problem here.
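
As a rough sketch, assuming your deployment reads feature flags from sentry.conf.py (the helm chart may expose this differently, and the flag's availability depends on your Sentry version), disabling it would look like:

# sentry.conf.py (self-hosted) -- assumption: this flag exists and is honored
# by your Sentry version; not a verified setting for the helm chart.
SENTRY_FEATURES["organizations:dynamic-sampling"] = False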

How does redis play a role in sampling?

We use redis to propagate project configuration to relay, and redis counters to enforce rate limits consistently. For dynamic sampling, we only use it for reservoir sampling.

chipzzz commented 1 month ago

@jjbayer, thanks for all this detailed info; I will be looking through it. Just one question at this time: if project rate limiting is not set, the only other possible rate limiting is set through system.rate-limit, correct? Are there any others?

I originally had this set to 0; I'm now playing around with some numbers, but it doesn't really make the rate limiting events go away.

The stats page does not show anything being dropped. I've noticed in other issues that the stats page may not always be accurate.

Besides mine, there's nothing else being sent in this environment. I suspect it may be some internal events being sent; I will need to verify.

chipzzz commented 1 month ago

I've also started looking at the SDK side; some reported issues are relevant to mine:

https://github.com/getsentry/sentry-java/issues/3494 https://github.com/getsentry/sentry-python/issues/2617

chipzzz commented 1 month ago

Also, it looks like organizations:dynamic-sampling is not present as a feature in my helm chart self-hosted Sentry, so I think I can safely ignore it. I also saw that an attempt was made to implement it in self-hosted, but there were missing components.

jjbayer commented 1 month ago

@chipzzz

if project rate limiting is not set, the only other possible rate limiting is set through system.rate-limit, correct? Are there any others?

There's a bunch of sources for rate limits. With superuser access you should be able to check

http://{YOUR_SENTRY_INSTANCE}/api/0/internal/project-config/?projectId={YOUR_PROJECT_ID}

and look for "quotas" to see which ones are active for your project.
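
If it helps, here is a rough sketch of pulling that out with Python (assumptions: requests is available and the internal endpoint accepts a superuser auth token; a logged-in superuser browser session works as well):

import json

import requests

SENTRY_URL = "http://{YOUR_SENTRY_INSTANCE}"   # placeholder, as above
PROJECT_ID = "{YOUR_PROJECT_ID}"               # placeholder, as above
TOKEN = "<superuser auth token>"               # assumption: token auth is accepted here

resp = requests.get(
    f"{SENTRY_URL}/api/0/internal/project-config/",
    params={"projectId": PROJECT_ID},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Pretty-print the returned project config and search it for "quotas";
# the entries listed there are the rate limits active for the project.
print(json.dumps(resp.json(), indent=2))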

chipzzz commented 1 month ago

@jjbayer, I don't think I can access that due to https://github.com/getsentry/sentry-docs/issues/7778?

Any other way to access this info? Where would I find quotas?

chipzzz commented 1 month ago

Created https://github.com/getsentry/self-hosted/issues/3235 as I have more findings: the accumulated event count under "Since issue began" is correct, yet "Total in last 14 days" is incorrect, and when viewing the Discover timechart as well as the events list, there are events missing. Events also seem to be getting overwritten, as past events are no longer present while they were before.

chipzzz commented 1 month ago

This was caused by Snuba's helm chart default setting of single_node: true for ClickHouse, while ClickHouse itself is deployed as a distributed cluster by default.

Thus, when Snuba was querying and inserting data into ClickHouse, it was under the impression that it was a single-node deployment.

This was not apparent at high volume.

https://github.com/sentry-kubernetes/charts/blob/develop/charts/sentry/values.yaml#L1782