getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Transactions, replays, attachments and profiles are not visible in the stats section #55786

Open kadet1090 opened 1 year ago

kadet1090 commented 1 year ago

Self-Hosted Version

23.7.2

CPU Architecture

x86_64

Docker Version

Not applicable (k8s cluster)

Docker Compose Version

Not applicable (k8s cluster)

Steps to Reproduce

Yes, I know I've got an unsupported setup. I'm not asking you to solve my particular case, but rather for information on where to begin troubleshooting, which should be useful to anyone with a similar problem.

I do not know how to reproduce this; it is probably a very specific issue, so I'm looking for guidance on where to look. I'd gladly attach any logs needed, I just need to know which container I should be looking at :)

Expected Result

Stats are counted for all event types.

Actual Result

Everything in my setup works fine except for the stats: only errors are collected and counted correctly. Other metrics, such as transactions or profiles, are not counted and are always displayed as 0. Transactions are visible in "Performance" and can be seen in dashboards and "Discover"; the only place claiming I have no transactions is the Stats page. Where should I begin looking for the issue?

AFAIR it stopped working after I updated to 23.7.0; before that it was fine.

So the question is: how are stats counted, so that I can pinpoint the issue? I'll describe the solution here for others once it is found.

(Screenshots: Stats page showing 0 for transactions and other non-error categories.)

Event ID

No response

zetaab commented 1 year ago

We are facing the same issue currently. Anyway, the organization stats API endpoint is https://github.com/getsentry/sentry/blob/master/src/sentry/api/endpoints/organization_stats_v2.py#L134 and it talks about "outcomes".

Also https://github.com/getsentry/snuba/blob/master/docs/source/architecture/overview.rst#sessions-and-outcomes says

while Outcomes mainly provide data to the Sentry stats page. Both pipelines have their own Kafka topic, Kafka consumer and they write on their own table in Clickhouse.

Going to ClickHouse and checking the stats:

select * from outcomes_hourly_dist order by timestamp desc limit 1000;
select * from outcomes_raw_dist order by timestamp desc limit 1000;

which clearly shows that stats are not flowing in for some reason.
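For anyone reading these tables by hand, the numeric `outcome` column needs decoding. A minimal sketch, assuming the values still match Sentry's `Outcome` enum (`src/sentry/utils/outcomes.py` at the time of writing; verify against your Sentry version):

```python
# Hedged sketch: decode the numeric `outcome` column seen in
# outcomes_raw_dist / outcomes_hourly_dist. Values assumed to mirror
# Sentry's Outcome enum -- check src/sentry/utils/outcomes.py for
# your version before relying on this mapping.
OUTCOMES = {
    0: "accepted",
    1: "filtered",
    2: "rate_limited",
    3: "invalid",
    4: "abuse",
    5: "client_discard",
}

def describe_outcome(code: int) -> str:
    """Human-readable name for an outcome code from ClickHouse."""
    return OUTCOMES.get(code, f"unknown({code})")
```

With this, `outcome = 0` rows are accepted events and `outcome = 5` with `reason = sample_rate` are client-side discards from SDK sampling.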

zetaab commented 1 year ago

Checking the outcomes topic from Kafka.

Start a container which has Kafka:

kubectl run kafka --image=quay.io/strimzi/kafka:0.34.0-kafka-3.4.0 -i --tty --rm -- sh

First, get the max offset (and partitions) for the topic:

bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <broker>:9092 --topic outcomes

in my case the result is

sh-4.4$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <broker>:9092 --topic outcomes
outcomes:0:7409576

so the max offset is 7409576

If I check the content using bin/kafka-console-consumer.sh --bootstrap-server <broker>:9092 --topic outcomes --offset 7409576 --partition 0, I can see that there is newer content.

So what is wrong here? Let's check the logs from snuba-outcomes-consumer:

% kubectl logs sentry-snuba-outcomes-consumer-85d596c668-r4875
2023-09-06 05:11:10,714 Initializing Snuba...
2023-09-06 05:11:26,316 Snuba initialization took 15.598081367090344s
2023-09-06 05:11:26,329 Consumer Starting
2023-09-06 05:11:26,330 librdkafka log level: 6
2023-09-06 05:11:27,261 New partitions assigned: {Partition(topic=Topic(name='outcomes'), index=0): 7236764}
2023-09-06 05:31:53,612 Partitions to revoke: [Partition(topic=Topic(name='outcomes'), index=0)]
2023-09-06 05:31:53,613 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fc1f54ce2b0>...
2023-09-06 05:31:53,615 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fc1f54ce2b0> to exit...
2023-09-06 05:31:58,616 Timed out with 2 futures in queue
2023-09-06 05:31:58,617 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fc1f54ce2b0> exited successfully, releasing assignment.
2023-09-06 05:31:58,618 Partition revocation complete.
2023-09-06 05:31:58,625 New partitions assigned: {Partition(topic=Topic(name='outcomes'), index=0): 7247390}
2023-09-06 05:32:14,649 Partitions to revoke: [Partition(topic=Topic(name='outcomes'), index=0)]
2023-09-06 05:32:14,650 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fc1fba0b310>...
2023-09-06 05:32:14,650 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fc1fba0b310> to exit...
2023-09-06 05:32:19,710 Timed out with 2 futures in queue
2023-09-06 05:32:19,718 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fc1fba0b310> exited successfully, releasing assignment.
2023-09-06 05:32:19,719 Partition revocation complete.
2023-09-06 05:32:19,727 New partitions assigned: {Partition(topic=Topic(name='outcomes'), index=0): 7247519}

As we can clearly see, snuba-outcomes-consumer is at offset 7247519, which is about 160,000 messages behind the current high watermark of 7409576. So for some reason snuba-outcomes-consumer is stuck.
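The lag per partition can be computed directly from the two numbers above (broker high watermark from GetOffsetShell, assigned offset from the consumer logs); a trivial sketch:

```python
# Minimal sketch: estimate per-partition consumer lag from the broker
# high watermark and the offset the consumer was assigned in its logs.
# The numbers below are the ones observed in this thread.
def consumer_lag(high_watermark: int, committed_offset: int) -> int:
    """Messages the consumer still has to process on one partition."""
    return max(high_watermark - committed_offset, 0)

lag = consumer_lag(7409576, 7247519)
print(lag)  # prints 162057
```

A steadily growing value here, while the topic keeps receiving messages, is what "stuck consumer" looks like from the outside.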

When starting the outcomes-consumer component with --log-level DEBUG, I can see the following:

2023-09-06 05:49:39,042 Caught MessageRejected() while submitting BrokerValue(_BrokerValue__payload=KafkaPayload(key=None, value=b'{"timestamp":"2023-09-05T15:32:49.000000Z","org_id":5,"project_id":89,"key_id":93,"outcome":0,"reason":null,"event_id":"9fa05181447e4b088a29abc2bd045e8c","category":9,"quantity":1}', headers=[]), partition=Partition(topic=Topic(name='outcomes'), index=0), offset=7256468, timestamp=datetime.datetime(2023, 9, 5, 18, 53, 45, 417000)), pausing consumer...

So for some reason messages are being rejected somewhere in the pipeline (Kafka disk usage, currently at 88%, might be one reason).
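For context, `MessageRejected` is arroyo's backpressure signal: a downstream processing step refuses the message, so the consumer pauses and retries rather than dropping data. An illustrative toy sketch of that pause-and-retry behavior (not arroyo's actual implementation, just the idea):

```python
# Toy sketch of MessageRejected-style backpressure: a bounded stage
# rejects submissions when its buffer is full, so the caller (the
# Kafka consumer) must pause and resubmit once downstream drains.
from collections import deque


class MessageRejected(Exception):
    """Raised when a downstream stage cannot accept more messages."""


class BoundedStage:
    def __init__(self, capacity: int):
        self.buffer: deque = deque()
        self.capacity = capacity

    def submit(self, msg) -> None:
        if len(self.buffer) >= self.capacity:
            # e.g. the ClickHouse writer is too slow to keep up
            raise MessageRejected()
        self.buffer.append(msg)

    def drain_one(self):
        return self.buffer.popleft()


stage = BoundedStage(capacity=2)
stage.submit("m1")
stage.submit("m2")
paused = False
try:
    stage.submit("m3")   # buffer full: the consumer would pause here
except MessageRejected:
    paused = True
stage.drain_one()        # downstream catches up...
stage.submit("m3")       # ...and the retried submit succeeds
```

The upshot: the consumer log line above does not mean messages are lost, only that offsets stop advancing until whatever is downstream (here, apparently disk pressure) recovers.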

zetaab commented 1 year ago

Now we have more disk in Kafka. It looks like the outcomes consumer started working through the backlog, and ClickHouse is receiving new rows.

kadet1090 commented 1 year ago

I've checked these services in my setup and I don't see any lag; the offsets also seem fine, but there are still no stats. So our problems are probably different in nature. Thanks for the starting point, I'll look into it, and if I find anything useful I'll add another comment.

hubertdeng123 commented 1 year ago

I'm not too familiar with outcomes myself, but I can redirect this to the particular team that may have more information here.

getsantry[bot] commented 1 year ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-stats for triage ⏲️

Dhrumil-Sentry commented 1 year ago

@hubertdeng123 This likely has something to do with how self-hosted Sentry is deployed. We can add it to the backlog for investigation later.

hubertdeng123 commented 1 year ago

Thanks @Dhrumil-Sentry! I agree; they're using an unsupported setup, so something is going wrong with how it is deployed. They'd like to know more of the technical details around what may be going wrong. It looks like the Stats page pulls directly from Snuba, so if there is no information there, that's the main issue.

hubertdeng123 commented 1 year ago

Just to confirm, @zetaab, are you using self-hosted Sentry with Docker Compose? And is Stats working fine for you now?

zetaab commented 1 year ago

We are using kubernetes https://github.com/sentry-kubernetes/charts/tree/develop/sentry

And stats are working

kadet1090 commented 1 year ago

My setup is based on the official docker-compose.yaml, which I simply "translated" to k8s manifests (I've also looked at the chart linked above, but it has more infrastructure than my org requires). So the things that are actually different are storage and networking, but both of those are probably not the cause, as everything else works. Something very strange to me is that in my setup Errors are counted correctly, and Replays also seem to work now.

As for the logs:

kadet1090 commented 1 year ago

Okay, to iterate further: I've got data in the outcomes_raw_dist table. I've looked up the categories in https://github.com/getsentry/relay/blob/509fcdddf93132afd1250b83e375dceee038e4db/relay-base-schema/src/data_category.rs#L15
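For reference, a sketch of the category codes from that file (taken from the linked revision of `data_category.rs`; the enum grows over time, so verify against your Relay version):

```python
# Hedged sketch of relay's DataCategory codes as seen in the
# `category` column of the outcomes tables. Assumed to match the
# linked revision of relay-base-schema/src/data_category.rs --
# newer Relay versions add more variants, so double-check.
DATA_CATEGORIES = {
    0: "default",
    1: "error",
    2: "transaction",
    3: "security",
    4: "attachment",
    5: "session",
    6: "profile",
    7: "replay",
    8: "transaction_processed",
    9: "transaction_indexed",
}
```

So in the queries below, `category = 2` rows are transactions and `category = 9` rows are indexed transactions.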

sentry-clickhouse :) SELECT * FROM default.outcomes_raw_dist WHERE category = 2 ORDER BY timestamp desc LIMIT 10;

SELECT *
FROM default.outcomes_raw_dist
WHERE category = 2
ORDER BY timestamp DESC
LIMIT 10

Query id: 5c4bd723-95eb-4ada-820b-a9aefd286619

┌─org_id─┬─project_id─┬─key_id─┬───────────timestamp─┬─outcome─┬─reason──────┬─event_id─┬─quantity─┬─category─┬─size─┐
│      1 │          1 │      1 │ 2023-09-07 17:33:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        5 │        2 │    0 │
└────────┴────────────┴────────┴─────────────────────┴─────────┴─────────────┴──────────┴──────────┴──────────┴──────┘
┌─org_id─┬─project_id─┬─key_id─┬───────────timestamp─┬─outcome─┬─reason──────┬─event_id─┬─quantity─┬─category─┬─size─┐
│      1 │          4 │      4 │ 2023-09-07 10:59:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        1 │        2 │    0 │
│      1 │          1 │      1 │ 2023-09-07 10:56:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        3 │        2 │    0 │
│      1 │          4 │      4 │ 2023-09-07 10:54:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        4 │        2 │    0 │
│      1 │          4 │      4 │ 2023-09-07 10:53:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        1 │        2 │    0 │
│      1 │          4 │      4 │ 2023-09-07 10:52:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        2 │        2 │    0 │
│      1 │          4 │      4 │ 2023-09-07 10:51:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        1 │        2 │    0 │
│      1 │          1 │      1 │ 2023-09-07 05:22:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        3 │        2 │    0 │
│      1 │          1 │      1 │ 2023-09-06 07:48:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        2 │        2 │    0 │
│      1 │          1 │      1 │ 2023-09-06 07:46:00 │       5 │ sample_rate │ ᴺᵁᴸᴸ     │        2 │        2 │    0 │
└────────┴────────────┴────────┴─────────────────────┴─────────┴─────────────┴──────────┴──────────┴──────────┴──────┘

10 rows in set. Elapsed: 0.028 sec. Processed 39.25 thousand rows, 2.24 MB (1.42 million rows/s., 81.06 MB/s.)
sentry-clickhouse :) SELECT * FROM default.outcomes_raw_dist WHERE category = 9 ORDER BY timestamp desc LIMIT 10;

SELECT *
FROM default.outcomes_raw_dist
WHERE category = 9
ORDER BY timestamp DESC
LIMIT 10

Query id: 6c6c145c-6c66-4b26-8e87-426a4c4b934b

┌─org_id─┬─project_id─┬─key_id─┬───────────timestamp─┬─outcome─┬─reason─┬─event_id─────────────────────────────┬─quantity─┬─category─┬─size─┐
│      1 │          4 │      4 │ 2023-09-07 17:40:29 │       0 │ ᴺᵁᴸᴸ   │ 562a3984-8051-443f-ad1e-a3080e21a824 │        1 │        9 │    0 │
└────────┴────────────┴────────┴─────────────────────┴─────────┴────────┴──────────────────────────────────────┴──────────┴──────────┴──────┘
┌─org_id─┬─project_id─┬─key_id─┬───────────timestamp─┬─outcome─┬─reason─┬─event_id─────────────────────────────┬─quantity─┬─category─┬─size─┐
│      1 │          2 │      2 │ 2023-09-07 17:40:26 │       0 │ ᴺᵁᴸᴸ   │ 33e8d3ac-15ea-4d4a-a772-cf558d39078b │        1 │        9 │    0 │
└────────┴────────────┴────────┴─────────────────────┴─────────┴────────┴──────────────────────────────────────┴──────────┴──────────┴──────┘
┌─org_id─┬─project_id─┬─key_id─┬───────────timestamp─┬─outcome─┬─reason─┬─event_id─────────────────────────────┬─quantity─┬─category─┬─size─┐
│      1 │          4 │      4 │ 2023-09-07 17:19:09 │       0 │ ᴺᵁᴸᴸ   │ f8f3404a-7c98-482b-bfcb-a4a312477f73 │        1 │        9 │    0 │
│      1 │          4 │      4 │ 2023-09-07 17:16:15 │       0 │ ᴺᵁᴸᴸ   │ be9a5f06-094e-4bab-86cf-5574eca398dc │        1 │        9 │    0 │
│      1 │          2 │      2 │ 2023-09-07 17:16:13 │       0 │ ᴺᵁᴸᴸ   │ e455fbf5-335c-487d-83a8-ebf83581d2d5 │        1 │        9 │    0 │
│      1 │          4 │      4 │ 2023-09-07 17:09:16 │       0 │ ᴺᵁᴸᴸ   │ 997e0c2d-da72-4d29-916a-e6685f8b7d9a │        1 │        9 │    0 │
│      1 │          4 │      4 │ 2023-09-07 17:02:16 │       0 │ ᴺᵁᴸᴸ   │ d4527801-ad16-45dc-9b3e-c7dbcf197c83 │        1 │        9 │    0 │
│      1 │          4 │      4 │ 2023-09-07 16:54:01 │       0 │ ᴺᵁᴸᴸ   │ a633a3ac-10f6-4954-abb8-56e27cf25aa8 │        1 │        9 │    0 │
│      1 │          4 │      4 │ 2023-09-07 16:51:05 │       0 │ ᴺᵁᴸᴸ   │ 886afd06-767d-4e29-acdc-9772b9e092b2 │        1 │        9 │    0 │
│      1 │          2 │      2 │ 2023-09-07 16:47:47 │       0 │ ᴺᵁᴸᴸ   │ 70b28e06-0904-4547-99c9-93fc44b7c9d2 │        1 │        9 │    0 │
└────────┴────────────┴────────┴─────────────────────┴─────────┴────────┴──────────────────────────────────────┴──────────┴──────────┴──────┘

Data is also available in _hourly variant:

sentry-clickhouse :) SELECT * FROM default.outcomes_hourly_dist WHERE category = 9 ORDER BY timestamp desc LIMIT 20;

SELECT *
FROM default.outcomes_hourly_dist
WHERE category = 9
ORDER BY timestamp DESC
LIMIT 20

Query id: 020e4859-8a9e-4fcb-959d-512f0d2ea43a

┌─org_id─┬─project_id─┬─key_id─┬───────────timestamp─┬─outcome─┬─reason─┬─times_seen─┬─quantity─┬─category─┬─bytes_received─┐
│      1 │          2 │      2 │ 2023-09-07 17:00:00 │       0 │ none   │          2 │        2 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 17:00:00 │       0 │ none   │          7 │        7 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 16:00:00 │       0 │ none   │          8 │        8 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 16:00:00 │       0 │ none   │          8 │        8 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 15:00:00 │       0 │ none   │          6 │        6 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 15:00:00 │       0 │ none   │          3 │        3 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 14:00:00 │       0 │ none   │         12 │       12 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 14:00:00 │       0 │ none   │          7 │        7 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 13:00:00 │       0 │ none   │         11 │       11 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 13:00:00 │       0 │ none   │         20 │       20 │        9 │              0 │
│      1 │         11 │     11 │ 2023-09-07 12:00:00 │       0 │ none   │          1 │        1 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 12:00:00 │       0 │ none   │         30 │       30 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 12:00:00 │       0 │ none   │          9 │        9 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 11:00:00 │       0 │ none   │         17 │       17 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 11:00:00 │       0 │ none   │         10 │       10 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 10:00:00 │       0 │ none   │         46 │       46 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 10:00:00 │       0 │ none   │         20 │       20 │        9 │              0 │
│      1 │          2 │      2 │ 2023-09-07 09:00:00 │       0 │ none   │         27 │       27 │        9 │              0 │
│      1 │          4 │      4 │ 2023-09-07 09:00:00 │       0 │ none   │         16 │       16 │        9 │              0 │
│      1 │         11 │     11 │ 2023-09-07 08:00:00 │       0 │ none   │          1 │        1 │        9 │              0 │
└────────┴────────────┴────────┴─────────────────────┴─────────┴────────┴────────────┴──────────┴──────────┴────────────────┘

20 rows in set. Elapsed: 0.022 sec. Processed 2.31 thousand rows, 127.62 KB (107.03 thousand rows/s., 5.92 MB/s.)

So... I'm puzzled. The problem must be on the reading side, but where could it be?
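One way to narrow down the reading side is to hit the stats API directly and compare its response with the ClickHouse rows above. A sketch that builds the kind of stats_v2 request the Stats page issues (parameter names inferred from `organization_stats_v2.py`; treat the exact names and values as assumptions and verify against your Sentry version):

```python
# Sketch: build a stats_v2 request to compare the API's view with the
# rows visible in ClickHouse. Endpoint path and parameter names are
# inferred from src/sentry/api/endpoints/organization_stats_v2.py --
# verify them against your Sentry version before relying on this.
from urllib.parse import urlencode


def stats_v2_url(base: str, org_slug: str, category: str) -> str:
    """URL for a 24h, hourly-bucketed outcome breakdown of one category."""
    params = urlencode({
        "field": "sum(quantity)",
        "groupBy": "outcome",
        "interval": "1h",
        "statsPeriod": "24h",
        "category": category,  # e.g. "transaction"
    })
    return f"{base}/api/0/organizations/{org_slug}/stats_v2/?{params}"


url = stats_v2_url("https://sentry.example.com", "my-org", "transaction")
print(url)
# Then, with an auth token:
#   curl -H "Authorization: Bearer <token>" "<url>"
# If ClickHouse has rows but this endpoint returns zeros, the fault is
# between the API and Snuba rather than in the ingestion pipeline.
```

(`sentry.example.com` and `my-org` are placeholders for your own host and organization slug.)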