getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Stats page not showing any accepted transactions or profiles since upgrade from 21.8.0 to 23.6.2 #62568

Closed: roganartu closed this issue 5 months ago

roganartu commented 8 months ago

Self-Hosted Version

23.6.2

CPU Architecture

x86_64

Docker Version

19.03.13, build 4484c46d9d

Docker Compose Version

N/A

Steps to Reproduce

Unsure exactly how to repro, as we don't run Sentry in the Docker stack; instead we run a cluster of hosts with systemd units that mirror the Docker images. I have gone through both our Salt configuration and the actual units running on the machines several times, comparing them to the commands in the docker-compose file, and everything appears to be correct (though it's certainly possible I missed something).

Another perhaps-interesting aspect of our environment is that we run Redpanda instead of Kafka. But since Redpanda just exposes the Kafka API (which is all Sentry uses here), and this used to work pre-upgrade, I'm not sure it matters in practice.

Expected Result

Stats page shows correct accepted, dropped, and filtered counts for Errors, Transactions, Attachments, and Profiles.

Actual Result

Only the Errors stats appear correct. Transactions and Profiles both show only Dropped and Filtered, though afaict both are still being processed correctly (they appear in other views).

Errors, working correctly: (screenshot)

Transactions, only showing Dropped and Filtered: (screenshot)

Profiles, only showing Dropped: (screenshot)

We don't use attachments, so those are just always zero.

This used to work fine on v21.8.0 before we upgraded about 5 months ago.

Perhaps related, we see perpetually increasing lag on several groups:

Example of this perpetual lag (this is very soon after we nuked the Kafka cluster data, hence the low offsets):

❯ rpk group describe snuba-transactions-subscriptions-consumers
GROUP        snuba-transactions-subscriptions-consumers
COORDINATOR  3
STATE        Stable
BALANCER     range
MEMBERS      12
TOTAL-LAG    78616

TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    MEMBER-ID                                     CLIENT-ID  HOST
snuba-transactions-commit-log  0          0               0               -      rdkafka-4054075a-039f-4724-941a-b72c70e00a07  rdkafka    10.16.97.179
snuba-transactions-commit-log  1          0               3936            3936   rdkafka-4054075a-039f-4724-941a-b72c70e00a07  rdkafka    10.16.97.179
snuba-transactions-commit-log  2          0               0               -      rdkafka-5dc23c89-2f67-4a11-9959-9cb88d4c5c46  rdkafka    10.16.88.140
snuba-transactions-commit-log  3          0               3935            3935   rdkafka-5dc23c89-2f67-4a11-9959-9cb88d4c5c46  rdkafka    10.16.88.140
snuba-transactions-commit-log  4          0               0               -      rdkafka-608e914a-0f7c-4321-a30f-c3e924142402  rdkafka    10.16.97.179
snuba-transactions-commit-log  5          0               3939            3939   rdkafka-608e914a-0f7c-4321-a30f-c3e924142402  rdkafka    10.16.97.179
snuba-transactions-commit-log  6          0               0               -      rdkafka-6aff0b2a-0d92-419b-b21a-a895af8251af  rdkafka    10.16.104.153
snuba-transactions-commit-log  7          0               0               -      rdkafka-6aff0b2a-0d92-419b-b21a-a895af8251af  rdkafka    10.16.104.153
snuba-transactions-commit-log  8          0               3924            3924   rdkafka-7ec74dae-c692-4d8c-8a3d-6c4803b92699  rdkafka    10.16.88.140
snuba-transactions-commit-log  9          0               3939            3939   rdkafka-7ec74dae-c692-4d8c-8a3d-6c4803b92699  rdkafka    10.16.88.140
snuba-transactions-commit-log  10         0               3941            3941   rdkafka-80bd5a41-4c88-4e21-b0a7-17867045d5f4  rdkafka    10.16.88.140
snuba-transactions-commit-log  11         0               3931            3931   rdkafka-80bd5a41-4c88-4e21-b0a7-17867045d5f4  rdkafka    10.16.88.140
snuba-transactions-commit-log  12         0               7856            7856   rdkafka-93ed3f3c-9315-4750-a4a7-896f4686e172  rdkafka    10.16.97.179
snuba-transactions-commit-log  13         0               7846            7846   rdkafka-93ed3f3c-9315-4750-a4a7-896f4686e172  rdkafka    10.16.97.179
snuba-transactions-commit-log  14         0               0               -      rdkafka-b32c0397-c0be-47e0-bb9b-7d43b06f1de1  rdkafka    10.16.88.140
snuba-transactions-commit-log  15         0               3933            3933   rdkafka-b32c0397-c0be-47e0-bb9b-7d43b06f1de1  rdkafka    10.16.88.140
snuba-transactions-commit-log  16         0               7873            7873   rdkafka-b75064d7-8da6-4cae-b5fa-9572b2a40a07  rdkafka    10.16.104.153
snuba-transactions-commit-log  17         0               0               -      rdkafka-bdfdf287-01eb-4cec-8d87-1043f62bc19f  rdkafka    10.16.104.153
snuba-transactions-commit-log  18         0               15703           15703  rdkafka-c0c17be2-9acc-482f-8c79-64124b0f97f5  rdkafka    10.16.104.153
snuba-transactions-commit-log  19         0               7860            7860   rdkafka-f6981507-3282-4273-8144-a0ec8feb1495  rdkafka    10.16.97.179

The consumer that the docker-compose file associates with this group (snuba-subscription-consumer-transactions) logs a small number of the following error, but it has always done this, even before we upgraded to 23.6.2:

Dec 22 15:27:27 shasentry1 docker[71824]: 2023-12-22 20:27:27,880 Could not construct valid interval between MessageDetails(offset=28978, orig_message_ts=datetime.datetime(2023, 12, 22, 20, 27, 26, 863000)) and MessageDetails(offset=28986, orig_message_ts=datetime.datetime(2023, 12, 22, 20, 27, 26, 846000))!                                                                                                                                                                        
Dec 22 15:27:27 shasentry1 docker[71824]: Traceback (most recent call last):
Dec 22 15:27:27 shasentry1 docker[71824]:   File "/usr/src/snuba/snuba/subscriptions/scheduler_consumer.py", line 148, in poll
Dec 22 15:27:27 shasentry1 docker[71824]:     time_interval = Interval(
Dec 22 15:27:27 shasentry1 docker[71824]:   File "<string>", line 5, in __init__
Dec 22 15:27:27 shasentry1 docker[71824]:   File "/usr/src/snuba/snuba/utils/types.py", line 64, in __post_init__
Dec 22 15:27:27 shasentry1 docker[71824]:     raise InvalidRangeError(self.lower, self.upper)
Dec 22 15:27:27 shasentry1 docker[71824]: snuba.utils.types.InvalidRangeError: (datetime.datetime(2023, 12, 22, 20, 27, 26, 863000), datetime.datetime(2023, 12, 22, 20, 27, 26, 846000))

To put "small number" in context: we process thousands of transactions a minute and see around 5 of the above errors in the logs every 10 minutes or so. If it's relevant to this particular error, the machines that run Redpanda/Kafka are set to Eastern Time, while all the Sentry containers run in Docker and are therefore on UTC.

Initially we thought the missing stats were related to a group-join timeout error we were experiencing: snuba-commit-log-<random string> groups kept getting created and were never cleaned up (to the point that there were at least thousands of them), and an older version of Redpanda didn't handle this well. However, we recently upgraded Redpanda and cleared all data from the cluster in the process, which fixed the group-join timeouts, yet this stats issue persists.

I understand our setup is not really the blessed path here and that it's entirely possible we're holding something wrong, but we've exhausted the debugging paths we can follow by reading the source with our relatively limited understanding of the components, so any tips on what else we could look into would be appreciated. I can provide any logs or command outputs that might be useful.

Event ID

No response

hubertdeng123 commented 8 months ago

I myself am not familiar with this area of the product, so I will transfer this over to the sentry repo where folks can be of more help. However, note that since your setup is not officially supported and highly customized, I'm not sure how much help we can be.

getsantry[bot] commented 8 months ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 8 months ago

Routing to @getsentry/product-owners-stats for triage ⏲️

leedongwei commented 8 months ago

This is difficult to debug as I don't have an intuition for your setup. Let me check with the Snuba team to see if they recognize any patterns.

untitaker commented 8 months ago

Only the Errors stats appear correct. Transactions and Profiles both show only Dropped and Filtered, though afaict both are still being processed correctly (they appear in other views).

If you are sure that just the stats page is broken (and not the actual product), you want to take a close look at the outcomes table in ClickHouse and the outcomes topic: that's how those stats are generated. Generally speaking, the bulk of the volume (filtered/discarded) is emitted by Relay into the outcomes topic; Accepted for transactions comes from Sentry, in def save_event.

It's possible that Relay is either not producing the correct stats, or that for some reason they do not end up in ClickHouse. It should be possible to verify with kafkacat whether you have any Filtered outcomes for the relevant DataCategory.
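
If kafkacat isn't handy, here is a rough sketch of an equivalent check using the confluent-kafka Python client. It is only a debugging aid, and the broker address, topic name, and group id below are assumptions to adjust for your environment:

# Rough sketch: tally (category, outcome) pairs from the outcomes topic.
# Assumes confluent-kafka is installed, the broker address below is reachable,
# and the topic is named "outcomes"; adjust all three to your environment.
import json
from collections import Counter

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: your Redpanda/Kafka broker
    "group.id": "outcomes-debug",           # throwaway group id for this check
    "auto.offset.reset": "latest",
    "enable.auto.commit": False,
})
consumer.subscribe(["outcomes"])

counts = Counter()
try:
    for _ in range(10000):  # sample a bounded number of messages, then stop
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        payload = json.loads(msg.value())
        counts[(payload.get("category"), payload.get("outcome"))] += 1
finally:
    consumer.close()

for (category, outcome), n in counts.most_common():
    print(f"category={category} outcome={outcome} count={n}")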

Here is the definition of the numbers for the outcome JSON property in Kafka: https://github.com/getsentry/relay/blob/66c2f8272a8dd372479a569feca4da845c9f37fe/relay-server/src/actors/outcome.rs#L76-L84

Beyond those general debugging tips, I'm not sure we can help. We attempted to run Redpanda ourselves and got nowhere; I'm surprised it runs at all.

untitaker commented 8 months ago

DataCategories are here: https://github.com/getsentry/relay/blob/66c2f8272a8dd372479a569feca4da845c9f37fe/relay-base-schema/src/data_category.rs#L15-L76

roganartu commented 8 months ago

Firstly, thanks for all the pointers. This is really useful, and you probably saved me many hours of digging to find the same.

if you are sure that just the stats page is broken (and not the actual product)

Fairly, yeah. We are certainly processing transactions, and the volumes seem similar to pre-upgrade. We can search and view them, and using the Discover part of the product we can still get aggregated transaction counts, despite the stats page showing a perpetually-zero "Accepted" count.

you want to take a close look at the outcomes table in clickhouse

The data does appear to be in ClickHouse. I'm not sure whether the category is correct, or whether this data absolves Redpanda of any potential blame here. FWIW, we are migrating off Redpanda to Kafka in the next week or so (for unrelated reasons; we're simply moving ownership of that component within our org to an existing team with more Kafka experience), so I suppose we will see then whether Redpanda is contributing.

SELECT
    category,
    outcome,
    reason,
    sum(quantity)
FROM outcomes_raw_local
GROUP BY
    category,
    outcome,
    reason
ORDER BY sum(quantity) DESC

Query id: 301c1244-dcbb-4f06-8a3f-b55be9960b21

┌─category─┬─outcome─┬─reason───────────────────────────┬─sum(quantity)─┐
│        2 │       5 │ sample_rate                      │    7529418783 │
│        9 │       0 │ ᴺᵁᴸᴸ                             │     357893253 │
│        1 │       0 │ ᴺᵁᴸᴸ                             │     126100743 │
│        6 │       5 │ sample_rate                      │      44530607 │
│        1 │       2 │ project_quota                    │      44370710 │
│        2 │       2 │ project_quota                    │      24009310 │
│        2 │       5 │ ratelimit_backoff                │      17807691 │
│        2 │       1 │ filtered-transaction             │      17705941 │
│        1 │       5 │ ratelimit_backoff                │      16552339 │
│        9 │       2 │ project_quota                    │      12608175 │
│        1 │       5 │ network_error                    │       8174943 │
│        0 │       5 │ network_error                    │       8090205 │
│        1 │       5 │ queue_overflow                   │       8031266 │
│        1 │       5 │ event_processor                  │       1679459 │
│        6 │       5 │ insufficient_data                │       1351828 │
│        0 │       5 │ ratelimit_backoff                │        919310 │
│        1 │       3 │ internal                         │        669704 │
│        1 │       5 │ sample_rate                      │        570348 │
│        1 │       1 │ error-message                    │        476997 │
│       11 │       0 │ ᴺᵁᴸᴸ                             │        343917 │
│        2 │       5 │ queue_overflow                   │        157061 │
│        2 │       3 │ internal                         │        147371 │
│        9 │       3 │ internal                         │        107377 │
│        2 │       5 │ backpressure                     │         58396 │
│        2 │       5 │ network_error                    │         20499 │
│        2 │       3 │ too_large                        │         16221 │
│        6 │       2 │ project_quota                    │         15036 │
│        1 │       5 │ before_send                      │          1667 │
│        6 │       5 │ queue_overflow                   │           983 │
│        2 │       3 │ invalid_transaction              │           172 │
│        6 │       3 │ internal                         │           103 │
│        1 │       3 │ too_large                        │            78 │
│       11 │       3 │ internal                         │            53 │
│        0 │       5 │ queue_overflow                   │            48 │
│        5 │       5 │ network_error                    │            42 │
│        1 │       1 │ discarded-hash                   │            21 │
│       11 │       2 │ project_quota                    │            14 │
│       11 │       3 │ profiling_failed_vroom_insertion │            14 │
│        6 │       3 │ too_large                        │             8 │
│        2 │       3 │ timestamp                        │             6 │
│        6 │       5 │ network_error                    │             3 │
│        6 │       1 │ filtered-transaction             │             2 │
└──────────┴─────────┴──────────────────────────────────┴───────────────┘

42 rows in set. Elapsed: 0.709 sec. Processed 510.67 million rows, 3.58 GB (719.94 million rows/s., 5.04 GB/s.)
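
To make those numbers readable, my rough reading of the Relay enums untitaker linked above is the following; this is an assumption on my part and worth double-checking against those files:

# Rough reading of the linked relay enums (assumption; verify against the linked sources).
OUTCOMES = {
    0: "accepted",
    1: "filtered",
    2: "rate_limited",
    3: "invalid",
    4: "abuse",
    5: "client_discard",
}
# Only the categories that show up in our outcomes data.
CATEGORIES = {
    0: "default",
    1: "error",
    2: "transaction",
    5: "session",
    6: "profile",
    9: "transaction_indexed",  # appears to carry our accepted transaction volume
    11: "profile_indexed",     # carries the profiling_failed_vroom_insertion reason
}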

Getting some daily counts to make the numbers a bit more tangible:

SELECT
    toYear(timestamp) AS year,
    toMonth(timestamp) AS month,
    toDayOfMonth(timestamp) AS day,
    category,
    outcome,
    reason,
    sum(quantity)
FROM outcomes_raw_local
WHERE (year = 2024) AND (category = 9)
GROUP BY
    year,
    month,
    day,
    category,
    outcome,
    reason
ORDER BY
    year ASC,
    month ASC,
    day ASC

Query id: 581255fb-7619-4369-8e26-e9fbeea736c2

┌─year─┬─month─┬─day─┬─category─┬─outcome─┬─reason────────┬─sum(quantity)─┐
│ 2024 │     1 │   1 │        9 │       2 │ project_quota │        518304 │
│ 2024 │     1 │   1 │        9 │       0 │ ᴺᵁᴸᴸ          │      10358623 │
│ 2024 │     1 │   1 │        9 │       3 │ internal      │          3751 │
│ 2024 │     1 │   2 │        9 │       2 │ project_quota │        538604 │
│ 2024 │     1 │   2 │        9 │       3 │ internal      │          4414 │
│ 2024 │     1 │   2 │        9 │       0 │ ᴺᵁᴸᴸ          │      12473107 │
│ 2024 │     1 │   3 │        9 │       3 │ internal      │          5281 │
│ 2024 │     1 │   3 │        9 │       2 │ project_quota │        554335 │
│ 2024 │     1 │   3 │        9 │       0 │ ᴺᵁᴸᴸ          │      13192023 │
│ 2024 │     1 │   4 │        9 │       0 │ ᴺᵁᴸᴸ          │      13267369 │
│ 2024 │     1 │   4 │        9 │       3 │ internal      │          4725 │
│ 2024 │     1 │   4 │        9 │       2 │ project_quota │        504937 │
│ 2024 │     1 │   5 │        9 │       0 │ ᴺᵁᴸᴸ          │       8885407 │
│ 2024 │     1 │   5 │        9 │       2 │ project_quota │        154286 │
│ 2024 │     1 │   5 │        9 │       3 │ internal      │          3368 │
└──────┴───────┴─────┴──────────┴─────────┴───────────────┴───────────────┘

15 rows in set. Elapsed: 0.254 sec. Processed 83.90 million rows, 906.55 MB (330.84 million rows/s., 3.57 GB/s.)

And if we look at the counts in the actual transactions table (which I assume is what I'd be querying via the Discover feature?), we can see that the numbers almost perfectly match the category = 9, outcome = 0 counts:

SELECT
    toYear(timestamp) AS year,
    toMonth(timestamp) AS month,
    toDayOfMonth(timestamp) AS day,
    count(*)
FROM transactions_local
WHERE year = 2024
GROUP BY
    year,
    month,
    day
ORDER BY
    year ASC,
    month ASC,
    day ASC

Query id: 378d2754-5e8b-4904-9d83-3043bd703142

┌─year─┬─month─┬─day─┬──count()─┐
│ 2024 │     1 │   1 │ 10358642 │
│ 2024 │     1 │   2 │ 12473081 │
│ 2024 │     1 │   3 │ 13192019 │
│ 2024 │     1 │   4 │ 13267367 │
│ 2024 │     1 │   5 │  8948744 │
└──────┴───────┴─────┴──────────┘

5 rows in set. Elapsed: 0.264 sec. Processed 58.24 million rows, 232.96 MB (220.92 million rows/s., 883.68 MB/s.)

If it matters, outcomes_hourly_local also seems to match the data in outcomes_raw_local:

SELECT
    toYear(timestamp) AS year,
    toMonth(timestamp) AS month,
    toDayOfMonth(timestamp) AS day,
    category,
    outcome,
    reason,
    sum(quantity)
FROM outcomes_hourly_local
WHERE (year = 2024) AND (category = 9)
GROUP BY
    year,
    month,
    day,
    category,
    outcome,
    reason
ORDER BY
    year ASC,
    month ASC,
    day ASC

Query id: a1793bea-632e-4e94-8ac0-a51b25944f3f

┌─year─┬─month─┬─day─┬─category─┬─outcome─┬─reason────────┬─sum(quantity)─┐
│ 2024 │     1 │   1 │        9 │       2 │ project_quota │        518304 │
│ 2024 │     1 │   1 │        9 │       0 │ none          │      10358623 │
│ 2024 │     1 │   1 │        9 │       3 │ internal      │          3751 │
│ 2024 │     1 │   2 │        9 │       3 │ internal      │          4414 │
│ 2024 │     1 │   2 │        9 │       2 │ project_quota │        538604 │
│ 2024 │     1 │   2 │        9 │       0 │ none          │      12473107 │
│ 2024 │     1 │   3 │        9 │       0 │ none          │      13192023 │
│ 2024 │     1 │   3 │        9 │       3 │ internal      │          5281 │
│ 2024 │     1 │   3 │        9 │       2 │ project_quota │        554335 │
│ 2024 │     1 │   4 │        9 │       0 │ none          │      13267369 │
│ 2024 │     1 │   4 │        9 │       3 │ internal      │          4725 │
│ 2024 │     1 │   4 │        9 │       2 │ project_quota │        504937 │
│ 2024 │     1 │   5 │        9 │       3 │ internal      │          3804 │
│ 2024 │     1 │   5 │        9 │       0 │ none          │      10613769 │
│ 2024 │     1 │   5 │        9 │       2 │ project_quota │        156942 │
└──────┴───────┴─────┴──────────┴─────────┴───────────────┴───────────────┘

15 rows in set. Elapsed: 0.010 sec. Processed 36.72 thousand rows, 551.43 KB (3.76 million rows/s., 56.45 MB/s.)

We do have a slightly modified quota class (subclassing redis.RedisQuota and applying quotas from a config file, to enforce an overall org quota as well as per-project quotas on events and transactions) that I thought had been correctly updated when we upgraded Sentry, but I'll take a closer look at the diff now with this new information. We have not noticed any issues with rate limiting itself post-upgrade (and the outcomes numbers seem to support this, with the reason = "project_quota" entries). Could the values Relay gets from this quota class affect how ingested events themselves are categorised? We run vanilla Relay and Snuba images (except for an injected configuration file).

Concretely, the quota class extends get_quotas to sprinkle in a number of sentry.quotas.base.QuotaConfig entries (one for the overall org quota, plus per-project total, errors, and transactions quotas), simply to protect our install against badly behaved clients. I can provide the full source if it would be useful; it's fairly short and simple. The rough shape is sketched below.
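
This is a simplified, from-memory sketch rather than our exact code; the import paths, QuotaConfig fields, and config-file shape are assumptions and may differ between Sentry versions:

# Simplified sketch of the quota subclass (not the exact code; field names and
# import paths are assumptions and may differ between Sentry versions).
import json

from sentry.quotas.base import QuotaConfig, QuotaScope
from sentry.quotas.redis import RedisQuota

QUOTA_FILE = "/etc/sentry/quotas.json"  # hypothetical path
# Assumed file shape: {"org_limit": 100000, "project_limits": {"42": 5000}}


def _load_limits():
    with open(QUOTA_FILE) as f:
        return json.load(f)


class ConfigFileQuota(RedisQuota):
    """RedisQuota plus static org/project limits read from a config file."""

    def get_quotas(self, project, key=None, keys=None):
        quotas = super().get_quotas(project, key=key, keys=keys)
        limits = _load_limits()

        # Overall organization quota, applied on top of whatever Sentry returns.
        quotas.append(
            QuotaConfig(
                id="org_total",
                scope=QuotaScope.ORGANIZATION,
                scope_id=project.organization_id,
                limit=limits["org_limit"],
                window=60,  # events per 60-second window
                reason_code="org_quota",
            )
        )

        # Per-project total quota, if one is configured for this project.
        project_limit = limits.get("project_limits", {}).get(str(project.id))
        if project_limit is not None:
            quotas.append(
                QuotaConfig(
                    id="project_total",
                    scope=QuotaScope.PROJECT,
                    scope_id=project.id,
                    limit=project_limit,
                    window=60,
                    reason_code="project_quota",
                )
            )

        # The per-project errors and transactions quotas work the same way, with a
        # categories=[...] argument restricting each entry; omitted here for brevity.
        return quotas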

untitaker commented 8 months ago

Please provide the code, but we really do not support this kind of setup at all. I think you would be better off configuring rate limits in an external load balancer, although that comes with obvious drawbacks w.r.t. granularity.

roganartu commented 5 months ago

Sorry for the delay getting back here; we were doing a few different things with our cluster that kept us occupied (a Postgres upgrade and migrating from Redpanda to Kafka, mainly).

I was looking into something else this week and found myself reading the docker-compose.yml file from the self-hosted repo for an unrelated issue. I eventually stumbled across this commit, https://github.com/getsentry/self-hosted/commit/cad1deb8b28f9a3562e20f43db6595f279ab54a1, which adds some new consumers and turns on some flags, but which was released after the version we are currently running. I reviewed our sentry.conf.py and found that those flags had been inadvertently turned on during our last upgrade, but the corresponding consumers had not been created. After adding them, our issue appears to be resolved, and the transaction and profile stats pages now show accepted counts as expected.
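
In case anyone else mirrors the docker-compose services with systemd units, this kind of drift is easy to spot mechanically. A rough sketch (assumes PyYAML is available and a hypothetical sentry-<service>.service naming convention; adjust both to your setup):

# Rough drift check between the upstream docker-compose.yml and local systemd units.
# Assumptions: PyYAML is installed, COMPOSE_FILE points at the self-hosted compose
# file for your target version, and units follow a "sentry-<service>.service"
# naming convention (hypothetical; adjust to taste).
import subprocess

import yaml

COMPOSE_FILE = "docker-compose.yml"  # hypothetical path
UNIT_PREFIX = "sentry-"              # hypothetical naming convention

with open(COMPOSE_FILE) as f:
    compose_services = set(yaml.safe_load(f).get("services", {}))

output = subprocess.run(
    ["systemctl", "list-units", f"{UNIT_PREFIX}*", "--no-legend", "--plain"],
    capture_output=True, text=True, check=True,
).stdout
running = {
    line.split()[0].removeprefix(UNIT_PREFIX).removesuffix(".service")
    for line in output.splitlines()
    if line.strip()
}

print("Defined in compose but not running here:", sorted(compose_services - running))
print("Running here but not in compose:", sorted(running - compose_services))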

Apologies for the noise here, and thanks for the time taken to look into what turned out to be a self-inflicted issue :sweat:.