getsentry / self-hosted

Sentry, feature-complete and packaged up for low-volume deployments and proofs-of-concept
https://develop.sentry.dev/self-hosted/

Sentry stopped accepting transaction data #2876

Open ingria opened 7 months ago

ingria commented 7 months ago

Self-Hosted Version

24.3.0.dev0

CPU Architecture

x86_64

Docker Version

24.0.4

Docker Compose Version

24.0.4

Steps to Reproduce

Update to the latest master

Expected Result

Everything works fine

Actual Result

The Performance page shows zeros for the time period from the update until now:

[screenshot]

Project page shows the correct info about transactions and errors:

[screenshot]

The Stats page shows 49k transactions, of which 49k are dropped:

[screenshot]

Same for errors:

[screenshot]

Event ID

No response

UPD

There are a lot of errors in the ClickHouse container:

2024.03.10 23:40:34.789282 [ 46 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):

0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x101540cd in /usr/bin/clickhouse
3. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6fd5 in /usr/bin/clickhouse
4. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x13c5614f in /usr/bin/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x13c57bda in /usr/bin/clickhouse
7. Poco::PooledThread::run() @ 0x13d89e59 in /usr/bin/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x13d860ea in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 21.8.13.1.altinitystable (altinity build))
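
A rough way to gauge how noisy these errors are (a sketch; the clickhouse service name comes from the stock compose file, adjust if yours differs):

# count "Socket is not connected" errors from the clickhouse service over the last hour
docker compose logs --since 1h clickhouse 2>&1 | grep -c 'Socket is not connected'

# follow them live to correlate them with consumer activity
docker compose logs -f clickhouse 2>&1 | grep 'ServerErrorHandler'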
ingria commented 7 months ago

Also, for some reason Sentry started dropping incoming errors some time ago (as if I were using SaaS Sentry):

[screenshot]

barisyild commented 7 months ago

Did you change the port? I had the same situation when I changed the port.

ingria commented 7 months ago

Yes, I have the relay port exposed to the host network. How did you manage to fix the problem?

barisyild commented 7 months ago

> Yes, I have the relay port exposed to the host network. How did you manage to fix the problem?

When I reverted the port change, the problem was resolved.

ingria commented 7 months ago

Nope, didn't help. Doesn't work even with default config. Thanks for the tip though

hubertdeng123 commented 6 months ago

Are there any logs in your web container that can help? Are you sure you are receiving the event envelopes? You should be able to see that activity in your nginx container.
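
A minimal sketch of that check, assuming the stock nginx, relay, and web service names from docker-compose.yml:

# are envelope submissions reaching the edge?
docker compose logs --since 1h nginx | grep 'envelope'

# is anything erroring while they are forwarded on?
docker compose logs --since 1h relay | grep -i error
docker compose logs --since 1h web | grep -i error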

linxiaowang commented 6 months ago

Same here: on the browser side there is a request sent with an event type of "transaction", but no data is displayed under Performance, and the number of transactions in the project is also 0.

linxiaowang commented 6 months ago

> Same here: on the browser side there is a request sent with an event type of "transaction", but no data is displayed under Performance, and the number of transactions in the project is also 0.

Problem solved: the server time did not match the SDK time.
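
This is worth checking, since events whose timestamps disagree too much with the server clock can be discarded or land outside the queried time range. A hedged way to compare the clocks on the Sentry host and the machine running the SDK:

# run on both machines and compare; they should agree to within a few seconds
date -u +"%Y-%m-%dT%H:%M:%SZ"

# on systemd hosts, confirm NTP synchronization is active
timedatectl status | grep -E 'System clock synchronized|NTP service'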

ingria commented 6 months ago

I can see that there are successful requests to /api/2/envelope:

[screenshot]

I can also see transaction statistics on the projects page:

The number of 394k for the last 24 hours is about right.

hubertdeng123 commented 6 months ago

Are you on a nightly version of self-hosted? What does your sentry.conf.py look like? We've added some feature flags there to support the new performance features.

ingria commented 6 months ago

I'm using Docker with the latest commit from this repository. The bottom of the page says Sentry 24.3.0.dev0 unknown, so I guess that's nightly.

I've updated sentry.conf.py to match the most recent version from this repo; now the only difference is in the SENTRY_SINGLE_ORGANIZATION and CSRF_TRUSTED_ORIGINS variables.

After that, errors have also disappeared:

[screenshot]

williamdes commented 6 months ago

I can confirm that the ClickHouse errors are due to the Rust workers; reverting the workers part of #2831 and #2861 made the errors disappear. But I still see too many transactions being dropped since the upgrade.

Worker code: https://github.com/getsentry/snuba/blob/359878fbe030a63945914ef05e705224680b453c/rust_snuba/src/strategies/clickhouse.rs#L61

Worker logs show that the insert is done (is it?): "timestamp":"2024-03-16T11:40:52.491448Z","level":"INFO","fields":{"message":"Inserted 29 rows"},
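
One way to double-check whether those rows actually landed (a sketch; the clickhouse service name, the transactions_local table, and the finish_ts column are assumptions based on Snuba's default transactions schema):

# count transactions written to ClickHouse in the last hour
docker compose exec clickhouse clickhouse-client --query \
  "SELECT count() FROM transactions_local WHERE finish_ts > now() - INTERVAL 1 HOUR"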

aldy505 commented 6 months ago

The error is caused by the connection being closed prematurely. See https://github.com/getsentry/self-hosted/issues/2900

LvckyAPI commented 6 months ago

Same issue on latest 24.3.0

[screenshots]

LvckyAPI commented 6 months ago

Errors are not being logged either.

aldy505 commented 6 months ago

Okay so I'm able to replicate this issue on my instance (24.3.0). What happens is that Sentry does accept transaction/error/profile/replay/attachment data, but it doesn't record it in the statistics. So your stats of ingested events may look as if no events were recorded, but the events are actually there -- they're processed by Snuba and you can view them in the web UI.

Can anyone reading this confirm that that's what happened on your instances as well? (I don't want to ping everybody)

If the answer to that 👆🏻 is "yes", that means something (a module, container, or something else) that ingests the events isn't inserting the data correctly for it to be queried as statistics. I don't know for sure whether it's the responsibility of the Snuba consumers (as we moved to rust-consumer just in 24.3.0) or the Sentry workers, but I'd assume it's the Snuba consumers.

A few possible solutions (well, not really, but I hope this would get rid of the issue):

  1. Fix the issue somewhere and cut a patch release.
  2. If it's caused by the rust-consumers, roll back to the old Python consumers.
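
A hedged way to narrow this down is to look at the consumers that feed the outcomes dataset (which backs the Stats page) separately from the event consumers. The service names below are assumptions based on the stock self-hosted docker-compose.yml, so adjust them to your checkout:

# is the outcomes consumer running, and is it logging inserts?
docker compose ps | grep outcomes
docker compose logs --since 1h snuba-outcomes-consumer | tail -n 50

# compare with an event consumer that is known to be ingesting
docker compose logs --since 1h snuba-transactions-consumer | tail -n 50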
LvckyAPI commented 6 months ago

> Okay so I'm able to replicate this issue on my instance (24.3.0). What happens is that Sentry does accept transaction/error/profile/replay/attachment data, but it doesn't record it in the statistics. So your stats of ingested events may look as if no events were recorded, but the events are actually there -- they're processed by Snuba and you can view them in the web UI.
>
> Can anyone reading this confirm that that's what happened on your instances as well?

I didn't see any errors in the Issues tab. I had to rebuild from a server snapshot to "fix" this problem, so it wasn't just the statistics that were affected.

williamdes commented 6 months ago

> Okay so I'm able to replicate this issue on my instance (24.3.0). What happens is that Sentry does accept transaction/error/profile/replay/attachment data, but it doesn't record it in the statistics. So your stats of ingested events may look as if no events were recorded, but the events are actually there -- they're processed by Snuba and you can view them in the web UI.

That seems to confirm my ClickHouse stats: [screenshots]

But the workers seem to exit for some odd reason, so I have to restart them whenever the stats show that no rows are being inserted.

le0pard commented 6 months ago

> If it's caused by the rust-consumers, roll back to the old Python consumers.

It is not the Rust consumers, because I had the same issues with version 24.2.0, which used the Python consumers.

hubertdeng123 commented 6 months ago

As another data point, it appears that our Sentry instance is correctly ingesting events. However, the Stats page is showing 0 accepted/filtered/dropped since the day that rust consumers were merged into master

hubertdeng123 commented 6 months ago

Hopefully this PR solves the Stats page issue.

https://github.com/getsentry/self-hosted/pull/2908

le0pard commented 6 months ago

@hubertdeng123 errors also stop showing in the Sentry UI after some time (gaps with no data; after that I restart the server and it starts accepting again).

[screenshots]

And the same issue happens with 24.2.0. I migrated to 24.3.0 thinking the Rust consumers would fix this issue, but apparently not.

aldy505 commented 6 months ago

Chatted with Hubert over Discord; he says that on Sentry's self-hosted dogfood instance (probably self-hosted.getsentry.dev) they're able to ingest events and the events show up, but the stats have been 0 since the rust-consumer was merged to master (they use nightly).

Quoting:

> I've followed up with the owners of the rust-consumers for more details; on our instance it seems to correspond to the exact days when rust-consumers were merged to master

Might need some time to find out what's wrong.

aldy505 commented 6 months ago

> @hubertdeng123 errors also stop showing in the Sentry UI after some time (gaps with no data; after that I restart the server and it starts accepting again).
>
> And the same issue happens with 24.2.0. I migrated to 24.3.0 thinking the Rust consumers would fix this issue, but apparently not.

Could it be that you have Kafka hiccups on your machine? Can you see topic lag in your Kafka container?
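
For reference, consumer lag can be inspected from inside the Kafka container; a sketch assuming the Confluent Kafka image and the kafka service name used by the stock compose file:

# show per-partition lag for every consumer group
docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --describe --all-groups

# a LAG column that keeps growing for a group (e.g. snuba-consumers) points at a stuck consumer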

le0pard commented 6 months ago

I did not have these issues until the latest updates to new versions (no hardware change on my side). I see that the latest versions increase the number of consumers.

Initially I thought it was happening because of defunct Java processes, which grow over time (https://github.com/getsentry/self-hosted/issues/2567#issuecomment-1997858542), but reverting the healthcheck to the previous version got rid of those defunct Java processes. Even so, transactions and errors stopped showing again yesterday.

I am ready to check whether Kafka has hiccups, just provide me with the commands :) @aldy505

BTW I don't see a big CPU load; maybe there is not enough memory (16 GB on the machine), but that amount is close to 100% consumed when docker compose starts everything.

[screenshots]

williamdes commented 6 months ago

> It is not the Rust consumers

For the exception reported in the issue description, it is due to the Rust workers. Does anyone know why the Python workers did not have this premature HTTP connection closing error?

DarkByteZero commented 6 months ago

I have the same error; with 24.3.0 Sentry is nonfunctional, I will revert to a backup now.

2024.03.25 13:15:44.444996 [ 47 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
2. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6f0b in /usr/bin/clickhouse
3. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse
4. Poco::Net::TCPServerConnection::start() @ 0x13c5614f in /usr/bin/clickhouse
5. Poco::Net::TCPServerDispatcher::run() @ 0x13c57bda in /usr/bin/clickhouse
6. Poco::PooledThread::run() @ 0x13d89e59 in /usr/bin/clickhouse
7. Poco::ThreadImpl::runnableEntry(void*) @ 0x13d860ea in /usr/bin/clickhouse
8. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
9. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 21.8.13.1.altinitystable (altinity build))
2024.03.25 13:15:45.469271 [ 47 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
2. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6f0b in /usr/bin/clickhouse
3. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse
4. Poco::Net::TCPServerConnection::start() @ 0x13c5614f in /usr/bin/clickhouse
5. Poco::Net::TCPServerDispatcher::run() @ 0x13c57bda in /usr/bin/clickhouse
6. Poco::PooledThread::run() @ 0x13d89e59 in /usr/bin/clickhouse
7. Poco::ThreadImpl::runnableEntry(void*) @ 0x13d860ea in /usr/bin/clickhouse
8. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
9. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 21.8.13.1.altinitystable (altinity build))

EDIT: switching to the non-Rust consumers fixed the problem, but the stats view is broken now.

aldy505 commented 6 months ago

> I have the same error; with 24.3.0 Sentry is nonfunctional, I will revert to a backup now.
>
> EDIT: switching to the non-Rust consumers fixed the problem, but the stats view is broken now.

@DarkByteZero for the broken stats view, try adding one more Snuba consumer from this PR: https://github.com/getsentry/self-hosted/pull/2909

DarkByteZero commented 6 months ago

After using the new Snuba consumer from https://github.com/getsentry/self-hosted/pull/2909 and reverting to the Python consumer, everything is working now. My statistics view is complete again, even retroactively.

barisyild commented 5 months ago

I have the same problem and sentry no longer accepts new issues.

williamdes commented 5 months ago

> I have the same problem and sentry no longer accepts new issues.

I made some changes to my relay to add more caching. Are you sure the relay is accepting envelopes?

barisyild commented 5 months ago

> I have the same problem and sentry no longer accepts new issues.
>
> I made some changes to my relay to add more caching. Are you sure the relay is accepting envelopes?

All of a sudden my server stopped processing incoming events and started giving the error "envelope buffer capacity exceeded"; after restarting a few times it got better, but I'm not sure whether it's related to this issue.

williamdes commented 5 months ago

Same error; that's why I found https://github.com/getsentry/self-hosted/issues/1929#issuecomment-1404785112 and applied that setting along with some others. Maybe also try this setting.

combrs commented 5 months ago

Same error on a freshly installed 24.4.1:

# git log --oneline -1
2fe5499 (HEAD, tag: 24.4.1) release: 24.4.1

Only git checkout, run install.sh, docker compose up -d. No projects, no connected clients, no ingestion at all, and after ~1 day (a no-op run) the ClickHouse logs were this size:

-rw-r----- 1 systemd-network systemd-journal 687M Apr 26 14:07 clickhouse-server.err.log
-rw-r----- 1 systemd-network systemd-journal 687M Apr 26 14:07 clickhouse-server.log

They are flooded with errors like the ones in the UPD in the issue description. If I start docker compose without -d, the first appearance of this ClickHouse error is here:

snuba-generic-metrics-sets-consumer-1           | {"timestamp":"2024-04-26T11:25:10.171809Z","level":"INFO","fields":{"message":"Starting Rust consumer","consumer_config.storages":"[StorageConfig { name: \"generic_metrics_sets_raw\", clickhouse_table_name: \"generic_metric_sets_raw_local\", clickhouse_cluster: ClickhouseConfig { host: \"clickhouse\", port: 9000, http_port: 8123, user: \"default\", password: \"\", database: \"default\" }, message_processor: MessageProcessorConfig { python_class_name: \"GenericSetsMetricsProcessor\", python_module: \"snuba.datasets.processors.generic_metrics_processor\" } }]"},"target":"rust_snuba::consumer"}
snuba-generic-metrics-sets-consumer-1           | {"timestamp":"2024-04-26T11:25:10.171916Z","level":"INFO","fields":{"message":"Starting consumer for \"generic_metrics_sets_raw\"","storage":"generic_metrics_sets_raw"},"target":"rust_snuba::consumer"}
snuba-transactions-consumer-1                   | {"timestamp":"2024-04-26T11:25:10.208491Z","level":"INFO","fields":{"message":"New partitions assigned: {Partition { topic: Topic(\"transactions\"), index: 0 }: 0}"},"target":"rust_arroyo::processing"}
snuba-generic-metrics-distributions-consumer-1  | {"timestamp":"2024-04-26T11:25:10.554345Z","level":"INFO","fields":{"message":"skipping write of 0 rows"},"target":"rust_snuba::strategies::clickhouse"}
clickhouse-1                                    | 2024.04.26 11:25:10.562495 [ 237 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    |
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6f0b in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse

Replacing 'rust-consumer' with 'consumer' (as suggested in comments on other issues) "solves" the problem without needing a full version downgrade.

tmpkn commented 5 months ago

I confirm that the fix mentioned by @combrs works:

%s/rust-consumer/consumer/g on your docker-compose.yaml and the problem goes away
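
The same substitution as a one-liner, for anyone not editing in vim (back up the file first; the compose file may be named docker-compose.yml in your checkout):

# swap every rust-consumer command back to the Python consumer, then recreate the services
sed -i.bak 's/rust-consumer/consumer/g' docker-compose.yml
docker compose up -d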

le0pard commented 5 months ago

[screenshots]

This is what happens in my case: a jump in load to 75 (no big amount of errors or transactions at this time), a small drop in memory (some process died?), and continuous memory growth after that (it looks like some consumer for Redis died and Redis starts eating memory).

williamdes commented 5 months ago

For reference, I started using KeyDB (by Snapchat) to replace Redis; it's pretty much a drop-in replacement and seems to work great. Redis writes pretty often. Did anyone tweak Redis?

le0pard commented 5 months ago

No tweaks for Redis. It is not a Redis issue: transaction events are pushed to Redis, but it looks like the consumers are dead, which is why memory is growing. It is an effect of the problem, not the cause.

williamdes commented 5 months ago
2024.04.29 10:29:52.619697 [ 3391719 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):

And in the Rust workers: {"timestamp":"2024-04-29T09:44:59.918768Z","level":"INFO","fields":{"message":"skipping write of XYZ rows"},"target":"rust_snuba::strategies::clickhouse"}

I am really not sure why this occurs

Edit: this changed with https://github.com/getsentry/snuba/pull/5838

williamdes commented 5 months ago

Here is my relay/config.yml:

# See: https://github.com/getsentry/sentry-docs/blob/master/docs/product/relay/options.mdx

relay:
  upstream: "http://web:9000/"
  host: 0.0.0.0
  port: 3000
  override_project_ids: false
logging:
  level: WARN
http:
  timeout: 60
  connection_timeout: 60
  max_retry_interval: 60
  host_header: 'sentry.xxxxxx.xx'
processing:
  enabled: true
  kafka_config:
    - {name: "bootstrap.servers", value: "kafka:9092"}
    - {name: "message.max.bytes", value: 50000000} # 50MB
  redis: redis://redis:6379
  geoip_path: "/geoip/GeoLite2-City.mmdb"
cache:
  envelope_buffer_size: 80000  # queue up to 80,000 envelopes
  eviction_interval: 120
  project_grace_period: 3600 # One hour
  envelope_expiry: 1200 # 20 minutes
  batch_interval: 300
  file_interval: 30
spool:
  envelopes:
#    path: /var/lib/sentry-envelopes/files
    max_memory_size: 1GB
    max_disk_size: 4GB
    max_connections: 20
    min_connections: 10

You will notice the jump from 2-3K lines/s to 13K lines/s. Was Sentry struggling to keep up? Now it all feels way more stable than before. Maybe this was the relay crashing.

[screenshot]

williamdes commented 4 months ago

I noticed that restarting all containers does some kind of state reset, but after some time the relay seems unable to handle its own load: envelope buffer capacity exceeded, since the config is missing. I suspect the root cause is this error in the relay logs:

can't fetch project states failure_duration="42716 seconds" backoff_attempts=7
relay-1  | 2024-05-11T10:53:22.079788Z ERROR relay_server::services::project_upstream: error fetching project states error=upstream request returned error 500 Internal Server Error

I would be very happy if anyone had a clue where to look.

The web container reports 200 OK in its logs. The full relay config is above: https://github.com/getsentry/self-hosted/issues/2876#issuecomment-2084399019 I changed the http block a bit to try to mitigate this issue:

http:
  timeout: 60
  connection_timeout: 60
  max_retry_interval: 60

In the cron logs I found:

celery.beat.SchedulingError: Couldn't apply scheduled task deliver-from-outbox-control: Command # 1 (LLEN outbox.control) of pipeline caused error: OOM command not allowed when used memory > 'maxmemory

It seems that the Docker host needs the vm.overcommit_memory sysctl for Redis: https://github.com/docker-library/redis/issues/298#issuecomment-972275132

Plus, the maxmemory of Redis/KeyDB was too low.
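
Roughly what was applied here, as a sketch (the redis service name comes from the stock compose file; the maxmemory value is an example, not a recommendation):

# on the Docker host: let Redis fork for background saves even under memory pressure
sudo sysctl -w vm.overcommit_memory=1
echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/99-redis-overcommit.conf

# inspect and raise the Redis memory ceiling (0 means unlimited; CONFIG SET is not persisted across restarts)
docker compose exec redis redis-cli CONFIG GET maxmemory
docker compose exec redis redis-cli CONFIG SET maxmemory 4gb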

But

ERROR relay_server::services::project_upstream: error fetching project state cbb152173d0b4451b3453b05b58dddee: deadline exceeded errors=0 pending=88 tags.did_error=false tags.was_pending=true tags.project_key="xxxxxx"

persists

Seems like this was previously reported as https://github.com/getsentry/self-hosted/issues/1929


With the help of this incredible tcpdump command (https://stackoverflow.com/a/16610385/5155484) I managed to see the reply the web container sent:

{"configs":{},"pending":["cbb152173d0b4451b3453b05b58dddee","084e50cc07ad4b9f862a3595260d7aa1"]}

Request: POST /api/0/relays/projectconfigs/?version=3 HTTP/1.1

{"publicKeys":["cbb152173d0b4451b3453b05b58dddee","084e50cc07ad4b9f862a3595260d7aa1"],"fullConfig":true,"noCache":false}
khassad commented 4 months ago

Hi, we have the same kind of issues.

> As another data point, it appears that our Sentry instance is correctly ingesting events. However, the Stats page is showing 0 accepted/filtered/dropped since the day that rust consumers were merged into master

Same issue here: I can see transactions being ingested but nothing in the stats, be it in projects or other aggregated views.

fmiqbal commented 4 months ago

I encountered a relay problem the other day because of envelope buffer capacity exceeded, and then my disk was full. I usually resort to the nuking option, that is, deleting Kafka and Zookeeper, but it didn't immediately work; the relay kept spewing a bunch of timeout errors. After some digging I realized there was also possibly a problem with Redis, so I checked Redis and the RDB size was 2 GB. I deleted that, restarted everything, and it works again.

I still don't know what the initial problem was, though.
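
A hedged way to spot the same condition before the disk fills up (the redis service name and the /data/dump.rdb path are assumptions based on the stock setup):

# how much memory is Redis actually using, and what is its ceiling?
docker compose exec redis redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human'

# how big is the persisted dump on disk?
docker compose exec redis ls -lh /data/dump.rdb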

khassad commented 4 months ago

After upgrading to 24.4.2 I got back some stats on the Performance and Profiling pages, but still nothing on the Projects dashboard, individual projects, and related views.

khassad commented 4 months ago

> I confirm that the fix mentioned by @combrs works:
>
> %s/rust-consumer/consumer/g on your docker-compose.yaml and the problem goes away

Unfortunately this did not help in our case. We don't have any transaction or session info in the Projects-related views.

msxdan commented 4 months ago

> I confirm that the fix mentioned by @combrs works: %s/rust-consumer/consumer/g on your docker-compose.yaml and the problem goes away
>
> Unfortunately this did not help in our case. We don't have any transaction or session info in the Projects-related views.

That worked in our case, but it took about a day before it started showing anything in Performance.

williamdes commented 4 months ago

Feel free to join my Discord thread: https://discord.com/channels/621778831602221064/1243470946904445008 (join link: https://discord.gg/TvshEMuG). For now I have found a workaround that works pretty well. When the node starts complaining about the config, I run the following:

docker exec -it sentry-self-hosted-redis-1 sh -c 'redis-cli DEL relay_config'

Then do a docker compose down and docker compose up. Please let me know if this works.

liukch commented 3 months ago

I think this PR https://github.com/getsentry/self-hosted/pull/2908 would fix this issue. I don't know why it still hasn't been merged after such a long time.

williamdes commented 3 months ago

> I think this PR #2908 would fix this issue. I don't know why it still hasn't been merged after such a long time.

Reverting to the old Python code is not a solution; fixing the Rust code is.

rojinebrahimi commented 3 months ago

Hey everyone! I reverted the consumers to Python, but unfortunately I am still not able to see my transactions. Besides, I have come across some errors in ClickHouse:

2024.07.07 11:11:58.974263 [ 19750 ] {7760523f29425408e575ceb5fbd61469} <Error> TCPHandler: Code: 46. DB::Exception: Unknown function notHandled: While processing ((client_timestamp AS _snuba_timestamp) >= toDateTime('2024-06-27T11:11:58', 'Universal')) AND (_snuba_timestamp < toDateTime('2024-07-07T11:11:55', 'Universal')) AND ((project_id AS _snuba_project_id) IN tuple(17)) AND (notHandled() = 1) AND ((occurrence_type_id AS _snuba_occurrence_type_id) IN (4001, 4002, 4003)) AND (_snuba_project_id IN tuple(17)) AND ((group_id AS _snuba_group_id) IN (187756,...

Does anyone possibly know the solution or the root cause of this problem?