getsentry / self-hosted

Sentry, feature-complete and packaged up for low-volume deployments and proofs-of-concept
https://develop.sentry.dev/self-hosted/

24.8 + 24.9 update eats up all resources, connection issues, Python errors, ... #3377

Open romanstingler opened 1 month ago

romanstingler commented 1 month ago

Self-Hosted Version

24.9

CPU Architecture

x86_64

Docker Version

25.0.2

Docker Compose Version

2.24.5

Steps to Reproduce

Sentry, for real?? [screenshot]

romanstingler commented 1 month ago

Nice: it runs for some time, then it just dies, and I can't access the container anymore until I reboot. [screenshot]

I am getting really frustrated: since 24.3, not a single update has been released that wasn't broken.

[screenshot] Even 64 GB of RAM is not enough.

kafka-1                                         | [2024-10-10 14:10:15,832] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
kafka-1                                         | [2024-10-10 14:10:25,815] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
kafka-1                                         | [2024-10-10 14:10:26,353] INFO [SocketServer listenerType=CONTROLLER, nodeId=1001] Created data-plane acceptor and processors for endpoint : ListenerName(CONTROLLER) (kafka.network.SocketServer)
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
transactions-consumer-1                         |     return f(get_current_context(), *args, **kwargs)
transactions-consumer-1                         |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/decorators.py", line 28, in inner
transactions-consumer-1                         |     configure()
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/__init__.py", line 126, in configure
transactions-consumer-1                         |     _configure(ctx, py, yaml, skip_service_validation)
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/settings.py", line 148, in configure
transactions-consumer-1                         |     initialize_app(
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/initializer.py", line 396, in initialize_app
transactions-consumer-1                         |     setup_services(validate=not skip_service_validation)
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/initializer.py", line 448, in setup_services
transactions-consumer-1                         |     service.validate()
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/utils/services.py", line 117, in <lambda>
transactions-consumer-1                         |     context[key] = (lambda f: lambda *a, **k: getattr(self, f)(*a, **k))(key)
transactions-consumer-1                         |                                               ^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/buffer/redis.py", line 102, in validate
transactions-consumer-1                         |     validate_dynamic_cluster(self.is_redis_cluster, self.cluster)
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/utils/redis.py", line 293, in validate_dynamic_cluster
transactions-consumer-1                         |     raise InvalidConfiguration(str(e))
transactions-consumer-1                         | sentry.exceptions.InvalidConfiguration: Error -3 connecting to redis:6379. Temporary failure in name resolution.
transactions-consumer-1                         | Updating certificates in /etc/ssl/certs...
transactions-consumer-1                         | 0 added, 0 removed; done.
transactions-consumer-1                         | Running hooks in /etc/ca-certificates/update.d...
transactions-consumer-1                         | done.
transactions-consumer-1                         | Traceback (most recent call last):
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 552, in connect
transactions-consumer-1                         |     sock = self._connect()
transactions-consumer-1                         |            ^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 578, in _connect
transactions-consumer-1                         |     for res in socket.getaddrinfo(self.host, self.port, self.socket_type,
transactions-consumer-1                         |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
transactions-consumer-1                         |     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
transactions-consumer-1                         |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         | socket.gaierror: [Errno -3] Temporary failure in name resolution
transactions-consumer-1                         | 
transactions-consumer-1                         | During handling of the above exception, another exception occurred:
transactions-consumer-1                         | 
transactions-consumer-1                         | Traceback (most recent call last):
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry/utils/redis.py", line 288, in validate_dynamic_cluster
transactions-consumer-1                         |     client.ping()
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 1351, in ping
transactions-consumer-1                         |     return self.execute_command('PING')
transactions-consumer-1                         |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/redis/__init__.py", line 235, in sentry_patched_execute_command
transactions-consumer-1                         |     return old_execute_command(self, name, *args, **kwargs)
transactions-consumer-1                         |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/rb/clients.py", line 488, in execute_command
transactions-consumer-1                         |     buf = self._get_command_buffer(host_id, args[0])
transactions-consumer-1                         |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/rb/clients.py", line 355, in _get_command_buffer
transactions-consumer-1                         |     buf = CommandBuffer(host_id, connect, self.auto_batch)
transactions-consumer-1                         |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
transactions-consumer-1                         |   File "/usr/local/lib/python3.11/site-packages/rb/clients.py", line 91, in __init__
transactions-consumer-1                         |     self.connect()
transactions-consumer-1                         |   File "/usr/local^C
snuba-generic-metrics-sets-consumer-1           | %3|1718627561.014|FAIL|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.33:9092 failed: Connection refused (after 24ms in state CONNECT)
snuba-errors-consumer-1                         | {"timestamp":"2024-10-10T14:11:03.454538Z","level":"INFO","fields":{"message":"Starting consumer for \"errors\"","storage":"errors"},"target":"rust_snuba::consumer"}
worker-1                                        | 13:42:40 [WARNING] py.warnings: /.venv/lib/python3.12/site-packages/celery/app/utils.py:203: CDeprecationWarning: 
subscription-consumer-metrics-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
snuba-generic-metrics-counters-consumer-1       | {"timestamp":"2024-03-20T13:27:03.227427Z","level":"INFO","fields":{"message":"Inserted 6 rows"},"target":"rust_snuba::strategies::clickhouse"}
web-1                                           |   File "/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 789, in urlopen
subscription-consumer-transactions-1            | 18:56:50 [INFO] arroyo.processing.processor: Stopped
billing-metrics-consumer-1                      |     return self.main(*args, **kwargs)
clickhouse-1                                    | 5. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
post-process-forwarder-errors-1                 |                      ^^^^^^
snuba-profiling-functions-consumer-1            | {"timestamp":"2024-09-25T09:45:05.275491Z","level":"INFO","fields":{"message":"Inserted 6 rows"},"target":"rust_snuba::strategies::clickhouse"}
snuba-generic-metrics-distributions-consumer-1  | {"timestamp":"2024-03-20T12:36:48.012504Z","level":"INFO","fields":{"message":"Inserted 11 rows"},"target":"rust_snuba::strategies::clickhouse"}
generic-metrics-consumer-1                      | 12:33:48 [INFO] arroyo.processing.processor: New partitions assigned: {Partition(topic=Topic(name='ingest-performance-metrics'), index=0): 8395457}
attachments-consumer-1                          |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
snuba-replays-consumer-1                        | 2024-10-08 19:01:40,847 Snuba initialization took 7.816855292767286s
subscription-consumer-events-1                  |   File "/usr/src/sentry/src/sentry/utils/lazy_service_wrapper.py", line 117, in <lambda>
subscription-consumer-generic-metrics-1         | %3|1718627585.325|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.33:9092 failed: Connection refused (after 0ms in state CONNECT, 30 identical error(s) suppressed)
ingest-profiles-1                               |     return f(get_current_context(), *args, **kwargs)
post-process-forwarder-transactions-1           |            ^^^^^^^^^^^^^^^^^^^^^^^^^^
snuba-subscription-consumer-metrics-1           |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
post-process-forwarder-issue-platform-1         | 12:38:03 [INFO] sentry.post_process_forwarder.post_process_forwarder: Starting multithreaded post process forwarder
snuba-subscription-consumer-transactions-1      |   File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/connection.py", line 395, in connect
snuba-transactions-consumer-1                   | {"timestamp":"2024-03-20T13:12:18.811646Z","level":"INFO","fields":{"message":"Inserted 1 rows"},"target":"rust_snuba::strategies::clickhouse"}
snuba-profiling-profiles-consumer-1             | 2024-04-26 13:59:58,506 Snuba initialization took 19.744962955999995s
snuba-issue-occurrence-consumer-1               | {"timestamp":"2024-04-24T21:28:46.810069Z","level":"INFO","fields":{"message":"Inserted 1 rows"},"target":"rust_snuba::strategies::clickhouse"}
snuba-metrics-consumer-1                        | {"timestamp":"2024-04-10T10:05:25.922281Z","level":"INFO","fields":{"message":"Inserted 2 rows"},"target":"rust_snuba::strategies::clickhouse"}
kafka-1                                         | [2024-10-10 14:10:33,239] INFO Initialized snapshots with IDs SortedSet(OffsetAndEpoch(offset=15524, epoch=6), OffsetAndEpoch(offset=22723, epoch=6), OffsetAndEpoch(offset=29922, epoch=6), OffsetAndEpoch(offset=37121, epoch=6), OffsetAndEpoch(offset=44320, epoch=6), OffsetAndEpoch(offset=51519, epoch=6), OffsetAndEpoch(offset=58718, epoch=6), OffsetAndEpoch(offset=65917, epoch=6), OffsetAndEpoch(offset=73116, epoch=6), OffsetAndEpoch(offset=80315, epoch=6), OffsetAndEpoch(offset=87513, epoch=6), OffsetAndEpoch(offset=94712, epoch=6), OffsetAndEpoch(offset=101910, epoch=6), OffsetAndEpoch(offset=109109, epoch=6), OffsetAndEpoch(offset=116308, epoch=6), OffsetAndEpoch(offset=123507, epoch=6), OffsetAndEpoch(offset=130706, epoch=6), OffsetAndEpoch(offset=137898, epoch=6), OffsetAndEpoch(offset=145084, epoch=6), OffsetAndEpoch(offset=149344, epoch=6), OffsetAndEpoch(offset=149489, epoch=6), OffsetAndEpoch(offset=152198, epoch=6), OffsetAndEpoch(offset=152918, epoch=6), OffsetAndEpoch(offset=153562, epoch=6)) from /var/lib/kafka/data/__cluster_metadata-0 (kafka.raft.KafkaMetadataLog$)
events-consumer-1                               | 13:48:20 [INFO] arroyo.processing.processor: Partition revocation complete.
nginx-1                                         | 35.191.220.51 - - [10/Oct/2024:13:53:49 +0000] "GET /_health/ HTTP/1.1" 499 0 "-" "GoogleHC/1.0" "-"
snuba-subscription-consumer-events-1            | %3|1728568010.229|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.47:9092 failed: Connection refused (after 1ms in state CONNECT)
smtp-1                                          |    46          will write message using CHUNKING
redis-1                                         | 1:M 10 Oct 2024 14:13:31.715 * Background saving terminated with success
ingest-occurrences-1                            |   File "/usr/local/lib/python3.11/site-packages/rb/clients.py", line 91, in __init__
postgres-1                                      | 2024-10-10 14:18:52.110 UTC [23091] DETAIL:  Key (project_id, environment_id)=(21, 3) already exists.
ingest-replay-recordings-1                      |   File "/usr/local/lib/python3.11/site-packages/rb/clients.py", line 254, in get_connection
metrics-consumer-1                              |            ^^^^^^^^^^^^^^^
ingest-monitors-1                               | 13:31:00 [INFO] sentry: monitors.consumer.clock_tick (reference_datetime='2024-03-20 13:31:00+00:00')
cron-1                                          |     return __callback(*args, **kwargs)
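The dump above mixes two different failures. The `transactions-consumer` traceback dies inside `socket.getaddrinfo` with `[Errno -3] Temporary failure in name resolution` for `redis:6379`, i.e. the host name itself could not be resolved, while other consumers report `Connect to ipv4#172.18.0.33:9092 failed: Connection refused`, i.e. `kafka` did resolve but nothing was listening on port 9092 yet. A minimal sketch that tells these cases apart, assuming Python 3 run from a container on the same Compose network; `probe` is a hypothetical helper, not part of Sentry or self-hosted:

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Returns "ok", "dns_failure", "refused", "timeout", or "other".
    """
    try:
        # Name resolution first: this is the call that raised socket.gaierror
        # (Errno -3) in the transactions-consumer traceback.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Use the first resolved address; each entry is
    # (family, type, proto, canonname, sockaddr).
    family, socktype, proto, _, addr = infos[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        try:
            s.connect(addr)
        except ConnectionRefusedError:
            # Name resolved, but no process is listening on that port.
            return "refused"
        except socket.timeout:
            return "timeout"
        except OSError:
            return "other"
    return "ok"
```

For example, `probe("redis", 6379)` or `probe("kafka", 9092)`, using the service names and ports that appear in the logs. A `dns_failure` points at the container being absent from the Docker network at that moment; `refused` points at the service still starting up or having crashed.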

snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.511498Z","level":"INFO","fields":{"message":"Starting Rust consumer","consumer_config":"ConsumerConfig { storages: [StorageConfig { name: \"replays\", clickhouse_table_name: \"replays_local\", clickhouse_cluster: ClickhouseConfig { host: \"clickhouse\", port: 9000, http_port: 8123, user: \"default\", password: \"\", database: \"default\" }, message_processor: MessageProcessorConfig { python_class_name: \"ReplaysProcessor\", python_module: \"snuba.datasets.processors.replays_processor\" } }], raw_topic: TopicConfig { physical_topic_name: \"ingest-replay-events\", logical_topic_name: \"ingest-replay-events\", broker_config: {\"bootstrap.servers\": \"kafka:9092\", \"security.protocol\": \"plaintext\", \"queued.max.messages.kbytes\": \"10000\", \"queued.min.messages\": \"10000\"} }, commit_log_topic: None, replacements_topic: None, dlq_topic: Some(TopicConfig { physical_topic_name: \"snuba-dead-letter-replays\", logical_topic_name: \"snuba-dead-letter-replays\", broker_config: {\"security.protocol\": \"plaintext\", \"bootstrap.servers\": \"kafka:9092\"} }), accountant_topic: TopicConfig { physical_topic_name: \"shared-resources-usage\", logical_topic_name: \"shared-resources-usage\", broker_config: {\"security.protocol\": \"plaintext\", \"bootstrap.servers\": \"kafka:9092\"} }, max_batch_size: 50000, max_batch_time_ms: 750, env: EnvConfig { sentry_dsn: None, dogstatsd_host: None, dogstatsd_port: None, default_retention_days: 90, lower_retention_days: 30, valid_retention_days: {90, 30}, record_cogs: false, ddm_metrics_sample_rate: 0.01, project_stacktrace_blacklist: [] } }"},"target":"rust_snuba::consumer"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.512300Z","level":"INFO","fields":{"message":"Starting consumer for \"replays\"","storage":"replays"},"target":"rust_snuba::consumer"}
snuba-replays-consumer-1 | %3|1728413698.534|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Failed to resolve 'kafka:9092': Name or service not known (after 15ms in state CONNECT)
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.534271Z","level":"ERROR","fields":{"message":"librdkafka: Global error: Resolve (Local: Host resolution failure): kafka:9092/bootstrap: Failed to resolve 'kafka:9092': Name or service not known (after 15ms in state CONNECT)"},"target":"rdkafka::client"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.534507Z","level":"WARN","fields":{"message":"Ignored event 'Error' on base producer poll"},"target":"rdkafka::producer::base_producer"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.534531Z","level":"ERROR","fields":{"message":"librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down"},"target":"rdkafka::client"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.534543Z","level":"WARN","fields":{"message":"Ignored event 'Error' on base producer poll"},"target":"rdkafka::producer::base_producer"}
snuba-replays-consumer-1 | %3|1728413698.534|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Failed to resolve 'kafka:9092': Name or service not known (after 8ms in state CONNECT)
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.534939Z","level":"ERROR","fields":{"message":"librdkafka: Global error: Resolve (Local: Host resolution failure): kafka:9092/bootstrap: Failed to resolve 'kafka:9092': Name or service not known (after 8ms in state CONNECT)","error":"Global error: Resolve (Local: Host resolution failure)"},"target":"rust_arroyo::backends::kafka"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.535118Z","level":"ERROR","fields":{"message":"poll error","error":"Message consumption error: Resolve (Local: Host resolution failure)"},"target":"rust_arroyo::processing"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.535420Z","level":"ERROR","fields":{"message":"librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down","error":"Global error: AllBrokersDown (Local: All broker connections are down)"},"target":"rust_arroyo::backends::kafka"}
snuba-replays-consumer-1 | {"timestamp":"2024-10-08T18:54:58.635425Z","level":"ERROR","fields":{"error":"poll error"},"target":"rust_snuba::consumer"}

Nothing but connection errors across the 100 Docker containers that you just throw in.
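When every container starts at once, a consumer that comes up before the broker sees exactly this sequence: `Failed to resolve 'kafka:9092'` followed by `AllBrokersDown`. A common general pattern for tolerating that is to wait for the dependency with capped exponential backoff before giving up. The sketch below is illustrative only: `wait_for` and its parameters are hypothetical, and this is not how Snuba or Sentry actually handle startup.

```python
import socket
import time


def wait_for(host: str, port: int, attempts: int = 8,
             base_delay: float = 0.5, max_delay: float = 10.0) -> bool:
    """Return True once host:port accepts a TCP connection, False after `attempts` tries.

    Both name-resolution failures (socket.gaierror) and refused connections
    are treated as "not up yet"; both are subclasses of OSError.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            if attempt == attempts - 1:
                break  # no sleep after the final attempt
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return False
```

For example, `wait_for("kafka", 9092)` would keep retrying for roughly a minute with the default parameters before reporting failure, instead of failing on the first `Name or service not known`.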

clickhouse-1 | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1 | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1 | 2. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x0000000013154417 in /usr/bin/clickhouse
clickhouse-1 | 3. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1 | 4. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1 | 5. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1 | 6. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1 | 7. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1 | 8. ? @ 0x00007fd853293609 in ?
clickhouse-1 | 9. ? @ 0x00007fd8531b8353 in ?
clickhouse-1 | (version 23.8.11.29.altinitystable (altinity build))
clickhouse-1 | 2024.10.10 15:39:41.687639 [ 49 ] {} ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
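`e.code() = 107` is an errno value; on Linux, 107 is `ENOTCONN` ("Transport endpoint is not connected"). The frames show it is raised from `Poco::Net::SocketImpl::peerAddress()` while building an `HTTPServerRequest`, and that method presumably wraps the `getpeername()` system call. One plausible reading is that the HTTP client had already gone away by the time ClickHouse asked for its address. A small sketch, assuming Linux errno numbering, that reproduces the same error code by calling `getpeername()` on a socket with no peer; `peer_address_or_errno` is a hypothetical helper:

```python
import errno
import socket


def peer_address_or_errno(sock: socket.socket):
    """Return the peer address of `sock`, or the errno if there is no connected peer."""
    try:
        return sock.getpeername()
    except OSError as exc:
        return exc.errno


# A TCP socket that was never connected has no peer, so getpeername() fails.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    result = peer_address_or_errno(s)
    print(result, errno.errorcode.get(result))  # on Linux: 107 ENOTCONN
```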

kafka-1                                         | [2024-10-10 15:26:14,159] ERROR Encountered metadata publishing fault: Error deleting stray partitions during startup (org.apache.kafka.server.fault.LoggingFaultHandler)
kafka-1                                         | java.lang.RuntimeException: The log dir Log(dir=/var/lib/kafka/data/snuba-dead-letter-metrics-distributions-0, topic=snuba-dead-letter-metrics-distributions, partition=0, highWatermark=0, lastStableOffset=0, logStartOffset=0, logEndOffset=0) does not have a topic ID, which is not allowed when running in KRaft mode.
kafka-1                                         |       at kafka.server.metadata.BrokerMetadataPublisher$.$anonfun$findStrayPartitions$2(BrokerMetadataPublisher.scala:76)
kafka-1                                         |       at scala.Option.getOrElse(Option.scala:201)
kafka-1                                         |       at kafka.server.metadata.BrokerMetadataPublisher$.$anonfun$findStrayPartitions$1(BrokerMetadataPublisher.scala:75)
kafka-1                                         |       at scala.collection.StrictOptimizedIterableOps.flatMap(StrictOptimizedIterableOps.scala:118)
kafka-1                                         |       at scala.collection.StrictOptimizedIterableOps.flatMap$(StrictOptimizedIterableOps.scala:105)
kafka-1                                         |       at scala.collection.mutable.ArrayBuffer.flatMap(ArrayBuffer.scala:43)
kafka-1                                         |       at kafka.server.metadata.BrokerMetadataPublisher$.findStrayPartitions(BrokerMetadataPublisher.scala:73)
kafka-1                                         |       at kafka.server.metadata.BrokerMetadataPublisher.finishInitializingReplicaManager(BrokerMetadataPublisher.scala:353)
kafka-1                                         |       at kafka.server.metadata.BrokerMetadataPublisher.onMetadataUpdate(BrokerMetadataPublisher.scala:246)
kafka-1                                         |       at org.apache.kafka.image.loader.MetadataLoader.initializeNewPublishers(MetadataLoader.java:309)
kafka-1                                         |       at org.apache.kafka.image.loader.MetadataLoader.lambda$scheduleInitializeNewPublishers$0(MetadataLoader.java:266)
kafka-1                                         |       at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
kafka-1                                         |       at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
kafka-1                                         |       at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
kafka-1                                         |       at java.base/java.lang.Thread.run(Thread.java:829)
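The Kafka error says the partition directory for `snuba-dead-letter-metrics-distributions-0` has no topic ID, which KRaft mode does not allow. Assuming Kafka's usual on-disk layout, where each partition directory is named `<topic>-<partition>` and holds a `partition.metadata` file with a `topic_id:` line, a read-only sketch like the following could list which directories under `/var/lib/kafka/data` are affected. `partitions_without_topic_id` is hypothetical, and it deliberately does not delete or repair anything.

```python
import re
from pathlib import Path

# Assumed naming: "<topic>-<partition>", e.g. "snuba-dead-letter-metrics-distributions-0".
# Directories ending in "-delete", "-future", etc. do not match and are skipped.
PARTITION_DIR = re.compile(r"^.+-\d+$")


def partitions_without_topic_id(data_dir: str) -> list[str]:
    """List partition directories whose partition.metadata is missing or lacks a topic_id line."""
    missing = []
    for entry in sorted(Path(data_dir).iterdir()):
        if not entry.is_dir() or not PARTITION_DIR.match(entry.name):
            continue
        if entry.name.startswith("__cluster_metadata"):
            continue  # KRaft metadata log, not a regular topic partition
        meta = entry / "partition.metadata"
        if not meta.is_file():
            missing.append(entry.name)
            continue
        lines = meta.read_text().splitlines()
        if not any(line.startswith("topic_id:") for line in lines):
            missing.append(entry.name)
    return missing
```

For example, `print(partitions_without_topic_id("/var/lib/kafka/data"))` inside the Kafka container would be expected to include `snuba-dead-letter-metrics-distributions-0` if the assumption about the file layout holds.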
romanstingler commented 1 month ago

The issue is your Rust consumers. Please take the time to fix AND TEST this!!!!

https://github.com/getsentry/self-hosted/issues/2974#issuecomment-2076560194

bc-sentry commented 2 weeks ago

Could you please test with 24.11?