getsentry / self-hosted

Sentry, feature-complete and packaged up for low-volume deployments and proofs-of-concept
https://develop.sentry.dev/self-hosted/

Kafka/zookeeper fatal error when disk runs out #3133

Open sposs opened 2 weeks ago

sposs commented 2 weeks ago

Self-Hosted Version

24.5.0

CPU Architecture

x86_64

Docker Version

26.1.4

Docker Compose Version

2.27.1

Steps to Reproduce

Install self-hosted. Run out of disk space. Kafka/Zookeeper will fail, and recovery is impossible (see logs); my installation is doomed.

Expected Result

The service should not break to a point where it cannot be recovered. Maybe it should check the disk and kill itself. I'd rather lose a bunch of transactions than lose everything.
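The self-protection suggested above could live outside Kafka entirely, e.g. as a cron-driven watchdog on the host. A minimal sketch, assuming a 90% threshold, the root mount, and the container name shown in the compose output below; none of this exists in self-hosted today:

```shell
#!/bin/sh
# Hypothetical watchdog (not part of self-hosted): stop Kafka before the disk
# fills up completely, so its log dir is never corrupted by ENOSPC.
# THRESHOLD, MOUNT and the container name are assumptions; adjust as needed.
THRESHOLD=90
MOUNT=/

# Fifth column of `df -P` is the used percentage, e.g. "42%".
used_pct=$(df -P "$MOUNT" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if [ "$used_pct" -ge "$THRESHOLD" ]; then
    echo "disk ${used_pct}% used on $MOUNT, stopping Kafka"
    # Tolerate a missing container so the cron job itself never errors out.
    docker stop sentry-self-hosted-kafka-1 || true
fi
```

Stopping the broker cleanly before the write fails would keep the log directory consistent, which is cheaper than losing the whole installation.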

Actual Result

===> Launching kafka ... 
[2024-06-17 04:55:46,504] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2024-06-17 04:55:47,568] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2024-06-17 04:55:47,915] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2024-06-17 04:55:47,936] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Created data-plane acceptor and processors for endpoint : ListenerName(PLAINTEXT) (kafka.network.SocketServer)
[2024-06-17 04:55:48,020] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,033] INFO Stat of the created znode at /brokers/ids/1001 is: 1478,1478,1718600148028,1718600148028,1,0,0,72130214439944228,194,0,1478
 (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,034] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: PLAINTEXT://kafka:9092, czxid (broker epoch): 1478 (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,242] INFO [/config/changes-event-process-thread]: Starting (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2024-06-17 04:55:48,259] WARN [Controller id=1001, targetBrokerId=1001] Connection to node 1001 (kafka/172.19.0.13:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2024-06-17 04:55:48,260] WARN [RequestSendThread controllerId=1001] Controller 1001's connection to broker kafka:9092 (id: 1001 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to kafka:9092 (id: 1001 rack: null) failed.
    at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
    at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:298)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:251)
    at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
[2024-06-17 04:55:48,341] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Enabling request processing. (kafka.network.SocketServer)
[2024-06-17 04:55:48,344] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.DataPlaneAcceptor)
[2024-06-17 04:56:20,444] ERROR Error while appending records to ingest-transactions-0 in dir /var/lib/kafka/data (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.io.IOException: No space left on device
    at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
    at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
    at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
    at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
    at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:90)
    at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
    at kafka.log.LogSegment.append(LogSegment.scala:160)
    at kafka.log.LocalLog.append(LocalLog.scala:439)
    at kafka.log.UnifiedLog.append(UnifiedLog.scala:911)
    at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
    at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
    at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
    at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1277)
    at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
    at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
    at scala.collection.mutable.HashMap.map(HashMap.scala:35)
    at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1265)
    at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:868)
    at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:686)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:153)
    at java.base/java.lang.Thread.run(Thread.java:829)
[2024-06-17 04:56:20,445] WARN [ReplicaManager broker=1001] Stopping serving replicas in dir /var/lib/kafka/data (kafka.server.ReplicaManager)
[2024-06-17 04:56:20,464] WARN [ReplicaManager broker=1001] Broker 1001 stopped fetcher for partitions snuba-queries-0,outcomes-0,scheduled-subscriptions-transactions-0,events-0,cdc-0,profiles-call-tree-0,snuba-generic-metrics-sets-commit-log-0,__consumer_offsets-0,scheduled-subscriptions-events-0,outcomes-billing-0,ingest-performance-metrics-0,events-subscription-results-0,snuba-dead-letter-generic-events-0,transactions-0,snuba-dead-letter-replays-0,processed-profiles-0,snuba-dead-letter-metrics-0,snuba-attribution-0,scheduled-subscriptions-generic-metrics-distributions-0,snuba-generic-metrics-counters-commit-log-0,ingest-events-0,metrics-subscription-results-0,snuba-generic-metrics-gauges-commit-log-0,profiles-0,scheduled-subscriptions-generic-metrics-counters-0,scheduled-subscriptions-generic-metrics-sets-0,scheduled-subscriptions-generic-metrics-gauges-0,generic-metrics-subscription-results-0,snuba-transactions-commit-log-0,snuba-spans-0,ingest-replay-events-0,ingest-sessions-0,ingest-transactions-0,ingest-attachments-0,snuba-metrics-0,monitors-clock-tick-0,snuba-metrics-summaries-0,snuba-dead-letter-group-attributes-0,shared-resources-usage-0,ingest-monitors-0,ingest-occurrences-0,transactions-subscription-results-0,generic-events-0,snuba-dead-letter-generic-metrics-0,snuba-metrics-commit-log-0,ingest-metrics-0,group-attributes-0,snuba-generic-metrics-0,event-replacements-0,snuba-dead-letter-querylog-0,snuba-commit-log-0,snuba-generic-metrics-distributions-commit-log-0,ingest-replay-recordings-0,snuba-generic-events-commit-log-0,scheduled-subscriptions-metrics-0 and stopped moving logs for partitions  because they are in the failed log directory /var/lib/kafka/data. (kafka.server.ReplicaManager)
[2024-06-17 04:56:20,464] WARN Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager)
[2024-06-17 04:56:20,466] ERROR Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogManager)

And Zookeeper's logs:

Using log4j config /etc/kafka/log4j.properties
===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
Running in Zookeeper mode...
===> Running preflight checks ... 
===> Check if /var/lib/kafka/data is writable ...
===> Check if Zookeeper is healthy ...
[2024-06-17 06:00:49,813] ERROR Unable to resolve address: zookeeper:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper: Name or service not known
    at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:930)
    at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1543)
    at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1386)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1307)
    at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
    at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
    at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1204)
[2024-06-17 06:00:49,818] WARN Session 0x0 for server zookeeper:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)

Shutting down and restarting fails with "dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy". Reinstalling fails with

dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
Error in install/bootstrap-snuba.sh:3.
'$dcr snuba-api bootstrap --no-migrate --force' exited with status 1
-> ./install.sh:main:36
--> install/bootstrap-snuba.sh:source:3

Tried to follow the troubleshooting guide

sentry@workhorse:~/self-hosted$ docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
[+] Creating 1/0
 ✔ Container sentry-self-hosted-zookeeper-1  Created  0.0s
[+] Running 1/1
 ✔ Container sentry-self-hosted-zookeeper-1  Started  0.4s
dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
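Since every compose command dies on the unhealthy dependency, it can help to start Zookeeper on its own and read Docker's health status directly. A sketch, with service and container names as they appear above; the guard makes it a no-op outside a self-hosted checkout:

```shell
#!/bin/sh
# Start only Zookeeper and ask Docker why it is marked unhealthy.
# Container/service names match the compose output shown in this thread.
CONTAINER=sentry-self-hosted-zookeeper-1

# Only meaningful inside the self-hosted checkout on a host with Docker.
if command -v docker >/dev/null 2>&1 && [ -f docker-compose.yml ]; then
    docker compose up -d zookeeper
    docker inspect --format '{{.State.Health.Status}}' "$CONTAINER"
    docker compose logs --tail 50 zookeeper
else
    echo "docker or docker-compose.yml not found; run this from ~/self-hosted"
fi
```

The inspect output ("starting", "healthy", "unhealthy") plus the last log lines usually narrows down whether Zookeeper itself is broken or merely a victim of the full disk.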

Tried the nuclear option

sentry@workhorse:~/self-hosted$ docker compose down --volumes
[+] Running 13/13
 ✔ Container sentry-self-hosted-kafka-1  Removed  0.0s
 ✔ Container sentry-self-hosted-clickhouse-1  Removed  0.0s
 ✔ Container sentry-self-hosted-redis-1  Removed  0.0s
 ✔ Container sentry-self-hosted-zookeeper-1  Removed  0.1s
 ✔ Volume sentry-self-hosted_sentry-clickhouse-log  Removed  0.0s
 ✔ Volume sentry-self-hosted_sentry-vroom  Removed  0.4s
 ✔ Volume sentry-self-hosted_sentry-secrets  Removed  0.0s
 ✔ Volume sentry-self-hosted_sentry-kafka-log  Removed  0.4s
 ✔ Volume sentry-self-hosted_sentry-smtp  Removed  0.4s
 ✔ Volume sentry-self-hosted_sentry-smtp-log  Removed  0.4s
 ✔ Volume sentry-self-hosted_sentry-nginx-cache  Removed  0.4s
 ✔ Volume sentry-self-hosted_sentry-zookeeper-log  Removed  0.4s
 ✔ Network sentry-self-hosted_default  Removed  0.1s
sentry@workhorse:~/self-hosted$ docker volume rm sentry-kafka
sentry-kafka
sentry@workhorse:~/self-hosted$ docker volume rm sentry-zookeeper
sentry-zookeeper

But then reinstall fails

 Volume "sentry-self-hosted_sentry-nginx-cache"  Created
external volume "sentry-zookeeper" not found
Error in install/upgrade-clickhouse.sh:15.
'$dc up -d clickhouse' exited with status 1
-> ./install.sh:main:25
--> install/upgrade-clickhouse.sh:source:15

Event ID

No response

sposs commented 2 weeks ago

Worst thing: I've removed everything (docker system prune -a), but now the install always fails due to the missing volume.

sposs commented 2 weeks ago

Apparently, docker system prune -a does not clean up the volumes when they live in a non-standard location, and that is why reinstalling fails.
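More precisely, docker system prune -a never removes named volumes (only the --volumes flag touches anonymous ones), so the Sentry volumes survive the prune. A way to see what is left and remove it explicitly, with the volume names used earlier in this thread:

```shell
#!/bin/sh
# `docker system prune -a` leaves named volumes alone; list the Sentry ones
# that survived and remove them explicitly (names from this thread).
FILTER=sentry

if command -v docker >/dev/null 2>&1; then
    docker volume ls --quiet --filter "name=$FILTER"
    # Tolerate already-removed volumes so the cleanup can be re-run.
    docker volume rm sentry-kafka sentry-zookeeper || true
fi
```

Note that install.sh declares sentry-kafka and sentry-zookeeper as external volumes, so after removing them they must be recreated before the installer will run.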

djoeycl commented 2 weeks ago

To get it working again you need to run:

docker volume create sentry-zookeeper
docker volume create sentry-kafka