ControlSystemStudio / phoebus

A framework and set of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
http://phoebus.org/
Eclipse Public License 1.0

Alarm-logger - Kafka connection - shutdown_client #3033

Open GDH-ISIS opened 4 weeks ago

GDH-ISIS commented 4 weeks ago

I am looking for advice on the alarm logger (compiled 31 May 2024). We have a system running as Docker containers in which the alarm logger encounters an exception roughly once a day. The system starts fine and alarms appear as expected. There are three topics (Accelerator, Upper, Lower), manually configured in Kafka via script with auto.create=false. After a period of time, however, the alarm-logger container seems to stop with an exception message (this sometimes seems to coincide with some of our EPICS IOCs restarting, although I have no definitive evidence of that and am still investigating). Note that after restarting only the alarm logger, the services appear to catch up and the alarms appear in the Windows Phoebus client as expected.
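For context, since topic auto-creation is disabled, the three topics have to be created explicitly. The actual setup uses a script, but for illustration the equivalent with the Kafka AdminClient would look roughly like the following sketch (the broker address, partition count, and replication factor are assumptions, not the actual configuration):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateAlarmTopics
{
    public static void main(String[] args) throws Exception
    {
        Properties props = new Properties();
        // Assumed broker address; adjust for the actual deployment
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props))
        {
            // Single partition, replication factor 1: assumed single-broker settings
            List<NewTopic> topics = List.of(
                new NewTopic("Accelerator", 1, (short) 1),
                new NewTopic("Upper", 1, (short) 1),
                new NewTopic("Lower", 1, (short) 1));
            admin.createTopics(topics).all().get();
        }
    }
}
```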

Sorry to ask, but I have been investigating the configurable parameters for some time without a resolution. Any advice is much appreciated.

GDH-ISIS commented 4 weeks ago

2024-05-31 17:35:08 INFO [org.phoebus.alarm.logging.AlarmLoggingService] Alarm Logging Service (PID 1)
Commands:
  help     - Show help.
  shutdown - Shut alarm logger down and exit.

Jun 01, 2024 1:04:24 PM org.apache.kafka.streams.KafkaStreams handleStreamsUncaughtException
SEVERE: stream-client [streams-Accelerator-alarm-messages-6a72af80-55e9-4d7d-a504-a7b26dc83572] Encountered the following exception during processing and the registered exception handler opted to SHUTDOWN_CLIENT. The streams client is going to shut down now.
org.apache.kafka.streams.errors.StreamsException: org.apache.kafka.common.KafkaException: Encountered corrupt message when fetching offset 2135773 for topic-partition Accelerator-0
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:657)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:579)
Caused by: org.apache.kafka.common.KafkaException: Encountered corrupt message when fetching offset 2135773 for topic-partition Accelerator-0
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.handleInitializeCompletedFetchErrors(AbstractFetch.java:641)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.initializeCompletedFetch(AbstractFetch.java:514)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.collectFetch(AbstractFetch.java:283)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1262)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1186)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159)
	at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:1014)
	at org.apache.kafka.streams.processor.internals.StreamThread.pollPhase(StreamThread.java:962)
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:766)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:617)
	... 1 more
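The SHUTDOWN_CLIENT in the log is the response chosen by the registered StreamsUncaughtExceptionHandler. For illustration only (this is not necessarily how the alarm logger wires it up; application id, broker address, and topology here are placeholders), a Kafka Streams application can instead opt to replace the failed thread and keep the client alive:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

public class AlarmStreamsSketch
{
    public static void main(String[] args)
    {
        // Minimal sketch: all configuration values are placeholders
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-Accelerator-alarm-messages");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("Accelerator"); // a real topology would process the alarm messages

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Instead of SHUTDOWN_CLIENT, replace only the failed stream thread
        streams.setUncaughtExceptionHandler(exception -> {
            exception.printStackTrace();
            return StreamThreadExceptionResponse.REPLACE_THREAD;
        });

        streams.start();
    }
}
```

Note that if the corrupt-message fetch is persistent rather than transient, REPLACE_THREAD would retry the same offset and fail again, so this only helps if the corruption is transient.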

kasemir commented 4 weeks ago

Looks like the error is completely outside of anything we did. This is purely between the Kafka server and the Kafka client library, and unlike the alarm server and GUI client, in this case we're not even using the Kafka client library directly but leave that to the Kafka Streams API.

You could try googling for "KafkaException: Encountered corrupt message when fetching offset" to see if there's any mention of related issues in the Kafka community (I tried that but nothing sticks out).

Have you by any chance been running this setup from before the Kafka version update (https://github.com/ControlSystemStudio/phoebus/pull/2953)? In other words, is this a new issue that didn't exist before the update?

The fact that a restarted alarm logger catches up fine, and that no other GUI clients crash in the same way, both suggest that the issue is in the logger client and that the data in the Kafka server is fine. Still, maybe there's some delay or file-locking issue caused by running all this in Docker containers? Kafka server sends out half a message, delays, then sends the rest? Try running everything on a Linux host to see if container vs. no container makes a difference.

For what it's worth, containers make a lot of sense when you need to run 1000+ copies of the same thing on some container farm. For one instance of Kafka, one or three alarm servers, and one alarm logger, I'm not sure you gain anything from containers.
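As a stopgap while the root cause is unknown, and given that a manual restart recovers cleanly: if the process exits when the streams client dies, the container runtime can restart it automatically. A minimal sketch of a state listener that turns the streams ERROR state into a process exit (the class, method name, and exit code are illustrative, and the listener must be installed before streams.start()):

```java
import org.apache.kafka.streams.KafkaStreams;

public class ExitOnStreamsError
{
    // Install before streams.start(); 'streams' is a configured KafkaStreams instance
    public static void installExitOnError(KafkaStreams streams)
    {
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.ERROR)
            {
                System.err.println("Streams client failed; exiting so the supervisor can restart the container");
                System.exit(1); // non-zero exit so a Docker 'restart: on-failure' policy triggers
            }
        });
    }
}
```

On the container side, a restart policy such as docker-compose's restart: on-failure (or docker run --restart on-failure) would then bring the logger back up, which matches the observation that a restarted logger catches up fine.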

GDH-ISIS commented 2 weeks ago

Will investigate further and report back.