Interesting, the shutdown scenarios definitely need thorough testing (we have a couple of outstanding issues). This is an example of why I've not closed those issues yet :) I've been slowly hardening these as I go; those issues will live on until I'm 100% happy.
Are you able to share any test code which simulates this?
@markglh I started to implement an Akka Streams source on top of reactive-kinesis. The Source looks like this:
import akka.Done
import akka.actor.Actor.Receive
import akka.actor.Status.Failure
import akka.actor.{ActorRef, ActorSystem}
import akka.pattern.pipe
import akka.stream.{Attributes, BufferOverflowException, Outlet, SourceShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import com.typesafe.scalalogging.LazyLogging
import com.weightwatchers.reactive.kinesis.consumer.KinesisConsumer
import com.weightwatchers.reactive.kinesis.consumer.KinesisConsumer.ConsumerConf
import com.weightwatchers.reactive.kinesis.consumer.ConsumerWorker.{ConsumerShutdown, ConsumerWorkerFailure, EventProcessed, ProcessEvent}
import com.weightwatchers.reactive.kinesis.models.ConsumerEvent
/**
* The KinesisEvent is passed through the stream.
* Every event has to be committed explicitly.
*/
sealed trait KinesisEvent {
def event: ConsumerEvent
def commit(successful: Boolean = true): KinesisEvent
}
/**
* Actor based implementation of KinesisEvent.
*/
private[kinesis] case class KinesisActorEvent(event: ConsumerEvent)(implicit sender: ActorRef) extends KinesisEvent {
def commit(successful: Boolean = true): KinesisEvent = {
sender ! EventProcessed(event.sequenceNumber, successful)
this
}
}
/**
* A KinesisSourceGraph attaches to a Kinesis stream with the provided configuration and yields a Source[KinesisEvent, NotUsed] via Source.fromGraph.
* Usage:
* {{{
* val config = ConfigFactory.load()
* val consumerConfig = ConsumerConf(config.getConfig("kinesis"), "some-consumer")
* val source = Source.fromGraph(new KinesisSourceGraph(consumerConfig, system))
*
* }}}
* Assuming a configuration file like this:
* {{{
* kinesis {
* application-name = "SampleService"
* some-consumer {
* stream-name = "sample-consumer"
* }
* }
* }}}
* See reference.conf for a list of all available config options.
*
* @param config the kinesis stream configuration.
* @param actorSystem the actor system.
*/
class KinesisSourceGraph(config: ConsumerConf, actorSystem: ActorSystem) extends GraphStage[SourceShape[KinesisEvent]] with LazyLogging {
private[this] val out: Outlet[KinesisEvent] = Outlet("KinesisSource.out")
override val shape: SourceShape[KinesisEvent] = SourceShape.of(out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
// KCL will read events in batches. The stream should be able to buffer the whole batch.
private[this] val bufferSize: Int = config.kclConfiguration.getMaxRecords
// The queue to buffer events that can not be pushed downstream.
private[this] val messages = new java.util.ArrayDeque[KinesisEvent]()
// The kinesis consumer to read from.
private[this] var kinesisConsumer: Option[KinesisConsumer] = None
setHandler(out, new OutHandler {
override def onPull(): Unit = if (!messages.isEmpty) push(out, messages.poll())
})
override def preStart(): Unit = {
super.preStart()
// the underlying stage actor reference of this graph stage.
val actor = getStageActor(receive).ref
val consumer = KinesisConsumer(config, actor, actorSystem)
// start() creates a long-running Future that completes when the consumer worker finishes or fails.
import actorSystem.dispatcher
consumer.start().map(_ => Done).pipeTo(actor)
kinesisConsumer = Some(consumer)
}
override def postStop(): Unit = {
logger.info(s"Stopping Source $out. ${messages.size()} messages are buffered unprocessed.")
kinesisConsumer.foreach(_.stop())
super.postStop()
}
def receive: Receive = {
case (_, _: ProcessEvent) if messages.size > bufferSize =>
// Messages are not processed fast enough, so the buffer exceeds its maximum size.
// Fail the stage to prevent message overflow.
// Ideally we could control when the next batch is fetched.
logger.warn(s"Buffer of size $bufferSize is full. Fail the stream.")
failStage(BufferOverflowException(s"Buffer overflow (max capacity was: $bufferSize)!"))
case (actorRef: ActorRef, ProcessEvent(event)) =>
// A new event that needs to be processed.
// Always use the queue to guarantee the correct order of messages.
logger.info(s"Process event: $event")
messages.offer(KinesisActorEvent(event)(actorRef))
while (isAvailable(out) && !messages.isEmpty) push(out, messages.poll())
case (_, ConsumerShutdown(shardId)) =>
// A consumer shutdown occurs when another source is created and the Kinesis shards are rebalanced.
// This message is received once the graceful shutdown has completed.
// Once https://github.com/WW-Digital/reactive-kinesis/issues/32 is fixed, we should drop all buffered
// messages of this shard.
logger.info(s"Consumer shutdown for shard $shardId")
case (_, ConsumerWorkerFailure(failedEvents, shardId)) =>
// Sent for all events of a batch whose processing has failed (after the configured retries).
// Since proceeding is not possible, the stream is failed.
logger.error(s"Consumer worker failure for shard $shardId")
failStage(new IllegalStateException(s"Failed Events: $failedEvents for shardId: $shardId"))
case (_, Done) =>
// The KinesisConsumer has finished, so the stage is completed.
logger.info("Kinesis Consumer finished.")
completeStage()
case (_, Failure(ex)) =>
// The KinesisConsumer failed. This should also fail the stage.
logger.error("Kinesis Consumer failed", ex)
failStage(ex)
}
}
}
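Since every KinesisEvent has to be committed explicitly, a downstream stage would typically checkpoint only after its own processing has succeeded. Here is a minimal sketch of that pattern, assuming the source above; processEvent is a hypothetical placeholder for the real business logic:
import akka.NotUsed
import akka.stream.scaladsl.Flow
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical processing function - stands in for the real business logic.
def processEvent(event: ConsumerEvent): Future[Unit] = ???

// Commit after successful processing, report failed processing otherwise.
def committingFlow(implicit ec: ExecutionContext): Flow[KinesisEvent, KinesisEvent, NotUsed] =
  Flow[KinesisEvent].mapAsync(parallelism = 4) { kinesisEvent =>
    processEvent(kinesisEvent.event)
      .map(_ => kinesisEvent.commit())
      .recover { case _ => kinesisEvent.commit(successful = false) }
  }
mapAsync preserves the event order, matching the ordering guarantee the source enforces through its internal queue.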
A main class is needed which creates a simple stream and reads from it:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import com.typesafe.scalalogging.StrictLogging
import com.weightwatchers.reactive.kinesis.consumer.KinesisConsumer.ConsumerConf
import scala.concurrent.Await
import scala.concurrent.duration.Duration
/**
* Simple Test Consumer: This process will read from the configured stream.
* Start as many consumers as you like - each worker has a unique id.
*/
object TestConsumer extends StrictLogging {
def main(args: Array[String]): Unit = {
implicit val actorSystem: ActorSystem = ActorSystem("kinesis")
implicit val materializer: ActorMaterializer = ActorMaterializer()
val conf = ConsumerConf(actorSystem.settings.config.getConfig("kinesis"), "test-consumer")
val result = Source.fromGraph(new KinesisSourceGraph(conf, actorSystem))
.map { e =>
e.commit()
logger.info(s"Received: ${e.event.payload}")
}
.runWith(Sink.ignore)
Await.ready(result, Duration.Inf)
sys.exit(1)
}
}
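For completeness, the "test-consumer" block that TestConsumer expects could look like this in application.conf; the application name and stream name are placeholders, and the keys follow the scaladoc example above (see reference.conf for all available options):
kinesis {
  application-name = "TestConsumer"
  test-consumer {
    stream-name = "test-stream"
  }
}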
The log statements were produced by starting and stopping 2 TestConsumer processes reading from a Kinesis stream with 2 shards.
BTW: I saw akka streams on your road map as well. Do you want me to create a PR with this implementation?
Great, thanks for sharing that!
I don't see why not - I might struggle to get enough time to really dig into the PR in the next couple of weeks though - I'd like to really think about the implementation before we merge, especially the failure and checkpoint scenarios. What I'd like to avoid is changing the interface massively after merging.
That said, let's get the ball rolling :D
In our setup we have a Kinesis stream with 2 shards and 2 consumers (C1 and C2 in my description). Both consumers get a lease for one shard and read from the stream as expected. If C2 fails (its process dies), C1 reads from both shards. When C2 restarts, it takes over the lease for one shard again. The CheckPointWorkerActor is not killed when a lease is lost. We see the following debug output in C1:
From here on, we see this exception logged at the checkpoint interval (every 3 seconds):
Everything works as expected, since the lease is taken over by another consumer. Nevertheless I see this as a problem: the logs emit warnings for a condition that is not actually a problem. I'm also not sure whether this can lead to other unwanted behaviour.