amazon-archives / cloudwatch-logs-subscription-consumer

A specialized Amazon Kinesis stream reader (based on the Amazon Kinesis Connector Library) that can help you deliver data from Amazon CloudWatch Logs to any other system in near real-time using a CloudWatch Logs Subscription Filter.

Consumer doesn't restart after instance reboot #3

Open matthewaaronthacker opened 8 years ago

matthewaaronthacker commented 8 years ago

The Elasticsearch instance got (accidentally) rebooted, and afterwards no data was being put into Elasticsearch. I checked Kinesis in the AWS Console and it showed 0 get requests. Emailed Daniel, who got back to me super quick (thanks Daniel!) and suggested I rerun this portion of the CF script: https://github.com/awslabs/cloudwatch-logs-subscription-consumer/blob/master/configuration/cloudformation/cwl-elasticsearch.template#L675-L682

Dug out the variables from the CF resources and ran that script. Looks like something is happening but I'm still not getting any data in ES. Daniel said this was probably worthy of an issue. So here it is.

I'll gladly send any logs that you need, but I don't even know what to send at this point.

Thanks, matthew

dvassallo commented 8 years ago

The "enhancement" issue here is that the CFN template should be starting the consumer application as a service and have it start automatically after a host reboot. This is currently happening for Elasticsearch and Nginx, but not for the KCL application.

@matthewaaronthacker - Note that the consumer application will restart from where it left off[1], so if it has been offline for several hours it might still be catching up. You may want to check if data is being ingested around the time when your EC2 instance restarted.
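A rough way to check that from the instance, assuming Elasticsearch's HTTP API is listening on the default port 9200, is to list the indices and watch whether the document counts are still growing:

curl -s 'http://localhost:9200/_cat/indices?v'

Running it a couple of times a few minutes apart should show docs.count increasing if the consumer is catching up.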

A few other things to check. First, confirm the consumer process is still running:

ps -ef | grep "ElasticsearchConnector"

Then look in the consumer log for healthy activity like this, in particular the ElasticsearchEmitter lines:

2015-08-31 19:12:18,300 INFO  Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:12:18,300 INFO  Worker - Sleeping ...
2015-08-31 19:12:44,981 INFO  ElasticsearchEmitter - Emitted 128 records to Elasticsearch

[1] The Kinesis buffer length is 24 hours.

matthewaaronthacker commented 8 years ago

It's definitely running. Here's what the log looks like - just seems to be sleeping and nothing else. I hope none of this stuff needs to be redacted.

2015-08-31 19:32:39,585 INFO LeaseTaker - Worker 7c6c6e8cc7fccafe:158d4b2e:14f852bcfcf:-8000 saw 1 total leases, 1 available leases, 1 workers. Target is 1 leases, I have 0 leases, I will take 1 leases
2015-08-31 19:32:40,645 INFO LeaseTaker - Worker 7c6c6e8cc7fccafe:158d4b2e:14f852bcfcf:-8000 successfully took 1 leases: shardId-000000000000
2015-08-31 19:32:47,049 INFO LeaseCoordinator - With failover time 5000ms and epsilon 25ms, LeaseCoordinator will renew leases every 1641ms and take leases every 10050ms
2015-08-31 19:32:47,052 INFO KinesisConnectorExecutorBase - ElasticsearchConnector worker created
2015-08-31 19:32:47,052 INFO KinesisConnectorExecutorBase - Starting worker in ElasticsearchConnector
2015-08-31 19:32:47,052 INFO Worker - Initialization attempt 1
2015-08-31 19:32:47,053 INFO Worker - Initializing LeaseCoordinator
2015-08-31 19:32:47,845 INFO LeaseManager - Table Elasticsearch-cloudwatch already exists.
2015-08-31 19:32:47,882 INFO Worker - Syncing Kinesis shard info
2015-08-31 19:32:48,135 INFO Worker - Starting LeaseCoordinator
2015-08-31 19:32:58,156 INFO Worker - Initialization complete. Starting worker loop.
2015-08-31 19:32:58,255 INFO LeaseTaker - Worker 0ff2051dce615aaa:-19eac647:14f853f5b9c:-8000 saw 1 total leases, 1 available leases, 1 workers. Target is 1 leases, I have 0 leases, I will take 1 leases
2015-08-31 19:32:58,281 INFO LeaseTaker - Worker 0ff2051dce615aaa:-19eac647:14f853f5b9c:-8000 successfully took 1 leases: shardId-000000000000
2015-08-31 19:32:58,392 INFO ElasticsearchEmitter - ElasticsearchEmitter using elasticsearch endpoint 127.0.0.1:9300
2015-08-31 19:32:58,483 INFO plugins - [Ajax] loaded [], sites []
2015-08-31 19:32:59,319 INFO Worker - Created new shardConsumer for : ShardInfo [shardId=shardId-000000000000, concurrencyToken=dfd20417-2ba3-44c0-aef2-43dd728217af, parentShardIds=[]]
2015-08-31 19:32:59,321 INFO BlockOnParentShardTask - No need to block on parents [] of shard shardId-000000000000
2015-08-31 19:32:59,543 INFO KinesisDataFetcher - Initializing shard shardId-000000000000 with 49553910973532647634987963264196455354009713534901944322
2015-08-31 19:33:47,774 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:33:47,774 INFO Worker - Sleeping ...
2015-08-31 19:35:09,900 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:35:09,900 INFO Worker - Sleeping ...
2015-08-31 19:36:32,986 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:36:32,986 INFO Worker - Sleeping ...
2015-08-31 19:37:46,571 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:37:46,571 INFO Worker - Sleeping ...
2015-08-31 19:38:49,378 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:38:49,378 INFO Worker - Sleeping ...
2015-08-31 19:39:10,305 INFO LeaseRenewer - getCurrentlyHeldLease not returning lease with key shardId-000000000000 because it is expired
2015-08-31 19:41:20,183 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:41:20,183 INFO Worker - Sleeping ...
2015-08-31 19:43:08,053 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:43:08,053 INFO Worker - Sleeping ...
2015-08-31 19:44:33,462 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:44:33,462 INFO Worker - Sleeping ...
2015-08-31 19:46:08,135 INFO Worker - Current stream shard assignments: shardId-000000000000
2015-08-31 19:46:08,135 INFO Worker - Sleeping ...

matthewaaronthacker commented 8 years ago

If I leave it running long enough I eventually get this:

2015-08-31 20:14:03,776 INFO ProcessTask - ShardId shardId-000000000000: getRecords threw ExpiredIteratorException - restarting after greatest seqNum passed to customer
com.amazonaws.services.kinesis.model.ExpiredIteratorException: Iterator expired. The iterator was created at time Mon Aug 31 20:08:43 UTC 2015 while right now it is Mon Aug 31 20:14:04 UTC 2015 which is further in the future than the tolerated delay of 300000 milliseconds. (Service: AmazonKinesis; Status Code: 400; Error Code: ExpiredIteratorException; Request ID: d37e15b3-501c-11e5-93e2-c7eee25b0007)
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1160)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:748)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:467)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2472)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:1126)
    at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.get(KinesisProxy.java:149)
    at com.amazonaws.services.kinesis.clientlibrary.proxies.MetricsCollectingKinesisProxyDecorator.get(MetricsCollectingKinesisProxyDecorator.java:72)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher.getRecords(KinesisDataFetcher.java:67)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecords(ProcessTask.java:240)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:117)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

dvassallo commented 8 years ago

Hmm... It looks like everything is working fine on the consumer side, but it seems to be finding nothing to read from Kinesis.

You may want to check that log data is still flowing to CloudWatch Logs and to the Kinesis stream. You can ssh to your EC2 instance and run the following commands with the AWS CLI (which should already be installed on the instance):

aws logs describe-subscription-filters \
       --log-group-name <log-group-name> --region <aws-region>

aws logs describe-log-streams \
       --log-group-name <log-group-name> --order-by LastEventTime \
       --descending --region <aws-region>

The <log-group-name> should be the CloudWatch Logs log group name that you're currently ingesting into Elasticsearch. The <aws-region> should be the AWS region (in "us-east-1" format) where your CloudWatch Logs log group resides.

The result of the describe-subscription-filters should tell you whether there is still a subscription between the log group and your Kinesis stream. The presence of a subscription filter should be enough evidence that the link still exists.

The result of the describe-log-streams command should tell you if there is data still flowing into your CloudWatch Logs log group. The first log stream in the result set should be the most recently active. You can pick its lastEventTimestamp value and convert it to a readable form with the date command. Note that you have to remove the last three digits since lastEventTimestamp is returned in milliseconds:

date -d @1437978781
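If jq happens to be installed on the instance (it may need to be installed first), the lookup and the conversion can be combined into a single hypothetical one-liner:

aws logs describe-log-streams \
       --log-group-name <log-group-name> --order-by LastEventTime \
       --descending --max-items 1 --region <aws-region> \
    | jq -r '.logStreams[0].lastEventTimestamp / 1000 | floor' \
    | xargs -I{} date -d @{}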

If the date is recent, then there is still data flowing to your CloudWatch Logs log group. If that's the case the problem might be with the configuration of the subscription filter.

One other thing to check would be the Kinesis metrics from the AWS Management Console. You may want to confirm that there is still "Write Throughput" activity. You can see these metrics by picking "Kinesis" from the "Services" toolbar and clicking on the Kinesis stream that was created by the CFN template.
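As a sketch of the same check from the CLI, you could query the stream-level IncomingRecords metric in the AWS/Kinesis namespace; the stream name and time window below are placeholders:

aws cloudwatch get-metric-statistics \
       --namespace AWS/Kinesis --metric-name IncomingRecords \
       --dimensions Name=StreamName,Value=<kinesis-stream-name> \
       --start-time 2015-08-31T00:00:00Z --end-time 2015-08-31T23:59:59Z \
       --period 3600 --statistics Sum --region <aws-region>

Non-empty Datapoints with a positive Sum would confirm that CloudWatch Logs is still writing to the stream.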

dvassallo commented 8 years ago

The ExpiredIteratorException is not concerning. Kinesis shard iterators expire after 5 minutes[1], but the Kinesis Client Library should be handling those properly by grabbing a new shard iterator.


[1] http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html

matthewaaronthacker commented 8 years ago

That all seems to be normal. There is write activity in the Kinesis stream.

(redacted) { "subscriptionFilters": [ { "filterPattern": "[version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, logstatus]", "filterName": "cwl-cfn-es-Elasticsearch-cloudwatch-KinesisSubscriptionStream-***", "roleArn": "arn:aws:iam::***:role/Elasticsearch-cloudwatch-CloudWatchLogsKinesisRole-*_", "creationTime": 1440685573566, "logGroupName": "CloudTrail/FlowLogGroup", "destinationArn": "arn:aws:kinesis:us-east-1:**_:stream/Elasticsearch-cloudwatch-KinesisSubscriptionStream-_***" } ] }

"logStreams": [
    {
        "firstEventTimestamp": 1434050750000,
        "lastEventTimestamp": 1441053097000,

date -d @1441053097
Mon Aug 31 20:31:37 UTC 2015

matthew

matthewaaronthacker commented 8 years ago

Maybe I did something wrong in the command to start the consumer? This is what I ran (redacted, but the stream name matches my Kinesis stream).

{ nohup java \
    -DkinesisInputStream=Elasticsearch-cloudwatch-KinesisSubscriptionStream-**** \
    -DregionName=us-east-1 \
    -DappName=Elasticsearch-cloudwatch \
    -Dlog4j.configuration=log4j-prod.properties \
    -DelasticsearchClusterName=elasticsearch \
    -cp /root/cloudwatch-logs-subscription-consumer-1.2.0/cloudwatch-logs-subscription-consumer-1.2.0.jar \
    com.amazonaws.services.logs.connectors.samples.elasticsearch.ElasticsearchConnector \
    > /dev/null 2>&1 & } && disown -h %1

matthew

matthewaaronthacker commented 8 years ago

Daniel, do you need any other logs or anything else from me on this? If not, I'm going to go ahead and kill the stack and recreate it. I'm not deep enough in to be losing anything of value, and it's easy enough to start over. Thanks for all your help! matthew

dvassallo commented 8 years ago

@matthewaaronthacker - Unfortunately it is still not clear why your consumer app is not finding anything in Kinesis. Feel free to tear down the cluster and bring it back up. Let us know if you run into the same issue again.

I would leave this issue open as an enhancement request: the CFN template should set up the consumer to restart automatically after an EC2 instance reboot.