Exception trying to fetch the data from the second Kinesis stream located in another AWS region

mikalai-t commented 4 years ago

Hello.

Description

The first input is configured to fetch logs from a Kinesis stream in the us-east-1 AWS region and is working. I was then able to configure a second input, and it even received a message, which let it pass the configuration check. After that, no more messages were received via this input. The second input's configuration is the same as the first one's, except that its stream is located in another region, eu-west-2.

Steps To Reproduce

In the region us-east-1:
- Create EC2 Autoscaling Group (min=1, max=1).
- Create ECS cluster
- Attach EC2 instance to the cluster.
- Create Graylog ECS Task and configure task role to fetch logs from any Kinesis stream.
- Start an ECS Service to run the Graylog container on the single EC2 node.
Still in the region us-east-1:
- Create CloudWatch Log group.
- Create Kinesis stream (1 shard, 24h retention).
- Create CloudWatch subscription.
In the region eu-west-2:
- Create CloudWatch Log group.
- Create Kinesis stream (1 shard, 24h retention).
- Create CloudWatch subscription.
Log in to Graylog and configure 2 AWS Kinesis/CloudWatch inputs, one for each of the aforementioned Kinesis streams.

Only the first input works. The second throws an exception:

2020-03-18 09:03:37,421 ERROR: software.amazon.kinesis.lifecycle.InitializeTask - Caught exception: 
software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49603881713016572917281669991559719292395929931979161602 used in GetShardIterator on shard shardId-000000000000 in stream <project name>-stg-logs-stream under account <account id> is invalid because it did not come from this stream. (Service: Kinesis, Status Code: 400, Request ID: <request id>)
at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:118) ~[?:?]
at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:78) ~[?:?]
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.unmarshall(AwsJsonProtocolErrorUnmarshaller.java:88) ~[?:?]
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.handle(AwsJsonProtocolErrorUnmarshaller.java:63) ~[?:?]
at software.amazon.awssdk.protocols.json.internal.unmarshall.AwsJsonProtocolErrorUnmarshaller.handle(AwsJsonProtocolErrorUnmarshaller.java:42) ~[?:?]
at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler.lambda$prepare$0(AsyncResponseHandler.java:88) ~[?:?]
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:966) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) ~[?:1.8.0_242]
at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler$BaosSubscriber.onComplete(AsyncResponseHandler.java:129) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:171) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$500(ResponseHandler.java:68) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:287) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher.complete(HandlerPublisher.java:408) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher.handlerRemoved(HandlerPublisher.java:395) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.callHandlerRemoved(AbstractChannelHandlerContext.java:972) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.DefaultChannelPipeline.callHandlerRemoved0(DefaultChannelPipeline.java:641) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.DefaultChannelPipeline.remove(DefaultChannelPipeline.java:481) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.DefaultChannelPipeline.remove(DefaultChannelPipeline.java:427) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.http.HttpStreamsHandler.removeHandlerIfActive(HttpStreamsHandler.java:328) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.http.HttpStreamsHandler.handleReadHttpContent(HttpStreamsHandler.java:189) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.http.HttpStreamsHandler.channelRead(HttpStreamsHandler.java:165) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.http.HttpStreamsClientHandler.channelRead(HttpStreamsClientHandler.java:148) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.LastHttpContentHandler.channelRead(LastHttpContentHandler.java:43) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.http2.Http2ToHttpInboundAdapter.onDataRead(Http2ToHttpInboundAdapter.java:66) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.http2.Http2ToHttpInboundAdapter.channelRead0(Http2ToHttpInboundAdapter.java:44) ~[?:?]
at software.amazon.awssdk.http.nio.netty.internal.http2.Http2ToHttpInboundAdapter.channelRead0(Http2ToHttpInboundAdapter.java:38) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.doRead0(AbstractHttp2StreamChannel.java:851) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.doBeginRead(AbstractHttp2StreamChannel.java:793) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.beginRead(AbstractHttp2StreamChannel.java:765) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.read(DefaultChannelPipeline.java:1374) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeRead(AbstractChannelHandlerContext.java:685) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.AbstractChannelHandlerContext.read(AbstractChannelHandlerContext.java:670) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher.requestDemand(HandlerPublisher.java:86) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.http.HttpStreamsHandler$1.requestDemand(HttpStreamsHandler.java:157) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher.flushBuffer(HandlerPublisher.java:311) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher.receivedDemand(HandlerPublisher.java:258) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher.access$200(HandlerPublisher.java:41) ~[?:?]
at software.amazon.awssdk.thirdparty.com.typesafe.netty.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:452) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) ~[?:?]
at software.amazon.awssdk.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]

Environment

Graylog Version: 3.2.2+2f9a628, codename Ethereal Elk
JVM: PID 19, Oracle Corporation 1.8.0_242 on Linux 4.14.165-133.209.amzn2.x86_64
Installed plugins:
- AWS plugins: 3.2.2
- Collector: 3.2.2
- Integrations: 3.2.2
- Threat Intelligence Plugin: 3.2.2
Elasticsearch Version: [LTm5Sek] version[6.8.5], pid[1], build[oss/docker/78990e9/2019-11-13T20:04:24.100411Z], OS[Linux/4.14.165-133.209.amzn2.x86_64/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.1/13.0.1+9]
MongoDB Version: 3.6.17
Browser Version: Google Chrome | 79.0.3945.88 (Official Build) (64-bit)

Thank you!

bernd commented 4 years ago

@waab76 Can you take a look at this? Thank you!

waab76 commented 4 years ago

As discussed on the AWS forums here, for every Kinesis consumer application, AWS maintains a DynamoDB table with the application configuration. The DynamoDB table name is the same as the Kinesis consumer application name. When you create an AWS Kinesis/CloudWatch input in Graylog, it uses "graylog-aws-plugin-" as the consumer application name. The Kinesis error above tends to occur when two consumers for different streams have the same application name (in the same region) and thus both try to read stream status from the same table.

@mikalai-t Could you provide screenshots showing your input configuration (from the Graylog > System/Inputs > Inputs page)? Also, could you check whether the appropriate DynamoDB tables were created in us-east-1 and eu-west-2?

mikalai-t commented 4 years ago

Hello. Sadly, I can't provide a screenshot because one of the projects was suspended and I removed the Graylog installation to reduce the monthly AWS cost. I checked the DynamoDB tables in both regions and they looked correct (the hidden parts contain different project names).

in eu-west-2:

in us-east-1:

If this information is not enough feel free to close the issue. I will re-open next time I face it.

Thank you anyway!

waab76 commented 4 years ago

The error initially reported typically happens when someone creates a new Kinesis consumer application that has the same name as an existing (or old) Kinesis consumer app and thus the new Kinesis consumer ends up trying to re-use the DynamoDB table that was tracking status for the old Kinesis consumer. Currently, we name the Kinesis consumer app based on what the user named their Input. I don't think we can stop users from re-using an old input name (and thus running into this issue), so I think we have two possible fixes to ensure users don't run into this issue:

1) When a user removes a Kinesis input, we attempt to clean up (remove) the corresponding DynamoDB table. This may not work because the user may not have given us AWS credentials that allow us to delete Dynamo tables. Also we probably should not be doing things in the user's AWS account without getting explicit permission first.

2) We move the consumer app name generation up from the KinesisConsumer class to the Input class (so the name can be stored with the input configuration) and introduce some randomness to prevent DDB table name collisions if the user creates a new Kinesis input that has a potential name collision with an old input.

I'm going to dive deeper on the second option to see if it will work as expected.
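
The second option (a randomized consumer application name that is generated once and stored with the input) could look roughly like the sketch below. It is written in Python for brevity, although the plugin itself is Java. Only the `graylog-aws-plugin-` prefix comes from this thread; the helper name, the `kinesis_app_name` key, and the 8-character suffix are hypothetical choices for illustration, not the plugin's actual code.

```python
import secrets

APP_NAME_PREFIX = "graylog-aws-plugin-"  # prefix mentioned in this thread


def ensure_consumer_app_name(input_config: dict, input_title: str) -> str:
    """Return the stored consumer app name, generating it once if missing.

    Hypothetical helper: the "kinesis_app_name" key and the suffix length
    are assumptions made for this example.
    """
    if "kinesis_app_name" not in input_config:
        suffix = secrets.token_hex(4)  # 8 random hex characters
        input_config["kinesis_app_name"] = f"{APP_NAME_PREFIX}{input_title}-{suffix}"
    return input_config["kinesis_app_name"]


# Two inputs that share a title still get different application names,
# and therefore different DynamoDB table names.
old_input, new_input = {}, {}
print(ensure_consumer_app_name(old_input, "my_input"))
print(ensure_consumer_app_name(new_input, "my_input"))
```

Because the name is persisted in the input configuration rather than recomputed, restarting an existing input would keep using its own table and checkpoint.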

mikalai-t commented 4 years ago

Well, I created most of the resources with Terraform, excluding the DynamoDB table (I didn't know exactly which resources the stack required). I think it's acceptable for Graylog to remove a table that is used only by this plugin. If the appropriate permission is not provided, the plugin could catch the "Access Denied" exception, show a notification, and write it to the log, for example.
It's still not clear to me which "table name collision" you are talking about. I showed 2 different tables in different regions...

I think I'll have to restore the Graylog ECS instance to shed light on the Kinesis inputs config ))

waab76 commented 4 years ago

The table name collision comes from a situation like this:

Create a new Kinesis input in a particular region named my_input
   1. Graylog builds a Kinesis consumer named graylog-aws-plugin-my_input (this lives inside of Graylog)
   2. AWS creates a DDB table named graylog-aws-plugin-my_input to help manage the Kinesis consumer
Allow the input to run for a while
   1. AWS uses the graylog-aws-plugin-my_input DDB table to keep track of checkpointing and such on the Kinesis stream
Remove the Kinesis input named my_input
   1. Graylog destroys its Kinesis consumer named graylog-aws-plugin-my_input
   2. AWS leaves the graylog-aws-plugin-my_input DDB table intact, just in case the consumer wants to resume processing later
Time passes
Create a new Kinesis input in the same region as before named my_input but pointing to a different Kinesis stream
   1. Graylog creates a new Kinesis consumer named graylog-aws-plugin-my_input
   2. AWS re-uses the existing graylog-aws-plugin-my_input DDB table, which has checkpointing info from the old stream, resulting in the error reported above
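
The sequence above can be replayed with a small in-memory simulation. This is a sketch in Python, not real AWS code: a dict stands in for DynamoDB, the stream names and sequence numbers are made up, and the simulation stores the source stream next to the checkpoint only to imitate the validation that Kinesis performs on its side.

```python
class InvalidArgumentError(Exception):
    """Stand-in for the Kinesis InvalidArgumentException in the log above."""


# Simulated DynamoDB: table name (= consumer app name) -> saved checkpoint.
lease_tables = {}


def start_consumer(app_name, stream, latest_sequence):
    """Start a consumer, resuming from its table if one already exists."""
    checkpoint = lease_tables.get(app_name)
    if checkpoint is None:
        # First run under this application name: create a fresh table.
        checkpoint = {"stream": stream, "sequence": latest_sequence}
        lease_tables[app_name] = checkpoint
    elif checkpoint["stream"] != stream:
        # The stored sequence number was issued by a different stream.
        raise InvalidArgumentError(
            f"StartingSequenceNumber {checkpoint['sequence']} "
            f"did not come from stream {stream}"
        )
    return checkpoint


APP = "graylog-aws-plugin-my_input"
start_consumer(APP, "old-stream", "1001")  # input created, table created
# ... the input is removed here, but the table stays behind ...
try:
    start_consumer(APP, "new-stream", "2002")  # same name, different stream
except InvalidArgumentError as err:
    print("second input fails:", err)
```

In this model, deleting the stale entry from `lease_tables` before starting the new consumer makes the second call succeed, because a fresh table is then created for the new stream.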

mikalai-t commented 4 years ago

You're right. I did indeed re-create the second (non-working) input several times, because there were no new messages in Graylog and I thought I had done something wrong when creating the input. So today I removed both DynamoDB tables, restored the Graylog ECS application, configured 2 new Kinesis inputs, and allowed Graylog to create new DDB tables. Everything is working now!

Thank you very much for your time and clarification!

waab76 commented 4 years ago

Reopening so the team can implement better handling of this error case.

stepkirk commented 4 years ago

The error initially reported typically happens when someone creates a new Kinesis consumer application that has the same name as an existing (or old) Kinesis consumer app and thus the new Kinesis consumer ends up trying to re-use the DynamoDB table that was tracking status for the old Kinesis consumer. Currently, we name the Kinesis consumer app based on what the user named their Input. I don't think we can stop users from re-using an old input name (and thus running into this issue), so I think we have two possible fixes to ensure users don't run into this issue:

1) When a user removes a Kinesis input, we attempt to clean up (remove) the corresponding DynamoDB table. This may not work because the user may not have given us AWS credentials that allow us to delete Dynamo tables. Also we probably should not be doing things in the user's AWS account without getting explicit permission first.

2) We move the consumer app name generation up from the KinesisConsumer class to the Input class (so the name can be stored with the input configuration) and introduce some randomness to prevent DDB table name collisions if the user creates a new Kinesis input that has a potential name collision with an old input.

I'm going to dive deeper on the second option to see if it will work as expected.

Is option 2 still in the works?

We have a different use-case for this change. We would like to have two different Graylog instances reading from the same Kinesis data stream in AWS. We tried setting this up but found that the 2nd Graylog instance was not pulling data from Kinesis. I think it is due to the fact that there is a single DDB table associated with the Kinesis stream? The first instance is pulling the data from Kinesis and then updating the DDB table. The 2nd Graylog instance never thinks there is any new data to pull. Is that the case?

Having a unique DDB table name per Graylog instance would solve this. Option 2 would seem to be the solution.
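
A toy model of why a per-instance name would help, assuming the usual lease behaviour of Kinesis consumer applications: within one application name, a shard is owned by one worker at a time, while different application names track their positions independently. The instance names and application names below are hypothetical, and the sketch is in Python rather than the plugin's Java.

```python
def assign_shards(instances, shards):
    """Model lease ownership: one owner per (application name, shard) pair.

    `instances` is a list of (instance_name, app_name) tuples in start order.
    """
    owners = {}  # (app_name, shard) -> instance holding the lease
    for instance, app_name in instances:
        for shard in shards:
            # The first instance to ask for a lease keeps it.
            owners.setdefault((app_name, shard), instance)

    received = {instance: [] for instance, _ in instances}
    for (_, shard), instance in owners.items():
        received[instance].append(shard)
    return received


shards = ["shardId-000000000000"]

# Both Graylog instances use the same consumer application name.
shared = assign_shards(
    [("graylog-a", "graylog-aws-plugin-my_input"),
     ("graylog-b", "graylog-aws-plugin-my_input")],
    shards,
)

# Each Graylog instance uses its own consumer application name.
unique = assign_shards(
    [("graylog-a", "graylog-aws-plugin-my_input-a"),
     ("graylog-b", "graylog-aws-plugin-my_input-b")],
    shards,
)

print(shared)
print(unique)
```

With a shared name and a single shard, only the first instance is assigned anything, which would match the symptom described above. With distinct names, both instances read the shard.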

gabricar commented 1 year ago

Any update on this item? I have the same issue: only the first input is collecting logs, and the node is always overloaded and sometimes gets killed due to a lack of memory.

Graylog Version: Graylog 5.0.3

Graylog2 / graylog-plugin-integrations