awslabs / dynamodb-cross-region-library

A library to facilitate cross-region replication with Amazon DynamoDB Streams.
Apache License 2.0

Kinesis NullPointer Exception #16

Closed DaveWK closed 8 years ago

DaveWK commented 8 years ago

Hi,

I am attempting to set up a second replication group from an existing table. The copy appears to work and takes ~16 hours. After the copy finishes, the DynamoDBReplicationConnector task shows up and is running, but it is not keeping the tables in sync.

I see this in the log:

ERROR com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Application exception.
java.lang.NullPointerException
    at com.amazonaws.services.kinesis.connectors.KinesisClientLibraryPipelinedRecordProcessor.shutdown(KinesisClientLibraryPipelinedRecordProcessor.java:160)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Also seeing this:

com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:478)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:1581)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.putItem(AmazonDynamoDBClient.java:746)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClient$20.call(AmazonDynamoDBAsyncClient.java:920)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClient$20.call(AmazonDynamoDBAsyncClient.java:916)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
    at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy8.getConnection(Unknown Source)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:706)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:467)

dymaws commented 8 years ago

Hi,

Are you building this library from source or running it from the publicly built CloudFormation template? If you aren't using the CloudFormation template, you could try running everything from the start by following the steps here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.Walkthrough.html

Anyhow, your replication group should work if you have enough ECS instances running. Are you trying to replicate the same table to multiple regions? If so, you do not need to create a second replication group. Instead, add a replica to your existing replication group: select the group, then choose "Edit"; from there you should see an option to Add Replica.

Now, after you add the replica, you should see the status eventually transition to ACTIVE for the new replication path, as well as for the entire group. I believe this is the stage you are at right now, but you are having trouble actually seeing the synced data? Did you get this working for another replication group? Do you have enough instances running inside your ECS cluster? Please refer to the section of the troubleshooting guide that outlines how to add more EC2 instances to your ECS cluster: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.Troubleshooting.html#Streams.CrossRegionRepl.Troubleshooting.StuckCreating

Finally, it sounds like you may have enough instances running, because in your description you said your DynamoDBConnector task is in the RUNNING state. However, the logs show there aren't enough connections in the connection pool, which suggests the connections aren't being closed/reused fast enough and points to resource contention. You could try upgrading the instance type of the EC2 instance inside your ECS cluster; we suggest starting with a c instance type, perhaps a c4.large for medium-high workloads.
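To make the "Timeout waiting for connection from pool" symptom concrete, here is a minimal, standard-library-only Python simulation of a bounded pool. It is only a sketch of the general mechanism, not the AWS SDK's or the replication connector's actual code; `BoundedPool`, `count_timeouts`, and all the parameter values are hypothetical names and numbers chosen for illustration.

```python
import threading
import time


class BoundedPool:
    """Toy stand-in for an HTTP connection pool with a fixed number of slots."""

    def __init__(self, max_connections, lease_timeout):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lease_timeout = lease_timeout

    def request(self, hold_seconds):
        """Lease a slot, hold it for hold_seconds, then release it.

        Returns False if no slot became free within the lease timeout,
        which is the situation the pool-timeout exception reports.
        """
        if not self._slots.acquire(timeout=self._lease_timeout):
            return False
        try:
            time.sleep(hold_seconds)  # simulated slow write on a starved host
            return True
        finally:
            self._slots.release()


def count_timeouts(max_connections, workers, hold_seconds, lease_timeout):
    """Run `workers` concurrent requests and count how many timed out."""
    pool = BoundedPool(max_connections, lease_timeout)
    results = []
    lock = threading.Lock()

    def worker():
        ok = pool.request(hold_seconds)
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(False)


if __name__ == "__main__":
    # Slow requests, small pool: waiters give up before a slot frees.
    print(count_timeouts(max_connections=2, workers=6,
                         hold_seconds=0.3, lease_timeout=0.05))
    # Same pool, fast requests: slots are reused in time.
    print(count_timeouts(max_connections=2, workers=6,
                         hold_seconds=0.01, lease_timeout=1.0))
```

In this toy model the pool size is the same in both runs; only how long each request holds its slot changes. That is the sense in which an undersized instance (slow requests) can surface as pool timeouts even though the pool limit itself has not changed.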

Let us know if you have any more questions. Thanks.

DaveWK commented 8 years ago

I am using the CF template. It appears that the issue is related to not having enough ECS instances. I was able to run the Table Copy service successfully, which confused me, since I thought that when the copy table task/service finished, it would just start another replicator Docker container with the same resources the table copy was using. It finished BOOTSTRAPPING and was allegedly in an active state.

It also appears the replication coordinator may have maxed out its resources coordinating one instance.

In summary, I replaced 2 t1.micro instances I had in the cluster with a single m3.medium instance, and it appears to be working happily now.