Replication connector fails silently

freedomofkeima commented 8 years ago

Initial condition

I have 6 DynamoDB tables that are replicated across two regions. After several weeks of running without the problem reported here, I realized that the DynamoDB replication console does not replicate LSIs, so I need to create the replica tables myself.

I removed all replication groups from the replication console and deleted all tables from the replica region. I then recreated those tables from scratch (with the same names as before) with LSIs, and recreated the replication groups with the existing tables in both regions.

Problem

After the DynamoDBTableCopy task finishes, DynamoDBReplicationConnector starts. However, at some point, replication for 3 of the 6 tables stopped silently.

[Screenshot: 2016-01-07, 11:18:24 AM]

As you can see in the screenshot, it works for about an hour and a half before it fails. I've tried removing the replication group, deleting the KCL checkpoint table for that group, and recreating it, to no avail.

In addition, I've tried increasing the number of workers from 1 to 3 (for each replication group, with 1 master and 1 replica). The problem keeps occurring, even at low throughput (< 5 operations per second).

Current condition

ECS still shows that my task is running properly.

[Screenshot: 2016-01-07, 11:14:27 AM]

I accessed the machine directly and ran docker ps on it. The process is still running.

I've checked the CloudWatch logs, but they do not show any error messages. The last error message I received is not related to this problem (its timestamp is more than 24 hours old).

[Screenshot: 2016-01-07, 11:15:34 AM]

The only pattern I notice is the leaseCounter value in the KCL tables. In the failed groups, leaseCounter keeps increasing periodically until it reaches 2000 in less than 24 hours (~13 hours). In the running groups, leaseCounter is less than 700.

[Screenshot: 2016-01-07, 11:19:37 AM]
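
For reference, the growth rate implied by these numbers can be worked out with a small sketch. It assumes the counter started near 0 when the lease table was recreated, which is not confirmed anywhere in this thread; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def lease_counter_rate(samples):
    """Return the average leaseCounter increase per hour.

    `samples` is a list of (datetime, leaseCounter) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    hours = (t1 - t0).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / hours


# Hypothetical readings: 0 at table creation (an assumption), 2000 after ~13 hours.
start = datetime(2016, 1, 6, 22, 0)
failed_group = [(start, 0), (start + timedelta(hours=13), 2000)]
print(round(lease_counter_rate(failed_group)))  # roughly 154 per hour
```

Sampling the counter for a healthy group over the same window would allow a like-for-like comparison, since "less than 700" above comes without a time span.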

Any idea regarding the cause of this problem? Thank you.

dymaws commented 8 years ago

Hi,

Thank you for providing the detailed logs. We are actively investigating this issue and expect to push out a fix shortly. In the meantime, it would help us if you could provide us with more details regarding your table, usage of the cross-region replication library and additional logs.

To make communication easier, could you please reach out to AWS support (https://aws.amazon.com/premiumsupport/) and have them contact the DynamoDB team? They will be able to make sharing the data more convenient.

We will update the issue once the root cause has been identified.

freedomofkeima commented 8 years ago

Hi,

Thanks for your reply. I've opened a case with AWS support (Case ID 1612345641), but I haven't received any further replies for three days.

In the last few days, I've tried switching the instance type from t2.micro to m3.medium, but the problem persists. The steps I followed are:

1. Create a table with > 100k records as the master. This table receives real-time writes (around 5 to 10 operations per second) and has a stream enabled (New and old images view type).
2. Create a table with the same schema in another region.
3. In the replication console, choose (1) as the master table and (2) as the secondary table. In addition, I provisioned 100 read throughput on (1) and 1000 write throughput on (2).
4. The bootstrapping process starts with DynamoDBTableCopy. I noticed a small problem here: if I use a small instance type (t2.micro), the process fails and the replication status instantly changes from CREATING to ACTIVE. Therefore, I use a larger instance type (m3.medium) for this particular step.
5. After the "TableCopy has reached terminal state, running callback for TaskStatus COMPLETE" message appears in the CloudWatch log, DynamoDBTableCopy stops and DynamoDBReplicationConnector starts. At this point, I lowered the write throughput on (2) from 1000 to 10.
6. After anywhere from 5 minutes to 12 hours, the replication process fails silently (as seen in the screenshot above). The process is still shown as running by both Docker and the ECS agent, and no further log events appear in CloudWatch.
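
Since neither Docker, ECS, nor CloudWatch reports the failure, one way to notice the stall is an external heartbeat check. The sketch below assumes a heartbeat item is written to the master table periodically and its timestamp is read back from the replica; that heartbeat item, the function name, and the 10-minute threshold are all hypothetical and not part of the cross-region replication library.

```python
from datetime import datetime, timedelta, timezone


def replication_stalled(replica_heartbeat, now, max_lag=timedelta(minutes=10)):
    """Return True if the newest heartbeat visible in the replica is older than max_lag.

    `replica_heartbeat` is the timestamp stored in the heartbeat item as read
    from the replica table (hypothetical item, see the note above).
    """
    return now - replica_heartbeat > max_lag


now = datetime(2016, 1, 7, 11, 0, tzinfo=timezone.utc)
print(replication_stalled(now - timedelta(minutes=2), now))  # False: replica is current
print(replication_stalled(now - timedelta(hours=1), now))    # True: nothing arrived for an hour
```

The check deliberately looks only at data that reached the replica, because the process and task status stay healthy in the failure described above.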

Thank you for your assistance.

There are some sporadic errors in DynamoDBCrossRegionReplicationConnectors, but I'm not sure whether they are related (their timestamps do not match the failures).

com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Application exception. java.lang.IllegalArgumentException: Application didn't checkpoint at end of shard shardId-00000001451986853686-ed7562c6
com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Caught exception: com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Incomplete shard list: Closed shard shardId-00000001451156300838-79379b26 has no children.This can happen if we constructed the list of shards while a reshard operation was in progress.
com.amazonaws.services.dynamodbv2.streams.connectors.DynamoDBReplicationEmitter$1 - Exception emitting record: {EventID: 0392dfe6307cb091ab87e7f48de3e0b9,EventName: MODIFY,EventVersion: 1.0,EventSource: aws:dynamodb,AwsRegion: ap-northeast-1,Dynamodb: {...,SequenceNumber: 268399600000000000538999663,SizeBytes: 535,StreamViewType: NEW_AND_OLD_IMAGES}} com.amazonaws.AmazonClientException: Unable to execute HTTP request: Read timed out
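
To compare such errors across a longer log export, a rough sketch like the following can tally them by exception class. The regular expression and function name are illustrative assumptions, and the sample lines are the three messages above, shortened (the record payload in the third one is omitted).

```python
import re
from collections import Counter

# Matches a fully qualified Java class name whose simple name ends in "Exception".
EXCEPTION_RE = re.compile(r"\b(?:[a-z_]\w*\.)+([A-Z]\w*Exception)\b")


def count_exceptions(lines):
    """Count the first exception class named on each log line."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened copies of the three log messages quoted in this comment.
sample = [
    "com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - "
    "Application exception. java.lang.IllegalArgumentException: "
    "Application didn't checkpoint at end of shard",
    "com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - "
    "Caught exception: com.amazonaws.services.kinesis.clientlibrary.exceptions."
    "internal.KinesisClientLibIOException: Incomplete shard list",
    "com.amazonaws.services.dynamodbv2.streams.connectors.DynamoDBReplicationEmitter$1 - "
    "Exception emitting record: com.amazonaws.AmazonClientException: "
    "Unable to execute HTTP request: Read timed out",
]

for name, count in count_exceptions(sample).items():
    print(name, count)
```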

dymaws commented 8 years ago

Hi,

We have identified the root cause and shipped a workaround that should resolve the issue. Please delete and re-create your replication group and check whether the issue is still present.

Thank you.

freedomofkeima commented 8 years ago

Thanks for the quick fix. I will recreate the replication groups and share the result here later.

freedomofkeima commented 8 years ago

Hello,

I've noticed two separate issues here.

1) The TableCopy operation seems to work, but I don't see any write metrics in my CloudWatch monitoring. Is this intentional?

[Screenshot: 2016-01-13, 12:05:49 PM]

But I still received this message when I tried to decrease the write throughput:

[Screenshot: 2016-01-13, 12:12:48 PM]

2) There's a problem with DynamoDBReplicationConnector.

[Screenshot: 2016-01-13, 12:06:36 PM]

On further inspection of the instance, I found the following log:

--2016-01-13 03:05:15--  https://s3.amazonaws.com/dynamodb-cross-region/DynamoDBConnectors.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.34.40
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.34.40|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-01-13 03:05:18 ERROR 403: Forbidden.

Update:

It started working, seemingly on its own, 20 minutes later. I will report here again if something happens.

Problem 1 only occurs during the DynamoDBTableCopy task.

freedomofkeima commented 8 years ago

It's working properly. Thank you for the fix, I'll close this issue.

awslabs / dynamodb-cross-region-library