Closed. freedomofkeima closed this issue 8 years ago
Hi,
Thank you for providing the detailed logs. We are actively investigating this issue and expect to push out a fix shortly. In the meantime, it would help us if you could provide us with more details regarding your table, usage of the cross-region replication library and additional logs.
To make communication easier, could you please reach out to AWS support (https://aws.amazon.com/premiumsupport/) and have them contact the DynamoDB team? They will be able to make sharing the data more convenient.
We will update the issue once the root cause has been identified.
Hi,
Thanks for your reply. I've opened a case with AWS support (Case ID 1612345641), but I haven't received any further replies in the last three days.
In the last few days, I've tried switching the instance type from t2.micro to m3.medium, but the problem still persists. The steps I followed are as follows:
[...] New and old images view type).
[...] CREATING to ACTIVE. Therefore, I use a larger instance type (m3.medium) for this particular step.
[...] TableCopy has reached terminal state, running callback for TaskStatus COMPLETE message from the CloudWatch log, DynamoDBTableCopy will stop and DynamoDBReplicationConnector will start. At this point, I changed the write throughput from 1000 at (2) to 10.
[...] running in both Docker and the ECS agent. In addition, there are no additional log events in CloudWatch.
Thank you for your assistance.
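For reference, the step that lowers the replica's write throughput from 1000 to 10 after the table copy could be scripted. The following is only a minimal sketch, assuming a client object that behaves like a boto3 DynamoDB client (`describe_table` and `update_table`); the helper name `set_write_capacity` and the table name in the usage note are hypothetical and not part of the replication library.

```python
def set_write_capacity(client, table_name, write_units):
    """Change a table's provisioned write capacity, keeping read capacity as is.

    `client` is assumed to behave like a boto3 DynamoDB client.
    """
    # update_table needs both read and write units, so look up the current read value.
    table = client.describe_table(TableName=table_name)["Table"]
    read_units = table["ProvisionedThroughput"]["ReadCapacityUnits"]
    return client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
```

With boto3 installed, this might be called as `set_write_capacity(boto3.client("dynamodb"), "my-replica-table", 10)`, where the table name is a placeholder.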
There are some sporadic errors in DynamoDBCrossRegionReplicationConnectors but I'm not sure whether these errors are related or not (because the timestamp is different).
com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Application exception. java.lang.IllegalArgumentException: Application didn't checkpoint at end of shard shardId-00000001451986853686-ed7562c6
com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Caught exception: com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Incomplete shard list: Closed shard shardId-00000001451156300838-79379b26 has no children.This can happen if we constructed the list of shards while a reshard operation was in progress.
com.amazonaws.services.dynamodbv2.streams.connectors.DynamoDBReplicationEmitter$1 - Exception emitting record: {EventID: 0392dfe6307cb091ab87e7f48de3e0b9,EventName: MODIFY,EventVersion: 1.0,EventSource: aws:dynamodb,AwsRegion: ap-northeast-1,Dynamodb: {...,SequenceNumber: 268399600000000000538999663,SizeBytes: 535,StreamViewType: NEW_AND_OLD_IMAGES}} com.amazonaws.AmazonClientException: Unable to execute HTTP request: Read timed out
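To judge whether sporadic errors like these are related to the stall, one option is to tally exception types from the exported log text and compare the counts across time windows. The following is a minimal sketch, not part of the replication library; the function name `tally_exceptions` and the regular expression are assumptions about how exceptions appear in the log text.

```python
import re
from collections import Counter

# Matches a fully qualified Java class name ending in "Exception" or "Error",
# e.g. java.lang.IllegalArgumentException. This pattern is an assumption
# about the log text, not a documented log format.
EXCEPTION_RE = re.compile(r"\b(?:[a-z_][\w$]*\.)+([A-Z]\w*(?:Exception|Error))\b")


def tally_exceptions(lines):
    """Count the simple class name of every exception mentioned in the lines."""
    counts = Counter()
    for line in lines:
        for name in EXCEPTION_RE.findall(line):
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Shortened versions of the log lines quoted above.
    sample = [
        "ShutdownTask - Application exception. java.lang.IllegalArgumentException: "
        "Application didn't checkpoint at end of shard",
        "ShutdownTask - Caught exception: com.amazonaws.services.kinesis.clientlibrary."
        "exceptions.internal.KinesisClientLibIOException: Incomplete shard list",
        "Exception emitting record: com.amazonaws.AmazonClientException: "
        "Unable to execute HTTP request: Read timed out",
    ]
    for name, count in tally_exceptions(sample).most_common():
        print(name, count)
```

Running the same tally separately on lines before and after the time replication stopped would show whether any exception type becomes more frequent around the failure.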
Hi,
We have root caused the issue and shipped out a workaround that should resolve the issue. Please delete and re-create your replication group and monitor if this issue is still present.
Thank you.
Thanks for your quick fix. I will try to recreate the replication groups and share the result here later on.
Hello,
I've noticed two separate issues here.
1) The operation of TableCopy seems to work, but I don't see any write metrics in my CloudWatch Monitoring here. Is it intentional?
But I still received this message when I tried to decrease the write throughput:
2) There's a problem with DynamoDBReplicationConnector.
Upon further inspection into the instance, I found the following logs:
--2016-01-13 03:05:15-- https://s3.amazonaws.com/dynamodb-cross-region/DynamoDBConnectors.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.34.40
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.34.40|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-01-13 03:05:18 ERROR 403: Forbidden.
UPD:
It seems to work miraculously 20 minutes later. I will report it here again if something happens.
Problem 1 only occurs during DynamoDBTableCopy task.
It's working properly. Thank you for the fix, I'll close this issue.
Initial condition
I have 6 DynamoDB tables that are replicated across two regions. After several weeks of running without this problem, I realized that the DynamoDB replication console does not replicate LSIs, so I need to create the replica tables myself.
I removed all replication groups from the replication console and deleted all tables from the replica region. After that, I recreated those tables from scratch (with the same names as before) with LSIs. Then, I recreated the replication groups with the existing tables in both regions.
Problem
After the DynamoDBTableCopy task finishes, DynamoDBReplicationConnector is executed. However, at a certain point, replication for 3 out of 6 tables silently stopped.
As you can see in the screenshot, it works for an hour and a half before it fails. I've tried removing the replication group again, removing the KCL checkpoint table for those replication groups, and recreating the group, to no avail.
In addition, I've tried increasing the number of workers from 1 to 3 (for each replication group with 1 master and 1 replica). The problem keeps occurring, even with small throughput (< 5 operations per second).
Current condition
ECS still shows that my task is running properly.
I decided to access the machine directly and ran docker ps inside it. The process is still running.
I've checked the CloudWatch log, but it does not show any error messages. The last error message I received is not related to this problem (its timestamp is more than 24 hours old).
The only pattern I notice is leaseCounter in the KCL tables. In the failed groups, leaseCounter keeps increasing periodically until it reaches 2000 in less than 24 hours (~13 hours). In the running groups, leaseCounter is less than 700.
Any idea regarding the cause of this problem? Thank you.
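The leaseCounter observation can be turned into a rough rate comparison between groups. The following is a minimal sketch under stated assumptions: the helper `lease_counter_rates`, the lease key name, and the starting value of 0 are hypothetical. It assumes two snapshots of leaseCounter values have already been read from the KCL lease table.

```python
def lease_counter_rates(before, after, elapsed_hours):
    """Return leaseCounter increments per hour for lease keys present in both snapshots.

    `before` and `after` map a lease key to its leaseCounter value.
    """
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return {
        key: (after[key] - before[key]) / elapsed_hours
        for key in before
        if key in after
    }


if __name__ == "__main__":
    # Figures from the report: a failed group reached about 2000 in about 13 hours.
    # The starting value of 0 and the lease key name are assumptions.
    rates = lease_counter_rates(
        {"failed-group-lease": 0}, {"failed-group-lease": 2000}, 13
    )
    for key, rate in rates.items():
        print(f"{key}: {rate:.1f} increments/hour")
```

For the failed groups this works out to roughly 154 increments per hour. Taking the same two-snapshot measurement on a running group over the same interval would make the difference between the two directly comparable.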