dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Migrations fail with no reported error #4080

Open andydunstall opened 6 days ago

andydunstall commented 6 days ago

Describe the bug

Creating a two-shard cluster (25GB per shard), then migrating slots from one shard to the other, I'm seeing the migrations fail with no reported error. SLOT-MIGRATION-STATUS reports the migration as failed, with an error of 0.

The relevant logs on the source seem to be:

streamer.cc:166] Stream timed out, inflight bytes/sent start: 75873/51627, end: 1757224440/55842
streamer.cc:166] Stream timed out, inflight bytes/sent start: 1053/55842, end: 1757224440/55842
streamer.cc:166] Stream timed out, inflight bytes/sent start: 47410/65326, end: 1637352416/65326
streamer.cc:166] Stream timed out, inflight bytes/sent start: 16859/65326, end: 1637352416/65326
...
outgoing_slot_migration.cc:135] Finish outgoing migration for node_xnohnb4jx : node_pwqmz61eg
outgoing_slot_migration.cc:135] Finish outgoing migration for node_xnohnb4jx : node_pwqmz61eg

I guess the stream timeout is the cause?

So I have a few questions:

  1. The migration is around 20GB, so I'm not sure why we're getting stream timeouts. Can the target node just not keep up? Is there anything we should do to prevent this, such as limiting the size of each migration (e.g. migrating up to N slots at a time)?
  2. Should Dragonfly retry after a stream timeout instead of failing the migration?
  3. Can Dragonfly report the cause of the error in SLOT-MIGRATION-STATUS?

To Reproduce

  1. Create a cluster with two 25GB shards, where one shard has all the slots
  2. Migrate all slots from one shard to the other

Environment (please complete the following information):

andydunstall commented 6 days ago

Trying again, it looks like Dragonfly does retry the migration, but only after it has already reported the migration as failed in SLOT-MIGRATION-STATUS.

So the control plane sees the migration has failed and cancels it (then pages an operator), meaning the retry is cancelled. It seems like either Dragonfly should retry without reporting the migration as failed, or not retry at all and leave it to the control plane?
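
A minimal sketch of the first option (all names here, such as RunMigrationWithRetries and MigrationState, are hypothetical rather than Dragonfly's real internals): retry the sync internally and only surface ERROR through the status once the retry budget is exhausted, so an external control plane never sees a transient stream timeout as a failed migration.

```cpp
#include <functional>
#include <iostream>
#include <string>

enum class MigrationState { SYNC, FINISHED, ERROR };

struct SyncResult {
  bool ok;
  std::string error;
};

// Runs up to max_attempts sync attempts; keeps the state as SYNC between
// attempts and only returns ERROR (what the status command would report)
// after the last attempt fails.
MigrationState RunMigrationWithRetries(const std::function<SyncResult()>& attempt,
                                       int max_attempts, std::string* last_error) {
  for (int i = 0; i < max_attempts; ++i) {
    SyncResult res = attempt();
    if (res.ok) return MigrationState::FINISHED;
    *last_error = res.error;  // remember the cause, but don't report failure yet
  }
  return MigrationState::ERROR;
}

int main() {
  int calls = 0;
  // Simulated sync: the first two attempts hit a stream timeout, the third succeeds.
  auto attempt = [&]() -> SyncResult {
    ++calls;
    if (calls < 3) return {false, "stream timed out"};
    return {true, ""};
  };

  std::string last_error;
  MigrationState state = RunMigrationWithRetries(attempt, 5, &last_error);
  std::cout << (state == MigrationState::FINISHED ? "FINISHED" : "ERROR")
            << " after " << calls << " attempts\n";
}
```

With this shape, the control plane only ever reacts to failures that have already exhausted Dragonfly's own retries.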

andydunstall commented 6 days ago

Metrics show the target shard was at ~100% CPU, so I guess the source is just writing faster than the target can read? In that case, is the replication timeout just too short? Or should we be doing smaller migrations (e.g. instead of one 5000-slot migration, five 1000-slot migrations)?

andydunstall commented 6 days ago

On the issue of no errors being reported in SLOT-MIGRATION-STATUS, I guess it's because OutgoingMigration::SyncFb immediately resets cntx_ after an error occurs, so OutgoingMigration::GetError returns empty even though state_ is ERROR.
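
A minimal sketch of that hypothesis (this is not the actual Dragonfly source; the class and members just mirror the names above): if the error context is dropped as soon as SyncFb records the failure, a later GetError() call has nothing to return even though the state is already ERROR.

```cpp
#include <iostream>
#include <memory>
#include <string>

class OutgoingMigrationSketch {
 public:
  void SyncFb() {
    // ... streaming loop elided; suppose the streamer reports a timeout ...
    cntx_ = std::make_shared<std::string>("stream timed out");
    state_ = State::ERROR;
    cntx_.reset();  // resetting immediately drops the error message ...
  }

  // ... so the status command has nothing to report here.
  std::string GetError() const { return cntx_ ? *cntx_ : ""; }

  bool Failed() const { return state_ == State::ERROR; }

 private:
  enum class State { SYNC, FINISHED, ERROR };
  State state_ = State::SYNC;
  std::shared_ptr<std::string> cntx_;  // stand-in for the real error context
};

int main() {
  OutgoingMigrationSketch m;
  m.SyncFb();
  // Prints state=ERROR with an empty error: the failure is visible but its cause is lost.
  std::cout << "state=" << (m.Failed() ? "ERROR" : "OK")
            << " error='" << m.GetError() << "'\n";
}
```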

kostasrim commented 6 days ago

@BorysTheDev

BorysTheDev commented 2 days ago

@andydunstall The issue where the target node processes data slower than the source node sends it should already be fixed by https://github.com/dragonflydb/dragonfly/issues/3938. Migration should work correctly even if you decide to migrate the whole DB, so there is no need to add any limitations.

Regarding restarting the migration process: we do it automatically, and the error you see in the status is the last error we detected, which is kept to help us understand the issue if we have one.

Regarding the absence of the error: we do get an error, just later. I've checked the code and found that we don't stop the streamer fiber when an error occurs, so it takes some time to report it. I'm going to fix this next. It's not critical, but it will save time.
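
A minimal sketch of that fix, assuming std::thread as a stand-in for the streamer fiber and an atomic flag as a stand-in for the shared error context: the streaming loop re-checks for a recorded error on every iteration, so it stops (and the error can be reported) promptly instead of running on until the next stream timeout.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
  std::atomic<bool> error_recorded{false};

  // Stand-in for the streamer fiber: keeps sending chunks until an error is recorded.
  std::thread streamer([&] {
    while (!error_recorded.load()) {
      // ... serialize and send the next chunk of slot data ...
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // Exiting here lets the migration surface the error right away instead of
    // waiting for another stream timeout to elapse.
  });

  // Elsewhere, a stream timeout is detected and recorded:
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  error_recorded.store(true);

  streamer.join();  // returns quickly because the loop re-checks the flag
  std::cout << "streamer stopped promptly after the error was recorded\n";
}
```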

andydunstall commented 2 days ago

Ok thanks for the update!