Open andydunstall opened 2 weeks ago
Trying again, it looks like Dragonfly does retry the migration, but after it reports that the migration failed (in SLOT-MIGRATION-STATUS).
So the control plane sees that the migration failed and cancels it (then pages an operator), which means the retry is cancelled too. It seems like Dragonfly should either retry without reporting the migration as failed, or not retry and leave it to the control plane?
Metrics show the target shard was at ~100% CPU, so I guess the source is just writing faster than the target can read? In which case, is the replication timeout just too short? Or should we be doing smaller migrations (for example 5x1000 slots instead of one 5000-slot migration)?
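For reference, the "smaller migrations" option could look roughly like this on the control-plane side. This is only a sketch: `split_slot_range` is a hypothetical helper, not part of Dragonfly, and it only computes the slot sub-ranges. How each sub-range is then pushed to the cluster is left out.

```python
def split_slot_range(start, end, chunk_size):
    """Split the inclusive slot range [start, end] into inclusive sub-ranges.

    Hypothetical control-plane helper: each returned (lo, hi) pair would be
    migrated as its own, smaller migration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if start > end:
        raise ValueError("start must not exceed end")
    ranges = []
    lo = start
    while lo <= end:
        # The last chunk may be shorter than chunk_size.
        hi = min(lo + chunk_size - 1, end)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


# A 5000-slot migration split into 5 migrations of 1000 slots each.
chunks = split_slot_range(0, 4999, 1000)
for lo, hi in chunks:
    print(f"migrate slots {lo}-{hi}")
```

The chunks would presumably be run one after another, waiting for each to finish before starting the next, so the target only has to keep up with one small stream at a time.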
On the issue of no errors being reported in SLOT-MIGRATION-STATUS, I guess it's because OutgoingMigration::SyncFb immediately resets cntx_ after an error occurs, so OutgoingMigration::GetError returns empty even though state_ is ERROR
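To make that guess concrete, here is a toy model of the pattern I mean. This is not Dragonfly's code: the class and method names are made up for illustration and only loosely mirror `cntx_`, `state_` and `GetError`. It just shows how resetting an error context right after a failure can leave the state as ERROR while the error message reads back empty.

```python
class Context:
    """Toy stand-in for an execution context that stores the first error."""

    def __init__(self):
        self.error = ""

    def report_error(self, msg):
        if not self.error:  # keep only the first error
            self.error = msg

    def reset(self):
        self.error = ""


class OutgoingMigrationModel:
    """Hypothetical model, not the real OutgoingMigration class."""

    def __init__(self):
        self.state = "SYNC"
        self.cntx = Context()
        self.last_error = ""

    def sync_step_losing_error(self, failure):
        self.cntx.report_error(failure)
        self.state = "ERROR"
        self.cntx.reset()  # the error message is dropped here

    def sync_step_keeping_error(self, failure):
        self.cntx.report_error(failure)
        self.state = "ERROR"
        self.last_error = self.cntx.error  # copy before resetting
        self.cntx.reset()

    def get_error(self):
        return self.last_error or self.cntx.error


lost = OutgoingMigrationModel()
lost.sync_step_losing_error("stream timeout")
print(lost.state, repr(lost.get_error()))

kept = OutgoingMigrationModel()
kept.sync_step_keeping_error("stream timeout")
print(kept.state, repr(kept.get_error()))
```

Copying the message out before the reset is just one possible way to keep it visible. The real fix may well look different.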
@BorysTheDev
@andydunstall The issue of the target node processing the data more slowly than the source node sends it may already be fixed by https://github.com/dragonflydb/dragonfly/issues/3938. Migration should work correctly even if you decide to migrate the whole DB, so you don't need to limit the migration size.
Regarding restarting the migration process: we do it automatically, and the error that you see in the status is the last error that we detected, to help us understand the issue if we have one.
Regarding the absence of the error: we do have an error, but it is reported later. I've checked the code and found that we don't stop the streamer fiber when an error occurs, so it takes some time to report it. I'm going to fix this next. It's not critical, but it will save time.
Ok thanks for the update!
will be fixed by https://github.com/dragonflydb/dragonfly/pull/4081
Describe the bug
After creating a two-shard cluster (25GB per shard) and migrating slots between the shards, I'm seeing migrations fail with no reported error.
SLOT-MIGRATION-STATUS reports the migration as failing with an error of 0.
The relevant logs on the source seem to be:
I guess the stream time out is the cause?
So I have a few questions:
SLOT-MIGRATION-STATUS
To Reproduce
Environment (please complete the following information):
n2d-highmem-4
main