Closed sunxiayi closed 1 month ago
Hi @sunxiayi!
Thank you for your pull request and welcome to our community.
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
I am sorry for a partial review, but a full review depends on how some of the current comments are going to be resolved.
What is the plan with the two sets of binlog positions, once the mismatch is fixed or its absence is confirmed?
On the client side, we can RESET REPLICA and set global gtid_purged so it can resume replication from that gtid.
Right, but I am asking specifically about the two sets of binlog positions that should be equal: one from pfs.log_status and one from the binlog itself.
If the two sets match, we would use that gtid to resume replication on the client. If they are not the same, we would abort the clone, log the mismatch details, and then make a plan to fix the bug.
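A minimal sketch of that check, with every type and field name here hypothetical (the patch's actual structures are not shown in this thread):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// All names below are placeholders for illustration, not the patch's types.
struct Synchronization_coordinates {
  std::string gtid;         // GTID set reported for this position
  std::string binlog_file;  // binlog file name
  uint64_t offset;          // byte offset within binlog_file
};

// Resume replication only when the coordinates from pfs.log_status agree
// with the ones read from the binlog itself; otherwise abort and log.
bool can_resume(const Synchronization_coordinates &from_log_status,
                const Synchronization_coordinates &from_binlog) {
  if (from_log_status.gtid == from_binlog.gtid &&
      from_log_status.binlog_file == from_binlog.binlog_file &&
      from_log_status.offset == from_binlog.offset) {
    // Match: the client may RESET REPLICA, set gtid_purged to this gtid,
    // and resume replication from it.
    return true;
  }
  // Mismatch: abort the clone and record both sets for investigation.
  std::cerr << "clone: coordinate mismatch: log_status gtid="
            << from_log_status.gtid << ", binlog gtid=" << from_binlog.gtid
            << '\n';
  return false;
}
```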
@sunxiayi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
When synchronizing engines, get the synchronization coordinates from the `P_S.log_status` table and send them from the server plugin to the client plugin. Upon receiving them, the client plugin writes them to the file `#clone/#synchronization_coordinates`.
Approach
Protocol
Add a new response type `COM_RES_GTID_V4` and update the latest protocol version to `CLONE_PROTOCOL_VERSION_V4`.
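For illustration, a hedged sketch of what the protocol additions might look like; the existing members are elided and all numeric values are placeholders, since the real definitions live in the clone plugin's protocol headers:

```cpp
#include <cstdint>

// Illustrative only: existing responses are elided and values are
// placeholders; this patch extends the plugin's real definitions.
enum Clone_response : uint8_t {
  // ... existing COM_RES_* responses ...
  COM_RES_GTID_V4 = 42,  // new: carries the synchronization coordinates
};

// The negotiated protocol version is bumped so both ends can detect
// support for the new response type.
constexpr uint32_t CLONE_PROTOCOL_VERSION_V4 = 0x0400;  // placeholder value
```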
Common
The `synchronize_engines()` call was in `Ha_clone_common_cbk`, but it is actually only called from the server or local side, not the client. So I moved this function into the child classes (client, server, local), with the client implementation being a no-op. The reason I make this change is that I want to send the gtid from the plugin layer without calling into the engine again, and moving `synchronize_engines()` into `Server_Cbk` has the advantage that I can get the server handle, while in `Ha_clone_common_cbk` I cannot.
The log_status query and set_log_stop steps are moved into a new function, `synchronize_logs`, under `Ha_clone_common_cbk`. `populate_synchronization_coordinates` populates a Key_Values data structure of the coordinates. We want to record both the gtid from log_status and the gtid from binlog_file/offset, because they seem to be out of sync in prod and we need a way to confirm this.
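A rough sketch of the resulting hierarchy, with simplified signatures; return values and the body of `synchronize_logs()` are placeholders:

```cpp
#include <cassert>

class Ha_clone_common_cbk {
 public:
  virtual ~Ha_clone_common_cbk() = default;

  // Moved out of this class: each role now provides its own implementation.
  virtual int synchronize_engines() = 0;

 protected:
  // Shared step: run the log_status query and set_log_stop, and populate
  // the Key_Values coordinates via populate_synchronization_coordinates().
  int synchronize_logs() { return 0; /* shared implementation elided */ }
};

class Server_Cbk : public Ha_clone_common_cbk {
 public:
  int synchronize_engines() override {
    // Here the server handle is reachable (it is not from
    // Ha_clone_common_cbk), so the coordinates can be sent to the client
    // right after the logs are synchronized.
    return synchronize_logs();
  }
};

class Client_Cbk : public Ha_clone_common_cbk {
 public:
  int synchronize_engines() override {
    // No-op on the client; calling it is a bug, so error out.
    assert(false);
    return 1;  // placeholder error code
  }
};

class Local_Cbk : public Ha_clone_common_cbk {
 public:
  int synchronize_engines() override {
    // Local clone: synchronize logs, then persist the coordinates through
    // the client handle.
    return synchronize_logs();
  }
};
```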
Client
`synchronize_engines()` is a no-op and errors out upon being called.
Upon receiving `COM_RES_GTID_V4`, deserialize the coordinates and persist them in `#clone/#synchronization_coordinates`.
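A self-contained sketch of the client-side persistence step, assuming `Key_Values` is a vector of string pairs and a simple `key=value` line format (both are assumptions; the patch's actual serialization is not shown here):

```cpp
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Assumed shape of the deserialized payload: ordered key/value string pairs.
using Key_Values = std::vector<std::pair<std::string, std::string>>;

// Write the coordinates under the clone directory.
bool persist_synchronization_coordinates(const std::string &clone_dir,
                                         const Key_Values &coords) {
  std::ofstream out(clone_dir + "/#clone/#synchronization_coordinates",
                    std::ios::trunc);  // replace any file from a prior clone
  if (!out) return false;
  for (const auto &[key, value] : coords) out << key << '=' << value << '\n';
  return static_cast<bool>(out);
}
```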
Server
`synchronize_engines()` calls `synchronize_logs` and gets the server handle to send the coordinates one by one, utilizing existing helper functions.
Local
`synchronize_engines()` calls `synchronize_logs`, then gets the client handle to persist the coordinates in `#clone/#synchronization_coordinates`.
Handle version mismatch
On the server, only send the coordinates if the negotiated version is >= V4. On the client, the `#synchronization_coordinates` file is cleaned up at the start of a clone.
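A sketch of both halves of the compatibility handling; the version constant's value is a placeholder (see the protocol sketch above):

```cpp
#include <cstdio>
#include <cstdint>
#include <string>

constexpr uint32_t CLONE_PROTOCOL_VERSION_V4 = 0x0400;  // placeholder value

// Donor: only emit COM_RES_GTID_V4 to clients that negotiated V4 or later.
bool should_send_coordinates(uint32_t negotiated_version) {
  return negotiated_version >= CLONE_PROTOCOL_VERSION_V4;
}

// Recipient: drop any stale coordinates file when a new clone starts, so an
// old file can never be mistaken for this clone's output.
void clean_stale_coordinates(const std::string &clone_dir) {
  std::remove((clone_dir + "/#clone/#synchronization_coordinates").c_str());
}
```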
Test Plan:
MTR
local_create_synchronization_coordinates
remote_create_synchronization_coordinates
Local clone
Test on my debug build on a devserver after `install plugin clone SONAME 'mysql_clone.so';`:
1/ `mysql> CLONE LOCAL DATA DIRECTORY = '/home/sunxiayi/mysql/mysql-fork/_build-8.0-Debug/mysql-test/var/tmp/mysqld.1/data_new';`
2/ Check that the synchronization coordinate from log_status is the same as in the file.
Remote clone, normal
1/ Take udb35350.ftw5:3301 as the test server and install my debug build. Take udb12221.atn5:3301 as the client server and install my debug build. Install the plugin on both servers.
2/ On udb12221.atn5:3301, issue the remote clone command.
3/ Check the new file is written correctly.
4/ Repeat the clone; the file is overwritten correctly.
Remote clone, sev scenario, apply-logs
Issue a clone command using an instance whose mysql.gtid_executed table has a hole as the donor, in the raft world. This also copies from secondary to secondary, meaning the binlog is an apply-log. Check that the file on the recipient is correct.
Remote clone, donor and client version mismatch
Only update the recipient: the clone finishes, and the file is not there.
Only update the donor: the clone finishes, and the file is not there.
Update both and do a successful clone, then only update the client: the file is not there.
Differential Revision: https://phabricator.intern.facebook.com/D55614528