dolthub / dolt

Dolt – Git for Data
Apache License 2.0

Add optimized diffing and three-way merge of indexed JSON Documents. #8129

Closed by nicktobey 1 month ago

nicktobey commented 1 month ago

This PR adds some additional tests, but I plan on adding more tests around large documents before merging. Still, the implementation is ready for review.

This adds a new JSON diffing algorithm designed for IndexedJSONDocument. Because three-way merge only operates on values read from a Dolt table, which are always returned as an IndexedJSONDocument, the original implementation should no longer be used.
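
For readers unfamiliar with the terminology, here is a minimal sketch of what a three-way merge of JSON objects computes. It is not the implementation in this PR: it works on plain Go maps rather than IndexedJSONDocument, it does a naive key-by-key comparison instead of the optimized diff, and the function name `threeWayMerge` and the choice to keep the left value on conflict are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// threeWayMerge is a hypothetical helper (not part of Dolt). It merges two
// edited copies of a JSON object (left, right) against their common ancestor
// (base) and returns the merged object plus the JSON paths that conflict.
func threeWayMerge(base, left, right map[string]any, prefix string) (map[string]any, []string) {
	merged := map[string]any{}
	var conflicts []string

	// Collect every key that appears in any of the three versions.
	seen := map[string]struct{}{}
	for _, m := range []map[string]any{base, left, right} {
		for k := range m {
			seen[k] = struct{}{}
		}
	}
	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the conflict list

	for _, k := range keys {
		b, inB := base[k]
		l, inL := left[k]
		r, inR := right[k]
		path := prefix + "." + k

		// A side "changed" a key if it added, removed, or modified it.
		leftChanged := inB != inL || !reflect.DeepEqual(b, l)
		rightChanged := inB != inR || !reflect.DeepEqual(b, r)

		switch {
		case !leftChanged:
			// Left kept the base value, so take whatever right has.
			if inR {
				merged[k] = r
			}
		case !rightChanged:
			// Right kept the base value, so take whatever left has.
			if inL {
				merged[k] = l
			}
		case inL == inR && reflect.DeepEqual(l, r):
			// Both sides made the same change.
			if inL {
				merged[k] = l
			}
		default:
			// Both sides changed the key differently. If both values are
			// objects, merge them recursively; otherwise report a conflict.
			lo, lok := l.(map[string]any)
			ro, rok := r.(map[string]any)
			if lok && rok {
				bo, _ := b.(map[string]any) // nil map if base had no object here
				sub, subConflicts := threeWayMerge(bo, lo, ro, path)
				merged[k] = sub
				conflicts = append(conflicts, subConflicts...)
			} else {
				conflicts = append(conflicts, path)
				if inL {
					merged[k] = l // assumption: keep the left value on conflict
				}
			}
		}
	}
	return merged, conflicts
}

func main() {
	base := map[string]any{"a": 1, "b": 1, "c": map[string]any{"x": 1, "y": 1}}
	left := map[string]any{"a": 2, "b": 2, "c": map[string]any{"x": 5, "y": 1}}
	right := map[string]any{"a": 1, "b": 3, "c": map[string]any{"x": 1, "y": 7}, "d": 4}

	merged, conflicts := threeWayMerge(base, left, right, "$")
	fmt.Println(merged, conflicts)
}
```

In this example the non-overlapping edits combine cleanly (`a` from left, `d` from right, and `c.x` and `c.y` from different sides), while `b`, which both sides set to different values, is reported as the single conflict at `$.b`.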

coffeegoddd commented 1 month ago

@nicktobey DOLT

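comparing_percentages
100.000000 to 100.000000

| version | result | total |
|---------|--------|---------|
| 260f8bf | ok | 5937457 |

| version | total_tests |
|---------|-------------|
| 260f8bf | 5937457 |

| correctness_percentage |
|------------------------|
| 100.0 |

coffeegoddd commented 1 month ago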

@nicktobey DOLT

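comparing_percentages
100.000000 to 100.000000

| version | result | total |
|---------|--------|---------|
| ef0e252 | ok | 5937457 |

| version | total_tests |
|---------|-------------|
| ef0e252 | 5937457 |

| correctness_percentage |
|------------------------|
| 100.0 |

coffeegoddd commented 1 month ago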

@nicktobey DOLT

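comparing_percentages
100.000000 to 100.000000

| version | result | total |
|---------|--------|---------|
| 5e0990f | ok | 5937457 |

| version | total_tests |
|---------|-------------|
| 5e0990f | 5937457 |

| correctness_percentage |
|------------------------|
| 100.0 |

coffeegoddd commented 1 month ago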

@coffeegoddd DOLT

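comparing_percentages
100.000000 to 100.000000

| version | result | total |
|---------|--------|---------|
| 1399c9f | ok | 5937457 |

| version | total_tests |
|---------|-------------|
| 1399c9f | 5937457 |

| correctness_percentage |
|------------------------|
| 100.0 |

github-actions[bot] commented 1 month ago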

@coffeegoddd DOLT

| test_name | detail | row_cnt | sorted | mysql_time | sql_mult | cli_mult |
|-----------|--------|---------|--------|------------|----------|----------|
| batching | LOAD DATA | 10000 | 1 | 0.06 | 1.5 | |
| batching | batch sql | 10000 | 1 | 0.08 | 1.88 | |
| batching | by line sql | 10000 | 1 | 0.08 | 1.75 | |
| blob | 1 blob | 200000 | 1 | 0.94 | 3.76 | 3.66 |
| blob | 2 blobs | 200000 | 1 | 0.93 | 4.3 | 4.17 |
| blob | no blob | 200000 | 1 | 0.97 | 2.29 | 1.96 |
| col type | datetime | 200000 | 1 | 0.84 | 2.99 | 2.8 |
| col type | varchar | 200000 | 1 | 0.71 | 3.54 | 3.21 |
| config width | 2 cols | 200000 | 1 | 0.81 | 2.52 | 2.1 |
| config width | 32 cols | 200000 | 1 | 1.9 | 2.01 | 2.47 |
| config width | 8 cols | 200000 | 1 | 0.98 | 2.4 | 2.27 |
| pk type | float | 200000 | 1 | 0.95 | 2.16 | 1.82 |
| pk type | int | 200000 | 1 | 0.78 | 2.92 | 2.17 |
| pk type | varchar | 200000 | 1 | 1.5 | 1.76 | 1.59 |
| row count | 1.6mm | 1600000 | 1 | 5.77 | 2.91 | 2.47 |
| row count | 400k | 400000 | 1 | 1.5 | 2.77 | 2.3 |
| row count | 800k | 800000 | 1 | 2.93 | 2.84 | 2.4 |
| secondary index | four index | 200000 | 1 | 3.65 | 1.35 | 1.05 |
| secondary index | no secondary | 200000 | 1 | 0.9 | 2.47 | 2.09 |
| secondary index | one index | 200000 | 1 | 1.14 | 2.43 | 2.11 |
| secondary index | two index | 200000 | 1 | 2 | 1.74 | 1.47 |
| sorting | shuffled 1mm | 1000000 | 0 | 5.3 | 2.77 | 2.46 |
| sorting | sorted 1mm | 1000000 | 1 | 5.28 | 2.77 | 2.47 |

github-actions[bot] commented 1 month ago

@coffeegoddd DOLT

| name | detail | mean_mult |
|------|--------|-----------|
| dolt_blame_basic | system table | 1.25 |
| dolt_blame_commit_filter | system table | 3.34 |
| dolt_commit_ancestors_commit_filter | system table | 0.81 |
| dolt_commits_commit_filter | system table | 1.1 |
| dolt_diff_log_join_from_commit | system table | 2.09 |
| dolt_diff_log_join_to_commit | system table | 2.12 |
| dolt_diff_table_from_commit_filter | system table | 1.15 |
| dolt_diff_table_to_commit_filter | system table | 1.15 |
| dolt_diffs_commit_filter | system table | 0.93 |
| dolt_history_commit_filter | system table | 1.53 |
| dolt_log_commit_filter | system table | 1.1 |

github-actions[bot] commented 1 month ago
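
@coffeegoddd DOLT

| name | add_cnt | delete_cnt | update_cnt | latency |
|------|---------|------------|------------|---------|
| adds_only | 60000 | 0 | 0 | 0.71 |
| adds_updates_deletes | 60000 | 60000 | 60000 | 3.77 |
| deletes_only | 0 | 60000 | 0 | 1.87 |
| updates_only | 0 | 0 | 60000 | 2.41 |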