Open marcocitus opened 7 years ago
I'm adding the v6.1 milestone with the intent to spend a day on this and to document scenarios where we could have safety issues.
@ozgune @marcocitus
> document scenarios where we could have safety issues.

If I tried to document scenarios, it wouldn't be possible to find all the UDFs that could lead to problems in half a day. Instead, I preferred to list the dangerous UDFs so that we can prioritize (either documenting scenarios or fixing them). Do we want to fix some of the top items in the list for 6.1?
start_metadata_sync_to_node(): Iterates through all shards/placements for MX. Could send missing or wrong information to a worker while it is being initialized as an MX node.
master_apply_delete_command(): Dangerous because the rebalancer might move some placements while the command is being applied. Those placements might not get the delete command.
mark_tables_colocated(): Checks shards/shard placements of distributed relations. Could lead to marking non-colocated tables as co-located.
master_drop_distributed_table_metadata(): Uses worker_drop_distributed_table(), so fixing the item below should fix this one as well.
worker_drop_distributed_table(): Iterates over shards/shard placements to drop on the workers for MX.
master_drop_all_shards(): Iterates over shards to drop on a DROP TABLE command. Might fail to drop some of the placements (replicate) or leave some orphaned shards (rebalance).
master_get_table_metadata(): Affected because the shard replication factor is part of its output.
master_update_shard_statistics(): Uses FinalizedShardPlacementList().
master_expire_table_cache(): Iterates over shards/shard placements.
master_stage_shard_row(): A comment says it is only used for csql, so we should probably delete the UDF itself.
master_stage_shard_placement_row(): A comment says it is only used for csql, so we should probably delete the UDF itself.

UDFs that are protected by shard metadata and/or shard resource locks (i.e., they already work fine with the rebalancer):
master_modify_multiple_shards()
master_append_table_to_shard()
upgrade_to_reference_table()
master_add_node()

UDFs that seem not to require any locks:
master_create_empty_shard()
The worker_ UDFs are mostly OK.

Since the scope for 6.1 is to investigate and we already did, the 6.1 tag should be removed from the issue. Any objections?
> Since the scope for 6.1 is to investigate and we already did, the 6.1 tag should be removed from the issue. Any objections?
I think we should create a 6.2 release milestone and move issues like this to there? @ozgune, @sumedhpathak?
@metdos that works for me. @ozgune, do we intend to work on some of these in 6.2? Otherwise, can we just remove the 6.1 milestone to mark it as closed?
We have a number of code paths that use placement metadata in an unsafe way, namely without first obtaining a shard metadata lock. This means they are allowed to run concurrently with a shard repair/copy/move and might use stale metadata. Even if we do obtain a lock, we also need to make sure that changes in shard metadata made by repair/copy/move are visible once the lock is obtained, which may require a new snapshot. Failing to do either may cause incorrect results, inconsistent replication, or data loss when these code paths are exercised concurrently with a shard placement change.
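To make the second point concrete, here is a minimal sketch of why taking the lock is not enough if the metadata was read beforehand. It is a toy in-memory model in Python, not Citus code: `PlacementCatalog`, `targets_from_stale_read`, `targets_from_fresh_read`, the shard ID, and the worker names are all hypothetical and only stand in for the placement metadata, the shard metadata lock, and a snapshot.

```python
import threading


class PlacementCatalog:
    """Toy stand-in for shard placement metadata (hypothetical, not Citus code)."""

    def __init__(self, placements):
        # shard_id -> list of node names holding a placement of that shard
        self._placements = {shard: list(nodes) for shard, nodes in placements.items()}
        self._shard_locks = {}           # shard_id -> per-shard "metadata lock"
        self._guard = threading.Lock()   # protects the dictionaries themselves

    def shard_lock(self, shard_id):
        # Return the per-shard lock, creating it on first use.
        with self._guard:
            return self._shard_locks.setdefault(shard_id, threading.Lock())

    def snapshot(self, shard_id):
        # Return a point-in-time copy; later moves do not change this list.
        with self._guard:
            return list(self._placements[shard_id])

    def move_placement(self, shard_id, source, target):
        # Plays the role of a shard move: it runs under the shard lock.
        with self.shard_lock(shard_id):
            with self._guard:
                nodes = self._placements[shard_id]
                nodes[nodes.index(source)] = target


def targets_from_stale_read(catalog, shard_id, earlier_snapshot):
    # Unsafe pattern: the lock is taken, but the placements were read before it.
    with catalog.shard_lock(shard_id):
        return earlier_snapshot


def targets_from_fresh_read(catalog, shard_id):
    # Safer pattern: take the lock first, then read the placements again.
    with catalog.shard_lock(shard_id):
        return catalog.snapshot(shard_id)


catalog = PlacementCatalog({102008: ["worker-1", "worker-2"]})
before_move = catalog.snapshot(102008)                   # metadata read early
catalog.move_placement(102008, "worker-2", "worker-3")   # concurrent move completes

stale = targets_from_stale_read(catalog, 102008, before_move)
fresh = targets_from_fresh_read(catalog, 102008)
print(stale)  # still lists worker-2, which no longer holds the placement
print(fresh)  # lists worker-3, the placement after the move
```

In this model, a command that sends its work to the `stale` targets would skip the placement on `worker-3`, which is the kind of missed placement described for the UDFs listed in this thread. The real fix depends on how each code path reads its metadata, so treat this only as an illustration of the ordering problem.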