citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.53k stars 667 forks source link

Unsafe to run shard copy concurrently with various functions and commands #913

Open marcocitus opened 7 years ago

marcocitus commented 7 years ago

We have a number of code paths that use placement metadata in an unsafe way, namely without first obtaining a shard metadata lock, meaning they are allowed to run concurrently with a shard repair/copy/move and might use stale metadata. Even if we do obtain a lock, we also need to make sure that changes in shard metadata made by repair/copy/move are visible once the lock is obtained, which may require a new snapshot. Not doing so may result in incorrect results, inconsistent replication, or data loss, when these code paths are exercised concurrently with a shard placement change.

ozgune commented 7 years ago

I'm adding the v6.1 milestone with the intent to spend a day on this and to document scenarios where we could have safety issues.

onderkalaci commented 7 years ago

@ozgune @marcocitus

document scenarios where we could have safety issues.

If I tried to document scenarios, it wouldn't be possible to find all UDFs that could lead to problems in half-a-day. Instead, I preferred to list the dangerous UDFs so that we could prioritize (either to document scenarios or fixing it). Do we want to fix some of the top items in the list for 6.1?

UDFs that we should consider to work with shard rebalancer safely (in the order of importance(?)):

The UDFs that are protected by both shard metadata and/or shard resource locks (i.e, Already works fine OK with rebalancer):

The UDFs that seem to not require any locks

Some of the observations / notes:

onderkalaci commented 7 years ago

Since the scope for 6.1 is to investigate and we already did, the 6.1 tag should be removed from the issue. Any objections?

metdos commented 7 years ago

Since the scope for 6.1 is to investigate and we already did, the 6.1 tag should be removed from the issue. Any objections?

I think we should create a 6.2 release milestone and move issues like this to there? @ozgune, @sumedhpathak?

sumedhpathak commented 7 years ago

@metdos that works for me. @ozgune do we intend to work on some of these in 6.2? Else we can just remove the 6.1 milestone to mark it as closed?