WICG / protected-auction-services-discussion

KV Server Clarifications: Sharding Function, Static/Dynamic vs In Flight Resharding, Cluster/Replica Awareness #56

Closed thegreatfatzby closed 2 months ago

thegreatfatzby commented 6 months ago

A couple of questions:

Static/Dynamic vs In Flight Resharding

Here I see that we'll only have support for static resharding and can consider dynamic later...but two paragraphs down I see reference to "in flight resharding". In a more recently updated doc I see "in flight sharding not yet supported" here.

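So a few clarifications:

Can you clarify the distinction, if any, between dynamic sharding (as opposed to static) and "in flight resharding"? Is the more up-to-date document correct?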

Sharding Function

Based on this code and the statement "It is a SHA256 mod number of shards", is it right that the sharding function is not controllable by the ad tech?

Cluster/Replica Awareness

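Am I getting these distinctions correct:

Writing: client must be shard aware, not replica aware, i.e. the ad tech client must shard the data set and send data to the correct shard in its code. But it only needs to send to one of the servers that handles that shard, and consistency is handled internally?

Reading: client need not be shard or replica aware? I.e. it can send a read to any host and, if the key being read is in a replica set in the cluster, whether on the originating host or not, it will be found?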

Are there any limits to replication? Can it go across data centers? Any geographic limitations?

UDFs

One UDF per KV cluster?

lx3-g commented 5 months ago

Hello Isaac Foster,

Sorry for the delayed response.

First of all, since you haven't linked this doc anywhere in your question, I want to make sure you've seen it.

Static/Dynamic vs In Flight Resharding

Here[1] I see that we'll only have support for static resharding and can consider dynamic later...but two paragraphs down[2] I see reference to "in flight resharding". In a more recently updated doc I see "in flight sharding not yet supported" here[3].

So a few clarifications:

Can you clarify the distinction, if any, between dynamic sharding (as opposed to static) and "in flight resharding"? Is the more up-to-date document correct?

Links #2 and #3 point to the same doc. Did you mean for them to be different?

To clarify the distinction between static and dynamic sharding:

In-flight resharding is the ability of a solution that uses static sharding to change its number of shards from N to K without downtime. Given the above definitions, I believe all the documents are correct.
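To illustrate why changing the shard count from N to K involves moving data, here is a minimal sketch. It assumes the "SHA256 mod number of shards" description quoted in the original question and, as a further assumption, reads the whole digest as a big-endian integer; the real server may derive the integer differently. The function names and keys are hypothetical.

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Illustrative shard assignment: SHA-256 of the key, mod the shard count.

    Assumption: the whole digest is read as a big-endian integer. The actual
    KV server may convert the digest differently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1 for k in keys
        if shard_for_key(k, old_shards) != shard_for_key(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set, used only for illustration.
sample_keys = [f"key-{i}" for i in range(10_000)]
print(f"moved N=4 -> K=5: {fraction_moved(sample_keys, 4, 5):.1%}")
```

With a uniform hash, a key keeps its shard when going from 4 to 5 shards only if its hash mod 20 is below 4, so roughly 80% of keys are expected to move. Under this kind of scheme, a shard count change means redistributing most of the data.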

is it right that the sharding function is not controllable by the ad tech?

yes

Writing: client must be shard aware, not replica aware, i.e. the ad tech client must shard the data set and send data to the correct shard in its code. But it only needs to send to one of the servers that handles that shard, and consistency is handled internally?

Technically, you don't have to be shard aware. You can write your data as if the solution were not sharded, and it will be loaded onto the correct shards and replicas. However, in the interest of speed and resource efficiency, you can mark a delta file as containing records for a specific shard only, and then it will not be read by any other shard. More here. Note that you never send a delta file to a server; you always upload it to a specific bucket.

Similar logic also applies to the realtime update path.
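As a rough sketch of the optional shard-aware write path described above, the snippet below groups records by shard so that each group could go into its own delta file marked as shard-specific. It assumes the "SHA256 mod number of shards" function mentioned in the question, with the digest read as a big-endian integer (an assumption; the server's exact conversion may differ). `shard_for_key`, `group_records_by_shard`, and the sample records are hypothetical names for illustration. Writing the actual delta files and their shard metadata is left out, since that format is defined by the KV server tooling.

```python
import hashlib
from collections import defaultdict


def shard_for_key(key: str, num_shards: int) -> int:
    """Illustrative only: SHA-256 of the key, mod the number of shards.

    Assumption: the digest is interpreted as a big-endian integer.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def group_records_by_shard(
    records: dict[str, str], num_shards: int
) -> dict[int, dict[str, str]]:
    """Split key/value records into one group per shard.

    Each group could then be written to a separate delta file that is
    marked as containing records for that shard only.
    """
    groups: dict[int, dict[str, str]] = defaultdict(dict)
    for key, value in records.items():
        groups[shard_for_key(key, num_shards)][key] = value
    return dict(groups)


# Hypothetical records, used only for illustration.
records = {f"campaign-{i}": f"value-{i}" for i in range(8)}
for shard, group in sorted(group_records_by_shard(records, num_shards=3).items()):
    print(shard, sorted(group))
```

Skipping this step is also valid: per the answer above, data written without regard to sharding is still loaded onto the correct shards, and the grouping only saves other shards from reading files that hold nothing for them.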

Reading, client need not be shard or replica aware? I.e. it can send a read to any host and, if the key being read is in a replica set in the cluster whether on the originating host or not, it will be found?

yes

Are there any limits to replication?

Replication is managed by auto scaling groups, so all the same limits apply.

Can it go across data centers? Any geographic limitations?

As per above. For example, in the case of AWS: https://aws.amazon.com/ec2/autoscaling/faqs/

Q: Can EC2 Auto Scaling groups span multiple AWS regions?

EC2 Auto Scaling groups are regional constructs. They can span Availability Zones, but not AWS regions.

One UDF per KV cluster?

At the moment we support only one UDF for all clusters.

I hope that helps.