dannycoates / fxa-kv-server


Idea: Use the Sync backend #5

Open ckarlof opened 10 years ago

ckarlof commented 10 years ago

I said I would never do this, but Toby has suggested we look at the Sync backend for general service use.

Questions I have:

1) Durability, but this may just mean backing up the DB.
2) Use of timestamps, which also may be fine as long as clients stop using their own time.
3) Encryption. I don’t want to require it. How much do we need to contort Sync to expect plaintext data?
4) OAuth integration. Doesn’t exist yet, but it’s not that hard.

Other issues?

ckarlof commented 10 years ago

> I said I would never do this

Actually, I said I wouldn't use the existing Sync service. Building on the existing Sync codebase is an idea I hadn't even considered before.

ckarlof commented 10 years ago

From Toby:

> 1) Durability, but this may just mean backing up the DB.

Yeah, you can add replication to the DB fairly trivially. We don't do it because it's not worth the cost given the data profile.

> 2) Use of timestamps, which also may be fine as long as clients stop using their own time.

Timestamps are needed for syncing, but should be ignorable for kv lookups. But, yes, clock drift is a horrible, horrible thing. We've had plans around for fixing that in 2.0, but it's a lot of client work.

> 3) Encryption. I don’t want to require it. How much do we need to contort Sync to expect plaintext data?

Zero. Sync just accepts data. It has no idea if it's encrypted or not.

> 4) OAuth integration. Doesn’t exist yet, but it’s not that hard.

That would be best done through tokenserver and, knowing rfkelly, has already been done.

Doing it in sync directly is also doable.

ckarlof commented 10 years ago

> Timestamps are needed for syncing, but should be ignorable for kv lookups. But, yes, clock drift is a horrible, horrible thing. We've had plans around for fixing that in 2.0, but it's a lot of client work.

I want what timestamps are trying to achieve: incremental fetches since some version of the collection. I just don't want to use real time to do it.
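For illustration, a minimal sketch (invented names, not the sync API) of a per-collection version counter that gives incremental fetch without touching any clock:

```typescript
// Illustrative sketch: each write bumps a monotonic per-collection
// counter, and clients ask for "everything newer than version V"
// instead of comparing wall-clock timestamps.
interface Item {
  key: string;
  version: number;
  payload: string;
}

class Collection {
  private version = 0; // monotonic; never reads the clock
  private items = new Map<string, Item>();

  put(key: string, payload: string): number {
    this.version += 1;
    this.items.set(key, { key, version: this.version, payload });
    return this.version;
  }

  // Incremental fetch: everything changed since the caller's last version.
  changedSince(since: number): Item[] {
    return [...this.items.values()].filter((i) => i.version > since);
  }
}
```

A client just remembers the highest version it has seen and passes it back on the next fetch; clock drift never enters the picture.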

ckarlof commented 10 years ago

> 4) OAuth integration. Doesn’t exist yet, but it’s not that hard.

> That would be best done through tokenserver and, knowing rfkelly, has already been done.

Seems too complicated. Why not just have the server accept OAuth tokens directly? Plus, the Hawk nonsense adds its own complexity.

ckarlof commented 10 years ago

/cc @rfk

telliott commented 10 years ago

Tokenserver gives you an infrastructure that's already in place. It doesn't have to be Hawk, that's just implemented for Sync 1.5. It supports any number of auth protocols. The point of the tokenserver is to abstract the auth away from the service and give you user sharding for free.

But, sync is also built with pluggable authentication (that currently reads tokens) and an oauth library should be pretty doable. Would need to keep a local userid db.
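As a rough sketch of what such a plugin could look like, assuming a verify endpoint and response shape that are inventions here rather than the real FxA OAuth API:

```typescript
// Hypothetical OAuth auth plugin for the storage server. The endpoint
// URL and response fields are assumptions, not the real FxA API.
import fetch from "node-fetch";

const OAUTH_VERIFY_URL = "https://oauth.example.com/v1/verify"; // assumed

// Local userid db, as telliott suggests; an in-memory Map stands in
// for a real table mapping FxA uid -> compact storage userid.
const localUserIds = new Map<string, number>();
let nextUserId = 1;

interface VerifyResponse {
  user: string;      // FxA uid
  client_id: string; // the relier that holds the token
  scope: string[];
}

export async function authenticate(bearerToken: string): Promise<number> {
  const res = await fetch(OAUTH_VERIFY_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ token: bearerToken }),
  });
  if (!res.ok) throw new Error("invalid OAuth token");
  const info = (await res.json()) as VerifyResponse;

  // Map the FxA uid onto a locally assigned integer userid.
  let uid = localUserIds.get(info.user);
  if (uid === undefined) {
    uid = nextUserId++;
    localUserIds.set(info.user, uid);
  }
  return uid;
}
```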

telliott commented 10 years ago

Clock drift is easy to avoid. Just never look at your local time, and cache the returned server timestamps (which you'd have to do for any version).
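A minimal client-side sketch of that rule, assuming a Sync-1.5-style `X-Last-Modified` response header and `newer` query parameter (names may differ):

```typescript
// Never consult the local clock: cache the server's timestamp from each
// response and feed it back on the next incremental fetch.
import fetch from "node-fetch";

let lastServerTime: string | null = null; // cached from prior responses

export async function fetchChanges(collectionUrl: string) {
  const url = lastServerTime
    ? `${collectionUrl}?newer=${encodeURIComponent(lastServerTime)}`
    : collectionUrl;
  const res = await fetch(url);

  // Cache whatever clock the server reports; never call Date.now() here.
  const serverTime = res.headers.get("X-Last-Modified"); // assumed header
  if (serverTime !== null) lastServerTime = serverTime;

  return res.json();
}
```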

rfk commented 10 years ago

> I want what timestamps are trying to achieve: incremental fetches since some version of the collection. I just don't want to use real time to do it.

I strongly agree, and this was one of the things we really wanted to get done in Sync 2.0 but had to abandon for Sync 1.5.

A sync-like db seems like a good fit for our problem space, but sync itself has some legacy baggage that would be good to avoid in a new system, for example:

- Client-visible sharding. This is mostly aesthetics I guess, and it does have concrete benefits w.r.t. how easily we can scale things, but I find the whole endpoint-url-fetching dance to be plain ugly from an API design perspective (sorry tokenserver!)

(Side note: you may notice that sync-with-those-changes would be very very close to the CouchDB data model.)

Something that sync doesn't have is any notion of oauth's third leg - it stores data indexed by userid, but not based on clientid. IIUC having relier-specific storage buckets is an explicit design goal here, and that may not be easy to shoe-horn into sync.

There's a lot of value in re-using existing working code, but my gut says that the missing third leg will tip this from "quick and easy, if a little ugly" squarely over to "not quite the right fit for this problem".
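To make the missing third leg concrete, a hypothetical sketch of relier-scoped storage keyed by both the user and the OAuth client; all names here are illustrative, not an existing schema:

```typescript
// Storage rows keyed by both the FxA user and the OAuth client
// (relier), unlike sync's user-only keying.
interface StoredItem {
  userId: string;    // FxA uid (sync already has this)
  clientId: string;  // OAuth relier id, the piece sync lacks
  collection: string;
  key: string;
  version: number;   // monotonic per-bucket counter, not wall-clock time
  payload: string;
}

// A bucket is the (userId, clientId) pair; each relier sees only its own.
function bucketOf(item: StoredItem): string {
  return `${item.userId}/${item.clientId}`;
}
```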

rfk commented 10 years ago

By the way, "build an identity-attached key-value store" is basically what I was hired for three years ago. So I may tend to have some strong opinions on the topic. But I'm really excited to see momentum behind getting such a thing built! :smile:

ckarlof commented 10 years ago

> Client-visible sharding. This is mostly aesthetics I guess, and it does have concrete benefits w.r.t. how easily we can scale things, but I find the whole endpoint-url-fetching dance to be plain ugly from an API design perspective (sorry tokenserver!)

I don't really like it either; as a developer, it adds a lot of friction to getting started. If we wanted to build a more closed API (e.g., only client-side interfaces in our user agents), it might make more sense.

ianb commented 10 years ago

What are we trying to optimize for? Latency of communication through the server? Cost of data at rest? Large chunks of data or small? Chatty updates or infrequent updates? Is this likely to be a (generic) sync backend for primarily client-based storage? High value data, or data that can be lost without too many repercussions?

Firefox Sync is half key-value and half time series. IMHO we should pick one – or pick both and implement two kinds of backends.

I personally think time series is more interesting. Key-value implies that there is some particular metadata (the "key") that can be known a priori by clients. It also tends to suggest that the cloud is used as the canonical source of information, and therefore broad change detection is not necessary. You can add change detection (as Firefox Sync does) but at that point do you actually need the keys?

Using a time series system you can bring all clients up to an accurate and self-consistent state with a minimum of interactions. If there is a use case where coming all the way up to date is too inefficient (i.e., there is data that is available but not interesting) then some indirection may be called for, with metadata being held in the time series system and some larger raw data being referenced by that metadata, and loaded only lazily. Without the metadata, the client can only tell what's interesting by doing essentially server-supported queries. Type queries are a common one (something we see in Sync Collections), but are not complete.
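As a sketch of that indirection, with shapes and names invented here: the time series holds small metadata events, each optionally pointing at a larger blob that clients fetch lazily only if they care.

```typescript
// Illustrative only: small events always sync; big payloads are
// referenced by URL and fetched on demand.
import fetch from "node-fetch";

interface Event {
  seq: number;      // position in the time series
  type: string;     // lets clients filter without server-side queries
  meta: unknown;    // small metadata, always synced
  blobUrl?: string; // large raw data, referenced but not inlined
}

async function materialize(event: Event): Promise<unknown> {
  if (event.blobUrl === undefined) return event.meta;
  // Only hit the network when the client actually wants the big payload.
  const res = await fetch(event.blobUrl);
  return res.json();
}
```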

It would be helpful to have a list of use cases to analyze how different models would support those use cases.

rfk commented 10 years ago

> What are we trying to optimize for?

It's a glib answer, but: "developer experience". This thing is likely to be an 80% solution for 80% of FxA-attached service use-cases. Which probably means being a jack-of-all-trades and master-of-none on all the dimensions you highlighted.