Delta sync - Githubissues

ntziolis commented 5 years ago

A generalized way to handle changes in data on the server while client was offline without refetching all data after reconnect as well as transparently execute initial data loading without requiring user to have all data locally already for app to function.

I completely understand that this is not a client only feature but this library seemed to be the best place to start a discussion on how we would go about establishing a standardized way to achieve this capability (server to client data replication) when leveraging a graphql backend

Describe the solution you'd like

Client side API that allows to specify which data should be delta synced from the server to the client and how in a non boilerplate manner
Server side requirements to support delta sync
Establish overridable conventions

Describe alternatives you've considered

AWSAppSync
- Its AWS only
- No support for apollo >2.5 local state management and local resolvers
Custom built on graphql
- Well ... it would be custom building a central component of what a lot of folks need
custom build on top of puchdb
- requires couchdb compatible db on server vs graphql endpoint

wtrocki commented 5 years ago

We can cover this requirement by:

Implementing a new cache layer on top of Apollo GraphQL InMemory storage. Working on Open Sourcing that part in https://github.com/graphql-heroes/ApolloQLCached
Integrating QLCached with ApolloOfflineClient and providing great UX for exchanging data.

It is really glad to see this request coming from the community as it validates that this use case can be useful for people. General target will be to support diffs in OfflineClient. Going to create Roadmap issue soon so we can put some timing on when and how things will be delivered.

@ntziolis Do you have time to collaborate on requirements/approach for this?

wtrocki commented 5 years ago

Forgot to mention that we are not only client side and js. Our goal is to provide a comprehensive layer for both Client and Server side. Currently, we provide server-side package for conflicts and planning to do more in that space soon.

ntziolis commented 5 years ago

This is just awesome. I'd love to assist in this. My background is in building replication engines so lmk how I can participate. I think true offline capabilities that do NOT force single cloud vendor lock-in or custom solutions are the last major piece in the GraphQL all the way puzzle.

Since I now seem to be talking to the like minded I wanne run something by you I have been thinking about for a long time now:

Our goal should be to to handle data the same way on the clientside as on the serverside. For me the backend today starts clientside already. Really everything that retrieves and stores data I see as backend of my actual app. And it seems strange that we use different tool chains and APIs to handle the data.

So the goal should be to have have at least a subset of what is available server side on the client side as well, with the same API.

So: Assuming we have a way todo delta sync in a transparent way. One also needs a way to query data (incl filtering etc) on the client side. While this could be implemented manually in custom client side resolvers with apollos new client side state handling it would lead to a lot of handcrafted boilerplate code. Instead I'd love to build something like Prisma/joinmonster etc but for the client side.

Further when making sure such a layer is configurable in the sense which operations (eg filters like equals, contains etc ) are available and how they are exposed (down to customising the name and structure). It would be possible to provide at least a subset of the functionality available on the server which would allow to send any query to the server as well to ensure the user operates on the latest set of data.

What do you think about something like this? Or How would you go about consuming the cached data in a meaningful manner when trying to execute filtering etc clientside?

ntziolis commented 5 years ago

In regards to delta sync I wanted to get started with an initial list of factors and scenarios.

The following list keeps in mind both data storage and transfer requirements I see 3 main factors in play (standard replication stuff):

Create only
- think non changing master data
- on client no need to execute a get all after certain periods of time to make sure client has all the data
- on server no need to have a delta table as all that's needed is information about the order of records or a timestamp
Create, Update, Delete
- on client requires get all data calls at start + apply delta continuously + periodically to make sure client has all data
  - preventing periodic get all's comes at the high cost of storing and evaluating which client has which data
  - while this can be done it should not be a special use case but handled via partition key see below
- on server there is no way around a delta table
  - either handle all deltas in a delta tables
  - or only deletes in delta and updates/inserts via timestamp
    - this option reduces data to be stored on server and entangled cleanup duties as deltas should not be stored indefinitely
  - there should be Implementations for both weather to use a split strategy or have deltas for all changes should be up to the dev
Partitioning (applies to both of the above)
- allows for mixing 1. and 2. approaches on the same type but different partition key
- for example allows for splitting data of the same type into historic data (insert only) and current data (created/update/delete), heavily reducing get all call response size and potentially eliminating the need for per client deltas
- that said it would also allow to implement client specific deltas if needed

Server metadata storage strategies for sync:

Could be in the source data directly (think adding order/timestamp directly to a table etc.)
or in a completely different data source

Having the ability to choose between the two approaches is crucial to not requiring a specific implementation or downstream datasource capabilities for sync to work. While it should absolutely be possible to leverage them when available.

Feel free to let me know if im overshooting.

wtrocki commented 5 years ago

I think true offline capabilities that do NOT force single cloud vendor lock-in or custom solutions are the last major piece in the GraphQL all the way puzzle.

This is very much the target of this. We might provide some out of the box deployment options later, but target here is to provide a flexible package that works out of the box with the existing backends.

Really like the ideas. There is no overshooting as from my point of view so many people are looking for anything like this for some time.

What do you think about something like this? Or How would you go about consuming the cached data in a meaningful manner when trying to execute filtering etc clientside?

Yes. This is pretty much sums it up and it is possible now in Apollo. When using local projection of the data we can work seamlessly. Example I always use is pagination:

We can have offline pagination and online pagination etc. For offline pagination, cache need to not only store individual page queries but reference entire dataset that will give pagination options.

This quite challenging task as it will involving:

Providing custom CacheLayer that normalizes to database
Providing an additional layer with subscriptions or periodical queries for updating projections.
Providing state aware Queries and Resolvers (same as @client but more out of the box)

For the moment we kinda focusing on giving fully featured offline behavior and great user experience related to that. Developers should be able to work seamlessly with any GraphQL objects (Files, Subscriptions etc.) Once we do that right, next stage will be to go towards storage improvements and deltas.

I will need more time to write a proposal on how DeltaSync will work. Then we can collect some feedback from industry and create individual github issues for collaboration.

In relation to the second comment. I will need to put some diagram on how this will work and write some proposal to open conversation. This is too large topic to simply draft that on the single github issue.

wtrocki commented 5 years ago

@ntziolis Thanks for reaching out. I'm going to work on a general proposal for diffing capability so we can collaborate better.

evelant commented 5 years ago

So the goal should be to have have at least a subset of what is available server side on the client side as well, with the same API.

This is exactly what Meteor.js does if you haven't checked it out. They let the client subscribe to a data set then transparently stream that data into a client side MongoDB implementation (minimongo) that matches the server's mongo client API. The server then tracks active subscriptions against the mongodb oplog and sends any diffs down to subscribed clients. It is totally transparent and reactive for the client. The client and server can easily share code because they have the same database API.

While it has its downsides (mongo lockin, scaling, performance, not very actively developed anymore) nothing has managed to match the meteor dx so far in my opinion. I think the idea of the client having the same data API as the server is key to unlocking a lot of code reuse and really powerful features. Meteor's architecture might be a good place to look for some inspiration.

I'm excited to see where this project goes! I'm evaluating using Apollo and Prisma for my react-native project and this seems like the missing piece of the puzzle. Unfortunately I don't have enough experience with apollo/graphql yet to contribute much but I would like to help wherever I can.

ntziolis commented 5 years ago

@AndrewMorsillo 100% agree on that meteor is exactly where we wanne end up from a dev experience perspective. In fact my team has used meteor to build our first 3 enterprise SaaS solutions, but by now have migrated them to a GraphQL (key reasons where seamless external rest service integration, manageability, long term framework support, no-lock-in to specific technologies / frameworks on the backend and scale issues).

The goal for is to build a data backend independent version of what meteor delivers in regards to data handling server/client-side. Once exists everyone can build their on providers for their data backend without tech stack lock-in.

evelant commented 5 years ago

@ntziolis I'm in the same boat as you. I'm switching from meteor to graphql for the same reasons in the next iteration of my project.

Agreed 100% on the goal. Providing what you get from meteor in a more open backend agnostic fashion will be the ultimate dream for js development.

wtrocki commented 5 years ago

I think the best way to start with this is simply to enable the application to Query specific data on the server and Subscribe for results when:

Application is starting
The application became online and it is on the foreground.

We currently have OfflineMutationsHandler that gives the capability to resend offline mutations.

Proposal for client

The client can have new methods for registering queries/subscriptions:

client.registerOnlineQuery(new OnlineQuery({gql,variables}))
client.registerOnlineSubscription(new OnlineSubscription({gql,variables}))

OnlineQuery/OnlineSubscription apart from having all required fields to perform mutation will contain metadata used to decide when to call mutate.

For example:

// Wait with request after becoming online
public initialDelay: number = 0; 
// Interval used for pooling
public interval?: 
// Even some extra metadata
public requiresWifi: boolean

Developers will be able to trigger Query Refresh manually (and force subscriptions to reconnect:

client.forceOnlineRefresh();

Related work

Extend OfflineMutationsHandler to handle subscriptions
Create abstraction for OnlineQuery and OnlineSubscription
Create Registry of OnflineQueryes/Subscriptions. Expose methods to register OnlineQueries and OnlineSubscriptions to registry
Connect NetworkState interface to interact with the registry and execute depending on medatadata in OnlineQuery and OnlineSubscription
Write unit tests and integration tests for this feature.

Open for comments, opinions and contributions We can create individual issues once the community will agree on the flow.

ntziolis commented 5 years ago

I think the best way to start with this is simply to enable the application to Query specific data on the server and Subscribe for results when: Application is starting / The application became online and it is on the foreground.

Totally agree with doing this step by a step and using a client side only approach in the first step that doesn't require server side changes. In addition we should make this as transparent as possible.

Looking at what watchQuery already provides:

refetch functionality is available and users are already familiar with it
subscribeToMore already provides an interface for telling a query that it will stay up to date
specifying a pollingInterval already triggers refetch based on a timer event

I think effectively what we want is what watchQuery already does + having it respond to the additional events (appstart/reconnect/foreground/delay etc.).

We could achieve this by wrapping the watchQuery with the additional functionality and exposing it via helper methods (like we discussed in the other issue) as well as an additional method on the client watchQueryWithOffline for easy of use while allowing for additional parameters to be passed in without breaking standard API.

In regards to the subscriptions connected to a watchQuery:

all (globally) previously active subscriptions automatically resubscribe when websocket is back online
so this is not something we actually need to handle on a per watchQuery basis
what we might wanne do though is wait for the websocket to be back online before executing fetch on the watchQuery to avoid having any watchQueries without an active subscription

wtrocki commented 5 years ago

Totally agree with doing this step by a step and using a client side only approach in the first step that doesn't require server side changes. In addition we should make this as transparent as possible.

We will follow up with server side node.js package but IMHO is best to start with client side usages first and try them to see if we even need anything from server or it can be done in framework user space.

I think effectively what we want is what watchQuery already does + having it respond to the additional events (appstart/reconnect/foreground/delay etc.).

Yes. This pretty much sums intentions here 💯

We could achieve this by wrapping the watchQuery..

Love it. Going to work on the base for that and post update in comming days.

all (globally) previously active subscriptions automatically resubscribe when websocket is back online

This is already there, however it is a very naive approach and we do not resubscribe on app restart.

what we might wanne do though is wait for the websocket to be back online before executing fetch on the watchQuery to avoid having any watchQueries without an active subscription

Awesome idea! I totally forgot about the fact that those should be interconnected. Currently, subscriptions and queries are like connected in user app rather than a framework. We simply reconnect by retrying to subscribe when offline. This is a very very naive approach as we have information when add becomes online.

alidcast commented 5 years ago

@wtrocki regarding Implementing a new cache layer on top of Apollo GraphQL InMemory storage is there a reason you chose not to use an existing js database such as Pouchdb to persist/sync the cache? and for those of us considering Pouchdb for these capabilities now, to what extent would this module be compatible?

ntziolis commented 5 years ago

@alidcastano The current status should be seen as stepping stone. Reusing an existing db project is absolutely something we are looking into, keeping in mind that the end goals are to:

leverage this not only as a simple store but to be able to transparently respond to graphql queries/mutations/subscriptions offline.
integrate the backend (data replication/sync etc.) via graphql to not impose specific technologies on the server side.

To your question: PouchDB and underlying CouchDB replication protocol are first in class when in comes to offline first clientside apps. But they do require CouchDB replication protocol compatible DBs on both client side and server side which greatly limits the number of projects that can leverage them. Part of what we are trying to achieve is a graphql based version of what pouchdb does really well today while not imposing specific db technologies on the backend (and obviously with a graphql api for the the fronted).

I'm still in the process of researching the fit of existing browser based in memory dbs out there for fit for this project, so if anyone has pointers to project not mentioned in the below list please feel free to pile on:

PouchDB / RxDB
LokiJS
Lovefield / ReactiveDB
AlaSQL

Update: Looking for in memory DBs as performance is key since it will replace any cache storage engine in addition to also handling offline query scenarios as we wanne avoid maintain 2 versions of the same data on the clientside (apart from persistence).

xtagon commented 5 years ago

👍 for not imposing specific technologies. Part of the appeal of the Apollo tools is that they can be glued together for slightly different stacks/use cases.

alidcast commented 5 years ago

@xtagon I just started looking into this space myself so there may be technical nuances I'm not seeing - but in general, data synchronization and conflict resolutions are hard problems to solve, why not use an existing, battle-tested solution in the interim? it'll be the difference between being able to use a production-ready solution next week versus next year.

apollo-servers pubsub implementation, for example, just provides an abstraction layer - to which the community can create their own tech specific implementations. the redis package being the most popular one right now.

I can understand not wanting to use a framework specific solution (such as redux for caching, which I'm glad Apollo moved away from) but there's lots of great work (and ecosystems!) in this space in JS land, why not take advantage of them? are there some incompatibilities I'm not aware of? seems like apollo-cache-persist already exposes the necessary API for it

alidcast commented 5 years ago

@ntziolis regarding CouchDB replication protocol compatible DBs, I likely need to look into this more but won't any implementations need to be compatible with specific databases? it seems like the only difference will be that it's some new Graphql protocol versus an already tested protocol - though I guess there might be some complexity involved in passing graphql queries/mutations back and forth

alidcast commented 5 years ago

@ntziolis here's a comment I found that adds to your database list: https://github.com/prisma/prisma/issues/1659#issuecomment-391129297

wtrocki commented 5 years ago

@alidcastano Thank you so much for listing that out. I will research the list of the databases that were provided.

wtrocki commented 5 years ago

After quick check I think we can list 2 categories of soluions:

Using GraphQL endpoint for synchronization and replicating queries that were made
Using internal mechanism with custom endpoint for replicating data.

Both will have some advantages and disadvantages. We already have database support in form of indexedDB but it is using just single key thru apollo-cache-persist. We can migrate to multiple keys and store cache data into literally any db but this will need more tighter integration with InMemory cache. Currently offix users can also use Hermes and Flache.

ntziolis commented 5 years ago

@alidcastano Thank you for the link, saw that previously but couldn't find it anymore so thank you!

just to be clear, my statements where oriented towards the end goal and should not mean that in the interim it wouldn't be a good idea to bridge with an existing technology. Im all for using an interim solution until we have this ironed out in a non technology specific manner.

My main concerns for choosing an interim solution are:

the complexity of hosting the server side parts. Lets say we would use PouchDB to ahndle the data replication piece, this means folks would need to host a CouchDB cluster (non trivial) which might not be something they are familiar with adding strain on their team to operate an additional stack and having to handle maintenance, Backup, DR etc.
folks would also need to implement data being moved in/out the CouchDB cluster from their main DB, this puts the entire strain of handling complex replication tasks onto every team using offix
last but not least the potential for feature reduction in our final solution. Given that CouchDB offers capabilities that not all other DBs do and it specific design around replicatability it is highly likely that it will offer some thing we will not be able to offer in a generalized version.

So as @wtrocki has rightfully put there are 2 distinct problems (interim and long term) and the tech choices for both are driven by different factors which likely result in different choices for each.

Just to shed more light on why a graphql based replication protocol makes sense:

won't any implementations need to be compatible with specific databases

Totally correct, the difference is the feasibility and complexity of such an implementation: For example the CouchDB protocol makes certain assumptions on DB hooks that simply do not exists on most commonly used SQL DBs hence making it impossible to implement the protocol. Apart from feasibility its quite complex and requires special knowledge about CouchDB way of thinking to implement such a protocol correctly. Our goal is to allow for offline enablement of an app by simply implementing additional graphql resolvers especially which one would already be very familiar with doing so given the use of graphql as the API for normal requests. Today if you wanne offline enable an app and you are on SQL you have to switch DB technologies todo it or pay VERY prohibitive licensing fees for the enterprise replication features (which are also limited in functionality)

Last I wanne say: I understand solving replication is a hard problem. In fact I have been building replication engines for the past 10 years and prior to graphql I haven't seen a technology that would allow for a generic stack solution hence I' committed in building this out because its a hard problem. Also I'm weird and I just love replication :)

alidcast commented 5 years ago

@ntziolis appreciate you writing this out -- it exactly aligns with what I've been learning these last two days, so glad it's nicely summarized here for others.

I'll also add this quote I found from an interview with one of the Pouchch maintainers:

Offline is really difficult. It’s one of those things that’s even missing from a university computer science education. What folks don’t realize is, when you’re building an offline-first application, you are essentially building a distributed system: client and server. Just by storing data on those two nodes, you have all the theoretical problems of the CAP theorem: consistency, availability, and partition tolerance — pick two.

So if you’re building that kind of system, but don’t realize it going in, you’ll probably end up just hacking something together. You may think you got 100% of the way there, but you really only got 90%, and the remaining 10% may take years to finish. It’s taken years to fix all the edge cases in PouchDB.

the Pouchdb/Couchdb combo just seems like the most out of the box, production ready solution for offline support, which is the reason that I mentioned it. I personally prefer SQL but there doesn't seem to be an equivalent stack for it yet (I wonder if there's a particular reason for that)

regarding your third point, about exposing database specific APIs - to what extent can that even be avoided? if you look at orms like Knex.js, for example, even they expose certain fields/methods that are only available in certain SQL dialects.

totally agree with prior to graphql I haven't seen a technology that would allow for a generic stack solution, that's an exciting advent that might merit its on solution. glad to at least have people with an affinity for this sort of problem working on it 👍

ntziolis commented 5 years ago

I personally prefer SQL but there doesn't seem to be an equivalent stack for it yet (I wonder if there's a particular reason for that)

Generally there are replication solutions for SQL DBs available that work really well but they are extremely cost prohibitive (often priced per replication endpoint) and functionality wise more geared towards internal enterprise use cases.

regarding your third point, about exposing database specific APIs - to what extent can that even be avoided?

I was referring to what level of service can be expected from the replication backend. Some replication use cases possible in Pouch will be hard to generalize hence will not be supported.

if you look at orms like Knex.js, for example, even they expose certain fields/methods that are only available in certain SQL dialects

This is why we need to decouple the offline engine from the "dialect" being used. In regards to the offline engine we need have generalized requirements that allow the engine to work with all kinds of data backends. This is more about form and types of possible filters. NOT about how they are expressed.

The dialect to request data however should be up to each project. And I think herin lies the beauty of graphql as it allows each model (or even query) to use its own dialect. Think prisma vs sequelize. Each provide their own way to specify filters and pagination but both are in proper graphql still. I envision dialect plugins for most common DBs. This would also allow to expose DB specific functionality as needed both on server and on client-side.

Offline is really difficult... you’ll probably end up just hacking something together. You may think you got 100% of the way there, but you really only got 90%

100% correct. My goal wouldn't be to ever target 100% use case coverage, maybe not even 90% but instead the most common use cases to finally enable existing app stacks to allow offline capabilities even if its no where near 100%. The alternative right now is 0% offline capable for most of these projects. If you really need 100% you will have to choose a DB that has replication inherently built in, there is NO way around that. Good news is that there are great projects like CouchDB out there that do exactly that.

wtrocki commented 5 years ago

Really nice ideas in the thread so I want to summarize everything and create actionable items from this super thread.

Over the next day going to create the following issues:

Build network state aware data replication using existing Apollo API.
Build dynamic local query resolvers on top of the existing data in the cache
Investigate ways to deliver row level delta based solution on the server.

wtrocki commented 4 years ago

Small update for this. Our team took this requirement as key competency that we need to enable by the end of the 2019. We knew that we not going to be able to do it without solid base (same way that AWS app sync have done it)

Currently, we have:

Build Offix Library as a generic offline focused client that also supports generic conflict protocols. Offix has scheduler approach now
We build a code generator that helps us to generate a tedious boilerplate code called graphback. See graphback.dev
We have build libraries for supporting reliable subscriptions ( @aerogear/graphql-mqtt-subscriptions)

We have integrated libraries into popular community packages and this enabled us to really tackle this issue. For the moment the only challenge is to pick the right options for the backend.

Options

The challenge we have now is to see if we should stick with the single open source project that will enable offline diff capabilities. We cannot rely 100% of subscriptions as users should be able to get the changes even when they weren't subscribed at the time.

This is where streaming platforms like kafka come as much better alternative to AMQ

EvenSourcing/Event Log using Kafka

Kafka is designed to handle changelogs in very efficient way. Using some external libraries like Debezium can help to get a stream of the changes from the popular databases. Additionally, other backends can connect with Kafka and produce events that are happening. For some simple use cases, Kafka will be overkill.

Building a generic solution for event streaming with filter support

Generally, we could check if we can build some pluggable library that will be able to work with any general-purpose pub/sub mechanism and storage that will store data that is partitioned for well-known filter categories. Limitation for this is that filters will need to be fully available and introducing new filters will bring a lot of additional processing that will be needed to aggregate data again.

Apply event log on actual data set

Adding lastModified or anything on the actual table can give developers the ability to get diff for the changes

wtrocki commented 4 years ago

@ssd71 This is the top level issue that we have moved from offix that covers the work you are doing. As you see this is very old requirement coming from community and it is really exciting to see us finally moving that forward (on different repo). I have added it to post top level progress on the work we have done

wtrocki commented 4 years ago

Graphback.dev now supports datasynchronization in beta phase. Please check our documentation

aerogear / graphback

Delta sync #1382

Proposal for client

Related work

Options

EvenSourcing/Event Log using Kafka

Building a generic solution for event streaming with filter support

Apply event log on actual data set