earthstar-project / earthstar

Storage for private, distributed, offline-first applications.
https://earthstar-project.org
GNU Lesser General Public License v3.0
633 stars 20 forks

Sync queries (a.k.a. selective sync, replication queries) #6

Closed (cinnamon-bun closed this issue 2 years ago)

cinnamon-bun commented 4 years ago

During a sync, apps should be able to specify filters for incoming and outgoing data. This could use the same QueryOpts type we already have.

Pseudocode:

syncer.setIncomingSyncFilters([
    { pathPrefix: 'wiki/' },
    { pathPrefix: 'about/' },
]);
syncer.setOutgoingSyncFilters([
    { author: '@aaa' },  // only upload our own data
]);
syncer.sync(url);

When you supply multiple filters they get OR'd together. In other words, we send things that match ANY filter.

When starting a sync, we'll send the incoming filters to the other peer so they can avoid sending us things we don't want. We'll also apply the incoming filters on our end in case the other peer didn't pay attention.

Data we send will be filtered by the outgoing filters. This lets you upload data only from people you trust (yourself, people you follow, or anyone you haven't blocked).
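
A minimal sketch of the matching rule described above, assuming a simplified document shape and only the pathPrefix and author fields of QueryOpts. The names Doc, SyncFilter, matchesFilter, matchesAny and applyIncomingFilters are hypothetical and are not part of Earthstar's API.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface Doc {
    path: string;
    author: string;
    content: string;
}

interface SyncFilter {
    pathPrefix?: string;
    author?: string;
}

// Within one filter, every field that is set must match (AND).
function matchesFilter(doc: Doc, filter: SyncFilter): boolean {
    if (filter.pathPrefix !== undefined && !doc.path.startsWith(filter.pathPrefix)) {
        return false;
    }
    if (filter.author !== undefined && doc.author !== filter.author) {
        return false;
    }
    return true;
}

// Across filters, a document passes if it matches ANY of them (OR).
function matchesAny(doc: Doc, filters: SyncFilter[]): boolean {
    return filters.some((filter) => matchesFilter(doc, filter));
}

// Run by the receiver even when the sender was given the same filters,
// in case the other peer did not pay attention to them.
function applyIncomingFilters(docs: Doc[], incoming: SyncFilter[]): Doc[] {
    return docs.filter((doc) => matchesAny(doc, incoming));
}
```

Note that in this sketch an empty filter list matches nothing, because `some` returns false for an empty array. That is only one possible reading of "no filters".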

Questions to resolve

Related conversations

sgwilym commented 4 years ago

There is a rough version of this in earthstar-graphql, where both sides of a sync are informed of what the other side wants and send only a subset of documents built from that information. Each side trusts that the other side is actually giving it only what it asks for. There is no outgoing filter as of yet.

  1. Peer A sends a GraphQL query to Peer B
    • Peer A queries for documents Peer B holds that conform to its sync filters
    • Peer A also queries Peer B’s sync filters
  2. Peer A compiles a list of documents conforming to Peer B’s sync filters
  3. Peer A sends these documents to Peer B using an ingestDocuments mutation
  4. Peer B ingests the documents from Peer A
    • Peer B currently trusts Peer A to have filtered the documents correctly.
  5. Peer A ingests the documents received from step 1
    • Peer A currently trusts Peer B to have filtered the documents correctly.
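
The steps above can be sketched with in-memory peers standing in for the GraphQL endpoints. This is an illustration under assumptions, not the earthstar-graphql code: the Peer class, its query method and syncPeers are hypothetical names, and only ingestDocuments is borrowed from the description.

```typescript
// Hypothetical in-memory stand-ins; a real peer would be reached over GraphQL.
interface Doc {
    path: string;
    author: string;
}

interface SyncFilter {
    pathPrefix?: string;
}

class Peer {
    docs = new Map<string, Doc>();
    syncFilters: SyncFilter[];

    constructor(syncFilters: SyncFilter[]) {
        this.syncFilters = syncFilters;
    }

    // Documents this peer holds that conform to the given filters.
    query(filters: SyncFilter[]): Doc[] {
        return Array.from(this.docs.values()).filter((doc) =>
            filters.some((f) => f.pathPrefix === undefined || doc.path.startsWith(f.pathPrefix))
        );
    }

    // Stores documents without re-checking them: the sender is trusted.
    ingestDocuments(docs: Doc[]): void {
        for (const doc of docs) {
            this.docs.set(doc.path, doc);
        }
    }
}

function syncPeers(a: Peer, b: Peer): void {
    // Step 1: A asks B for documents matching A's filters, and for B's filters.
    const fromB = b.query(a.syncFilters);
    const filtersOfB = b.syncFilters;
    // Step 2: A compiles the documents that conform to B's filters.
    const forB = a.query(filtersOfB);
    // Steps 3 and 4: A sends them and B ingests them.
    b.ingestDocuments(forB);
    // Step 5: A ingests what it received in step 1.
    a.ingestDocuments(fromB);
}
```

Neither ingestDocuments call re-applies the receiver's own filters, which mirrors the trust noted in steps 4 and 5.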

Some thoughts on the questions in the first post:

Would these filters need to be different for different peers / pubs you're syncing with, instead of universal?

The earthstar-graphql implementation applies globally to all workspaces stored on a pub, but rather than having settings per peer/pub, I think having separate sync filters per workspace is the one to aim for.

Would we specify a sort order (like newest first) in the query?

If all the documents are going to be ingested anyway, what could the sort order be used for?

Would we want to sync in stages of priority, like "/about/" first, then "/wiki/"?

Is the idea that smaller requests would be more resilient to adverse network conditions?

cinnamon-bun commented 4 years ago

separate sync filters per workspace is the one to aim for.

I agree, yeah.

sort order & stages of priority

The idea here was to improve the initial sync experience for new users and people on slow connections. The first time they sync, there might be a lot of data to fetch -- we want to fetch the most important data first.

The best user experience would hypothetically be something like

first {pathPrefix: '/about/', sort: 'newestFirst'}
then {pathPrefix: '/wiki/', sort: 'newestFirst'}
then {pathPrefix: '/largeImageData/', sort: 'newestFirst'}

...but it would depend on the application.

The efficient sync algorithm will depend on both sides agreeing on a sort order... maybe we can let the pulling side win (i.e. the incoming filter), since that filter will be known by both sides.

When supplying multiple queries like this, I'm not sure if it will happen as multiple iterations of the sync algorithm, or one giant iteration. Probably multiple.
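
A rough sketch of the "multiple iterations" option, assuming each stage is one pass over the remote documents and that documents carry a numeric timestamp. StageQuery and syncInStages are hypothetical names, and the remote side is simulated with a plain array.

```typescript
interface Doc {
    path: string;
    timestamp: number;
}

// Hypothetical shape: one query per priority stage.
interface StageQuery {
    pathPrefix: string;
    sort: 'newestFirst';
}

// Runs one pass per query, in priority order, and returns the documents
// in the order they would arrive.
function syncInStages(remoteDocs: Doc[], stages: StageQuery[]): Doc[] {
    const received: Doc[] = [];
    const seen = new Set<string>();
    for (const stage of stages) {
        const batch = remoteDocs
            // Skip documents already fetched by an earlier stage.
            .filter((doc) => doc.path.startsWith(stage.pathPrefix) && !seen.has(doc.path))
            // Newest first within the stage.
            .sort((x, y) => y.timestamp - x.timestamp);
        for (const doc of batch) {
            seen.add(doc.path);
            received.push(doc);
        }
    }
    return received;
}
```

Documents that match no stage are never fetched in this sketch, so an application wanting everything would end with a catch-all stage.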

Is the idea that smaller requests would be more resilient to adverse network conditions?

Not really, but it's easier to write code to get batches of data instead of trying to stream it. I think the sync algorithm will fetch a batch of documents at a time, iteratively, using the {limit: 1000} query setting.

This goes with my general principle of "avoid streams". I know some people swear by them but they're hard to figure out, especially in some languages.
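
A minimal sketch of fetching in batches instead of streaming. Only the limit setting comes from the discussion; the startAfterPath cursor, the FetchBatch type and fetchAll are assumptions made for illustration, and a real fetch would be asynchronous.

```typescript
interface Doc {
    path: string;
}

// `limit` is from the thread; `startAfterPath` is a hypothetical cursor.
interface BatchQuery {
    limit: number;
    startAfterPath: string | null;
}

// Assumed to return documents sorted by path, strictly after the cursor.
type FetchBatch = (query: BatchQuery) => Doc[];

function fetchAll(fetchBatch: FetchBatch, limit: number): Doc[] {
    if (limit <= 0) {
        throw new Error('limit must be positive');
    }
    const all: Doc[] = [];
    let cursor: string | null = null;
    while (true) {
        const batch: Doc[] = fetchBatch({ limit, startAfterPath: cursor });
        all.push(...batch);
        // A short batch means the other side has nothing left to send.
        if (batch.length < limit) {
            break;
        }
        cursor = batch[batch.length - 1].path;
    }
    return all;
}
```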

cinnamon-bun commented 4 years ago

How to query for nothing?

In https://github.com/earthstar-project/earthstar-graphql/releases/tag/v4.0.1, @sgwilym asks:

I think these two should be semantically different: an undefined sync filter means the pub has no preference on documents, whereas an empty one would mean the pub is accepting nothing. (???)

Currently queries work like this:

{
    // An empty query object returns all documents.

    // Each of the following adds an additional filter,
    // narrowing down the results further.

    pathPrefix?: string,  // Paths starting with prefix.
    // etc
}

So an incoming sync filter of {} means "I want all documents"; an outgoing filter of {} means "I will give all documents I have".

If we don't want to give / receive ANY documents in a sync, here are 4 ways to do that:

My feelings are:

This is made more complicated because pubs are supposed to have an array of incoming queries, and an array of outgoing queries. Documents that match ANY query in the array will be sent.

So...

Queries in other places in Earthstar

It's tempting to generalize B, to accept null queries anywhere that queries are used in Earthstar. But this would mean changing the Storage query functions, since I don't like mixing undefined and null (seems like a recipe for mistakes):

// currently: omitting the query means the same as setting it to {}: get all documents
documents(query?: QueryOpts)

// the new way?  an argument is required.
// null means "no documents", {} means all documents
documents(query: QueryOpts | null)
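
A small sketch of how the second signature might behave, assuming an in-memory store and a QueryOpts with only pathPrefix. MemoryStorage is a hypothetical class, not Earthstar's Storage.

```typescript
interface Doc {
    path: string;
}

interface QueryOpts {
    pathPrefix?: string;
}

// Hypothetical in-memory store, for illustration only.
class MemoryStorage {
    private docs: Doc[];

    constructor(docs: Doc[]) {
        this.docs = docs;
    }

    // null means "no documents"; {} means "all documents".
    documents(query: QueryOpts | null): Doc[] {
        if (query === null) {
            return [];
        }
        const prefix = query.pathPrefix;
        if (prefix === undefined) {
            return this.docs.slice();
        }
        return this.docs.filter((doc) => doc.path.startsWith(prefix));
    }
}
```

Because the argument is required, a caller cannot get every document by accident; they have to write `{}` explicitly.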

In conclusion: 🤷 ?

sgwilym commented 4 years ago

Using null to signify something so meaningful seems laden with danger to me 😬

On reflection, an empty array meaning something is similarly ambiguous to me. I wonder if a more explicit typing would work better?

type SyncFilters = {
    pathPrefixes: string[],
    authorsByVersions: string[],
} | "FILTER_EVERYTHING"

type PubConfig = {
    otherStuff?: Whatever,
    incomingFilters?: SyncFilters,
    outgoingFilters: SyncFilters,
}

// All documents are accepted and sent with these configs:

{
    incomingFilters: {},
}

{
    incomingFilters: { 
        pathPrefixes: []
    },
    outgoingFilters: null
}

// Documents are not accepted

{
    incomingFilters: 'FILTER_EVERYTHING',
}

// Documents are not sent

{
    outgoingFilters: 'FILTER_EVERYTHING',
    incomingFilters: {
        pathPrefixes: ["/gossip"]
    }
}
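
One way the configs above could be evaluated, as a sketch only. It assumes the fields are optional so that `{}` type-checks, treats a missing, null or empty filter as "no preference", and leaves out authorsByVersions. The accepts function is a hypothetical name.

```typescript
// Assumption: fields are optional here so that `{}` is a valid value.
type SyncFilters = { pathPrefixes?: string[] } | 'FILTER_EVERYTHING';

// Returns true if a document at `path` passes the given filters.
function accepts(filters: SyncFilters | undefined | null, path: string): boolean {
    if (filters === 'FILTER_EVERYTHING') {
        return false; // explicit "nothing gets through"
    }
    if (filters === undefined || filters === null) {
        return true; // no preference
    }
    const prefixes = filters.pathPrefixes;
    if (prefixes === undefined || prefixes.length === 0) {
        return true; // no prefixes listed: no preference
    }
    return prefixes.some((prefix) => path.startsWith(prefix));
}
```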

cinnamon-bun commented 4 years ago

How about

{ limit: 0 }

...as a query that matches nothing?

sgwilym commented 4 years ago

How would that be applied? Like this?

{
  incoming: { limit: 0 },
  outgoing: { pathPrefixes: ["/wiki"] }
}

It's simple. But is there any meaning/use to setting an incoming filter of { limit: 10 }?

cinnamon-bun commented 4 years ago

But is there any meaning/use to setting an incoming filter of { limit: 10 }?

Maybe if you wanted to have "just the 10 most recently edited docs"?

{
    limit: 10,
    sort: "recent",  // (sort order hasn't been defined yet, but is probably coming soon)
}
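
A sketch of how limit and sort could interact, assuming documents carry a numeric timestamp and that "recent" means newest first. Since the sort order has not been defined yet, the Query shape and runQuery here are hypothetical.

```typescript
interface Doc {
    path: string;
    timestamp: number;
}

// Hypothetical query shape; `sort` is an assumption, not a defined option.
interface Query {
    pathPrefix?: string;
    limit?: number;
    sort?: 'recent';
}

function runQuery(docs: Doc[], query: Query): Doc[] {
    // filter() returns a new array, so sorting it below leaves `docs` untouched.
    let result = docs.filter(
        (doc) => query.pathPrefix === undefined || doc.path.startsWith(query.pathPrefix)
    );
    if (query.sort === 'recent') {
        result = result.sort((x, y) => y.timestamp - x.timestamp);
    }
    // The limit is applied last, so { limit: 0 } always yields nothing.
    if (query.limit !== undefined) {
        result = result.slice(0, query.limit);
    }
    return result;
}
```

With this ordering, `{ limit: 0 }` matches nothing and `{ limit: 10, sort: "recent" }` keeps the 10 most recently edited documents.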