Merkel-CRDTs for data consistency

dao-xyz / peerbit

P2P database framework with encryption, sharding and search

https://peerbit.org

Apache License 2.0

167 stars 11 forks source link

Merkel-CRDTs for data consistency #43

Open TheRook opened 1 year ago

TheRook commented 1 year ago

How does peerbit make sure that a user can add a data record to the database, and make sure no one else removes it or modifies it? If i am not mistaken peerbit and orbitdb expects users to be trusted when modifying a shared collection.

A Merkel-CRDT can be used to make it hard to deny that data was ever seen and provides some source of truth without a full blockchain: https://research.protocol.ai/publications/merkle-crdts-merkle-dags-meet-crdts/psaras2020.pdf https://github.com/ipfs/go-ds-crdt

marcus-pousette commented 1 year ago

The main idea is that every peer need to run their own checks for every commit to verify whether a party can modify a state or not. This means in practice if every party is running the same ACL logic the state would end up to be the same for everyone. However if you don't assume that we can trust parties then there are multiple answers to this question depending on what kind of reliability you want. One way is to define an ACL database with Peerbit itself, then you bind this database with the canAppend and canRead hooks that you can access on setup with your database in mind, like this:

@variant("document")
class Document {
    @field({ type: "string" })
    id: string;

    constructor(props: { id: string }) {
            this.id = props.id;
    }
}

@variant("test_store")
class TestStore extends Program {
    @field({ type: Documents })
    store: Documents<Document>;

    @field({ type: IdentityAccessController })
    accessController: IdentityAccessController; // IdentityAccessController is part of Peerbit standard lib

    constructor(properties: {
        id?: string;
        identity: Identity;
        accessControllerName?: string;
    }) {
        super(properties);
        this.store = new Documents({
            index: new DocumentIndex({
                indexBy: "id",
                query: new RPC(),
            }),
        });
        this.accessController = new IdentityAccessController({
            id: properties.accessControllerName || "test-acl",
            rootTrust: properties.identity.publicKey,
        });
    }

    async setup() {
        await this.accessController.setup();
        await this.store.setup({
            type: Document,
            canRead: this.accessController.canRead.bind(this.accessController),
            canAppend: this.accessController.canRead.bind(this.accessController),
        });
    }
}

Now you might think: If the ACL of one database is defined by another database, what defines the ACL for that database?

The IdentityAccessController database/program is uniquely defined by its rootTrust publicKey, which is basically the creator of the access controller. When someone wants to add a permission to IdentityAccessController it either has to be signed by the root trust, or it has to be signed by an identity that has been trusted by the root trust (trust chain).

The hash of the program TestStore will uniquely be defined by the rootTrust of the accessController if you keep other properties constant, this means, anyone who opens TestStore by its CID will be sure that everyone else that also have it open, will observe the same root trust for the access controller. If someone tries to change the rootTrust for the ACL then all programs that rely on it will change because the hash will change, which means that it is impossible to do that without changing the address for the whole program itself.

With this way of dealing with trust, you can ~ be sure ~, or at least assume that everyone who is writing to TestStore that are part of the ACL canAppend also respects the access controller. In extension, this means that if you search for data that is defined in TestStore and make sure that you verify that the responses are signed by members of the ACL, then you can be "somewhat sure" that query responses are accurate. If you want to make sure that a party is not behaving maliciously just for your particular query, you would aggregate responses from many peers that are part of the ACL and make sure that there is a consensus about what the response should be for a particular query.

A simple example is a database that is a counter. If you query the database to check what the counter is, you could wait for at least 70% of the network to send you a response, then you could take the median value as the true value for your particular query/observation.

If you want to write an application that does not assume any kind of trust, you need to replicate all content by yourself, on your devices. If you are writing an application where resilience is not that important you can just make sure that you approve and reject commits as they come with your local ACL logic.

marcus-pousette commented 1 year ago

Other than that Peerbit uses Merkel CRDTs for consistency (Operation-based CRDTs also called commutative replicated data types, or CmRDTs )

TheRook commented 1 year ago

What interests me most is implementing a real world feature that is hard to scale on a traditional database, but doing so in a distributed database. This is a good thought experiment to find what might be missing.

In this case I am imagining a social network not unlike reddit or twitter implemented in peerbit. So you could have an append-only list that has hashtags and subscribers. But the problem comes to number of upvotes and promoting popular content.

In this case users are incentivised to lie in order to boost their own post and to obscure or deboost the competition. There are already enough fake posts on social media, we don't also need to allow these users to affect other content.

In a purely trust-less system this is hard to solve, and one might argue impossible given that fake account creation is essentially unlimited here so you can always inflate the number with fake accounts.

Let's add some real world constraints. let's say accounts are difficult to make because they have been verified in some way, and let's say that we are in a partially trusted system.

If it was trust-less, except for trusting the root. Then the root could define super peers that are trusted to provide an accurate count, or popularity listing of or more genetically a specific order of records. Using an append-only collection with signed additions or subtractions (upvotes or downvotes) would work, but it would have to be sorted locally and if you have millions of tweets going out that doesn't seem feasible.

In terms of the database, you could define a collection with row-level security which is a drop in REST replacement for some application features. So long as you don't have a malicious peer that can obscure or alter records. For sort order and a trusted counter, I can't think of any other way than a trusted super peer.

On Tue, Dec 13, 2022, 12:57 PM Marcus Pousette @.***> wrote:

Other than that Peerbit uses Merkel CRDTs for consistency (Operation-based CRDTs also called commutative replicated data types, or CmRDTs )

— Reply to this email directly, view it on GitHub https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1349678981, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN43AN74E3YM5PP4T4DWNDPLDANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.***>

marcus-pousette commented 1 year ago

I am interested in the same thing.. The idea of building private and social media application is one of the reason why this project started in the first place.

In this case users are incentivised to lie in order to boost their own post and to obscure or deboost the competition. There are already enough fake posts on social media, we don't also need to allow these users to affect other content.

Sybil resistance is a hard problem, but there are solution you can apply in the distributed settings. You can have psybil resistance as a service provided by different networks (that have different root trusts) and the user could choose to rely on one of them ( for example Worldcoin or some telephone number verification service). Then when doing an upvote, you would sign the message with multiple signatures: 1. Your signature, 2. The service that verifiies that you are a human (and perhaps have not made a vote prior). 3. A service that verifies that the timestamp is accurate. (Peerbit as of now supports multi-signatures for this very reason).

If it was trust-less, except for trusting the root. Then the root could define super peers that are trusted to provide an accurate count, or popularity listing of or more genetically a specific order of records. Using an append-only collection with signed additions or subtractions (upvotes or downvotes) would work, but it would have to be sorted locally and if you have millions of tweets going out that doesn't seem feasible.

In terms of the database, you could define a collection with row-level security which is a drop in REST replacement for some application features. So long as you don't have a malicious peer that can obscure or alter records. For sort order and a trusted counter, I can't think of any other way than a trusted super peer.

Yes, this is true and make sense. You can not circumvent scaling problems for a trusted counter. A trusted super peer that can rely trust onto other peers to distribute work is a solution that you can easily understand.

The main bottle neck I would say, is that if you shard the post database into thousands of shards (to be able to handle millions of posts per second), you are going to pay for the overhead of coordinating search on all the shards and replication reorganization when peers join and leave. If you don't do sharding, then you are going to be bounded by the capabilities of a single machine and network. I do not believe in the federation approach to solve this problem. Instead, I believe this should be a smooth function based on what you are posting. If you just want to post to your 10 subscribers, you should not pay for the computational cost of making the post available for millions of people. However if, for some reason, you would to get greater reach, you should have the opportunity to do so (and only pay the price for that by then).

.. An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys.

TheRook commented 1 year ago

Apache has two shard-management projects. One of the OtterTune is a top rated project by the Apache foundation, and monitors the health of a sharded database cluster and orchestrates its configuration in realtime. Carnegie Mellon's Database lab first published research and the code that created https://ottertune.com/ .

There is also https://github.com/apache/shardingsphere which I think is more primitive.

So if each database supernode reported back health information on a logging channel, then a trusted node run by the admin could host OtterTune to make sure that the system is efficient. Or even the databases themselves could make informed decisions on what shard key to replicate.

Filecoin runs into the oracle problem of relying on an agent to give out reliable information. One of the benefits we have here over filecoin is the use of cryptography to enforce some level of consistency. Filecoin's methods are open source, and there really isn't any magic in this oracle, there is an agent that randomly checks to make sure the data is there and valid. Another option is quorum computing to detect sybil attacks, I think this topic needs more research.

The homomorphic world has some interesting research around privacy preserving targeted advertising that is applicable to social networks. In that - lets say a collection of social media posts are shared by the #topic name, and the client can choose what collections to subscribe to, which can be used by the client to build their own feed. The smallest amount of information that holds the most amount of trust is the reactions, views and other ranking metrics - which could be hosted by a dedicated K/V instance - which could be sharded and run on cheap hardware, still being compatible with the "totally free" hacker-ethos.

https://crypto.stanford.edu/adnostic/adnostic.pdf https://eprint.iacr.org/2021/1032.pdf

On Sat, Dec 17, 2022 at 11:56 AM Marcus Pousette @.***> wrote:

I am interested in the same thing.. The idea of building private and social media application is one of the reason why this project started in the first place.

In this case users are incentivised to lie in order to boost their own post and to obscure or deboost the competition. There are already enough fake posts on social media, we don't also need to allow these users to affect other content.

Sybil resistance is a hard problem, but there are solution you can apply in the distributed settings. You can have psybil resistance as a service provided by different networks (that have different root trusts) and the user could choose to rely on one of them ( for example Worldcoin or some telephone number verification service). Then when doing an upvote, you would sign the message with multiple signatures: 1. Your signature, 2. The service that verifiies that you are a human (and perhaps have not made a vote prior). 3. A service that verifies that the timestamp is accurate. (Peerbit as of now supports multi-signatures for this very reason).

If it was trust-less, except for trusting the root. Then the root could define super peers that are trusted to provide an accurate count, or popularity listing of or more genetically a specific order of records. Using an append-only collection with signed additions or subtractions (upvotes or downvotes) would work, but it would have to be sorted locally and if you have millions of tweets going out that doesn't seem feasible.

In terms of the database, you could define a collection with row-level security which is a drop in REST replacement for some application features. So long as you don't have a malicious peer that can obscure or alter records. For sort order and a trusted counter, I can't think of any other way than a trusted super peer.

Yes, this is true and make sense. You can not circumvent scaling problems for a trusted counter. A trusted super peer that can rely trust onto other peers to distribute work is a solution that you can easily understand.

The main bottle neck I would say, is that if you shard the post database into thousands of shards (to be able to handle millions of posts per second), you are going to pay for the overhead of coordinating search on all the shards and replication reorganization when peers join and leave. If you don't do sharding, then you are going to be bounded by the capabilities of a single machine and network. I do not believe in the federation approach to solve this problem. Instead, I believe this should be a smooth function based on what you are posting. If you just want to post to your 10 subscribers, you should not pay for the computational cost of making the post available for millions of people. However if, for some reason, you would to get greater reach, you should have the opportunity to do so (and only pay the price for that by then).

.. An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys.

— Reply to this email directly, view it on GitHub https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1356445348, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN256NLCW2D2QFRIVJDWNYLIFANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.***>

TheRook commented 1 year ago

An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys.

One solution necessitates the other. If someone was upmost concerned it could be done on chain and use approved wallets who have undergone KYC/AML. Assume that a social media app still has the same security requirements for spam & abuse and fake account detection. We should assume that the app forces validation with a phone number or something more difficult to forge - the unsolved problem is making it non-trivial to boost content and repercussions for cheating. A ZK one-time update I think is needed for shared counters - prevent as much bad behavior as possible.

marcus-pousette commented 1 year ago

Apache has two shard-management projects. One of the OtterTune is a top rated project by the Apache foundation, and monitors the health of a sharded database cluster and orchestrates its configuration in realtime. Carnegie Mellon's Database lab first published research and the code that created https://ottertune.com/ . There is also https://github.com/apache/shardingsphere which I think is more primitive. So if each database supernode reported back health information on a logging channel, then a trusted node run by the admin could host OtterTune to make sure that the system is efficient. Or even the databases themselves could make informed decisions on what shard key to replicate. Filecoin runs into the oracle problem of relying on an agent to give out reliable information. One of the benefits we have here over filecoin is the use of cryptography to enforce some level of consistency. Filecoin's methods are open source, and there really isn't any magic in this oracle, there is an agent that randomly checks to make sure the data is there and valid. Another option is quorum computing to detect sybil attacks, I think this topic needs more research. The homomorphic world has some interesting research around privacy preserving targeted advertising that is applicable to social networks. In that - lets say a collection of social media posts are shared by the #topic name, and the client can choose what collections to subscribe to, which can be used by the client to build their own feed. The smallest amount of information that holds the most amount of trust is the reactions, views and other ranking metrics - which could be hosted by a dedicated K/V instance - which could be sharded and run on cheap hardware, still being compatible with the "totally free" hacker-ethos. https://crypto.stanford.edu/adnostic/adnostic.pdf https://eprint.iacr.org/2021/1032.pdf … On Sat, Dec 17, 2022 at 11:56 AM Marcus Pousette @.> wrote: I am interested in the same thing.. The idea of building private and social media application is one of the reason why this project started in the first place. In this case users are incentivised to lie in order to boost their own post and to obscure or deboost the competition. There are already enough fake posts on social media, we don't also need to allow these users to affect other content. Sybil resistance is a hard problem, but there are solution you can apply in the distributed settings. You can have psybil resistance as a service provided by different networks (that have different root trusts) and the user could choose to rely on one of them ( for example Worldcoin or some telephone number verification service). Then when doing an upvote, you would sign the message with multiple signatures: 1. Your signature, 2. The service that verifiies that you are a human (and perhaps have not made a vote prior). 3. A service that verifies that the timestamp is accurate. (Peerbit as of now supports multi-signatures for this very reason). If it was trust-less, except for trusting the root. Then the root could define super peers that are trusted to provide an accurate count, or popularity listing of or more genetically a specific order of records. Using an append-only collection with signed additions or subtractions (upvotes or downvotes) would work, but it would have to be sorted locally and if you have millions of tweets going out that doesn't seem feasible. In terms of the database, you could define a collection with row-level security which is a drop in REST replacement for some application features. So long as you don't have a malicious peer that can obscure or alter records. For sort order and a trusted counter, I can't think of any other way than a trusted super peer. Yes, this is true and make sense. You can not circumvent scaling problems for a trusted counter. A trusted super peer that can rely trust onto other peers to distribute work is a solution that you can easily understand. The main bottle neck I would say, is that if you shard the post database into thousands of shards (to be able to handle millions of posts per second), you are going to pay for the overhead of coordinating search on all the shards and replication reorganization when peers join and leave. If you don't do sharding, then you are going to be bounded by the capabilities of a single machine and network. I do not believe in the federation approach to solve this problem. Instead, I believe this should be a smooth function based on what you are posting. If you just want to post to your 10 subscribers, you should not pay for the computational cost of making the post available for millions of people. However if, for some reason, you would to get greater reach, you should have the opportunity to do so (and only pay the price for that by then). .. An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys. — Reply to this email directly, view it on GitHub <#43 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN256NLCW2D2QFRIVJDWNYLIFANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.>

Thanks for the references. I am going to check it out.

Yes filecoin among other storage providers is going with the trust-less approach and it has its benefits and drawbacks. I like to imagine the permissioned approach can give cost benifits in terms of storage and speed. Though I would still like to in the end see some tokenomics systems that would allow people will low electricity costs that maybe have a mobile phone laying around to use it participate and get some revenue overtime.

The smallest amount of information that holds the most amount of trust is the reactions, views and other ranking metrics - which could be hosted by a dedicated K/V instance - which could be sharded and run on cheap hardware, still being compatible with the "totally free" hacker-ethos.

Yes. If the data amount is small enough and there are at least 5 persons online at all times, you don't even need a dedicated kv instance since data will always be available from some peer to share.

One solution necessitates the other. If someone was upmost concerned it could be done on chain and use approved wallets who have undergone KYC/AML. Assume that a social media app still has the same security requirements for spam & abuse and fake account detection. We should assume that the app forces validation with a phone number or something more difficult to forge - the unsolved problem is making it non-trivial to boost content and repercussions for cheating. A ZK one-time update I think is needed for shared counters - prevent as much bad behavior as possible.

yes

TheRook commented 1 year ago

We tend to agree a lot, so you'll likely agree with this. Distributed systems are great and all - but did you know all of Blizzard N. America ran on a single PC? It was just an IRC server for signaliging + chat, and the games themselves were hosted by the peers. This was in the late 90s and early 2000s, and now having a gigabit line at home isn't that unusual. Sure token incentives are working for Filecoin, and that is a viable method

but unpaid Supernodes worked just fine for Skype and KaZaa which scaled to a node count that was larger than most countries.

Having a smaller set of semi-trusted nodes means there are fewer nodes to verify. If every response from the database is reputable - then at some predefined rate you could sample the same query against a collection of nodes - and if the responses are signed and there is a method of reputation gathering then clients can report sybil attacks.

But now we run into the problem where an attacker may want to report trusted nodes to suppress the honest competition so that they can cook the books. Let's say in this threat model we are thinking of an adversary who wants their post to be boosted in a social network. Weeding out trusted nodes by claiming that they are lying is one approach, and you would either need a central authority or a way to hold a fair election.

Trust carries its own value and in this system it incurs a performance penalty or currency cost. Expecting an app owner to run a server agent on their laptop or an old phone plugged into a wall outlet in the corner is a low bar and provides a central authority. This agent could be used to run something like ShardingSphere and a reputation+diagnostic system for supernodes. This can give the app owner a dashboard to view and manage the health of the cluster. An admin could manually promote other peers to admin or supernode roles or add paid resources to the cluster.

On Tue, Dec 20, 2022 at 9:23 AM Marcus Pousette @.***> wrote:

Apache has two shard-management projects. One of the OtterTune is a top rated project by the Apache foundation, and monitors the health of a sharded database cluster and orchestrates its configuration in realtime. Carnegie Mellon's Database lab first published research and the code that created https://ottertune.com/ . There is also https://github.com/apache/shardingsphere which I think is more primitive. So if each database supernode reported back health information on a logging channel, then a trusted node run by the admin could host OtterTune to make sure that the system is efficient. Or even the databases themselves could make informed decisions on what shard key to replicate. Filecoin runs into the oracle problem of relying on an agent to give out reliable information. One of the benefits we have here over filecoin is the use of cryptography to enforce some level of consistency. Filecoin's methods are open source, and there really isn't any magic in this oracle, there is an agent that randomly checks to make sure the data is there and valid. Another option is quorum computing to detect sybil attacks, I think this topic needs more research. The homomorphic world has some interesting research around privacy preserving targeted advertising that is applicable to social networks. In that - lets say a collection of social media posts are shared by the #topic name, and the client can choose what collections to subscribe to, which can be used by the client to build their own feed. The smallest amount of information that holds the most amount of trust is the reactions, views and other ranking metrics - which could be hosted by a dedicated K/V instance - which could be sharded and run on cheap hardware, still being compatible with the "totally free" hacker-ethos. https://crypto.stanford.edu/adnostic/adnostic.pdf https://eprint.iacr.org/2021/1032.pdf … <#m2088352756453540009> On Sat, Dec 17, 2022 at 11:56 AM Marcus Pousette @.> wrote: I am interested in the same thing.. The idea of building private and social media application is one of the reason why this project started in the first place. In this case users are incentivised to lie in order to boost their own post and to obscure or deboost the competition. There are already enough fake posts on social media, we don't also need to allow these users to affect other content. Sybil resistance is a hard problem, but there are solution you can apply in the distributed settings. You can have psybil resistance as a service provided by different networks (that have different root trusts) and the user could choose to rely on one of them ( for example Worldcoin or some telephone number verification service). Then when doing an upvote, you would sign the message with multiple signatures: 1. Your signature, 2. The service that verifiies that you are a human (and perhaps have not made a vote prior). 3. A service that verifies that the timestamp is accurate. (Peerbit as of now supports multi-signatures for this very reason). If it was trust-less, except for trusting the root. Then the root could define super peers that are trusted to provide an accurate count, or popularity listing of or more genetically a specific order of records. Using an append-only collection with signed additions or subtractions (upvotes or downvotes) would work, but it would have to be sorted locally and if you have millions of tweets going out that doesn't seem feasible. In terms of the database, you could define a collection with row-level security which is a drop in REST replacement for some application features. So long as you don't have a malicious peer that can obscure or alter records. For sort order and a trusted counter, I can't think of any other way than a trusted super peer. Yes, this is true and make sense. You can not circumvent scaling problems for a trusted counter. A trusted super peer that can rely trust onto other peers to distribute work is a solution that you can easily understand. The main bottle neck I would say, is that if you shard the post database into thousands of shards (to be able to handle millions of posts per second), you are going to pay for the overhead of coordinating search on all the shards and replication reorganization when peers join and leave. If you don't do sharding, then you are going to be bounded by the capabilities of a single machine and network. I do not believe in the federation approach to solve this problem. Instead, I believe this should be a smooth function based on what you are posting. If you just want to post to your 10 subscribers, you should not pay for the computational cost of making the post available for millions of people. However if, for some reason, you would to get greater reach, you should have the opportunity to do so (and only pay the price for that by then). .. An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys. — Reply to this email directly, view it on GitHub <#43 (comment) https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1356445348>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN256NLCW2D2QFRIVJDWNYLIFANCNFSM6AAAAAAS4QFQ54 https://github.com/notifications/unsubscribe-auth/AAD7MN256NLCW2D2QFRIVJDWNYLIFANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.>

Thanks for the references. I am going to check it out.

Yes filecoin among other storage providers is going with the trust-less approach and it has its benefits and drawbacks. I like to imagine the permissioned approach can give cost benifits in terms of storage and speed. Though I would still like to in the end see some tokenomics systems that in the end that would allow people will low electricity costs that maybe have a mobile phone laying around to use it participate and get some revenue overtime.

The smallest amount of information that holds the most amount of trust is the reactions, views and other ranking metrics - which could be hosted by a dedicated K/V instance - which could be sharded and run on cheap hardware, still being compatible with the "totally free" hacker-ethos.

Yes. If the data amount is small enough and there are at least 5 persons online at all times, you don't even need a dedicated kv instance since data will always be available from some peer to share.

One solution necessitates the other. If someone was upmost concerned it could be done on chain and use approved wallets who have undergone KYC/AML. Assume that a social media app still has the same security requirements for spam & abuse and fake account detection. We should assume that the app forces validation with a phone number or something more difficult to forge

the unsolved problem is making it non-trivial to boost content and repercussions for cheating. A ZK one-time update I think is needed for shared counters - prevent as much bad behavior as possible.

yes

— Reply to this email directly, view it on GitHub https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1359849276, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN7YC5QPN2VLJGOMUCLWOHTRJANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.***>

marcus-pousette commented 1 year ago

That is actually super cool, didn't know that about early Blizzard (and Skype and KaZaa). It is actually really exciting to think about how cost efficient and performant things we can build considering how capable our hand held devices and cheap laptops are and all the improved networking capabilities that exist today.

Having a smaller set of semi-trusted nodes means there are fewer nodes to verify. If every response from the database is reputable - then at some predefined rate you could sample the same query against a collection of nodes - and if the responses are signed and there is a method of reputation gathering then clients can report sybil attacks.

Yes, this was exactly how I am imagine it. But I would maybe even go further a see trust as a continuous thing where we at one end have the "trustless" nodes governed through some PoS and on the other end have devices with unreliable network and storage capacity (mobile phones etc). Nodes that are actively malicious and nodes that performs badly will have kind of the same outcome, they will degrade the network, create data-loss or malformed data and unresponsive services if people rely on them. These might be mitigated with different measures, but no matter what, you need to determine whether you are to trust a device or not, and the parameters that you have to consider is not just whether the device owner is trustworthy, but also if the device itself can be relied upon.

Regarding this, I have started working a little bit on a sharding algorithm that improves itself through mathematical optimization. You would have set of parameters that you map with a function (that you optimize) to a value that would define the probability (or slot hash width) that a peer would be elected as a leader for storing and maintaining a state. I imagine that you could condition this function on various things, like device capacity, battery life, mean network speeds, and how trusted this node is by direct peers etc.

Trust carries its own value and in this system it incurs a performance penalty or currency cost. Expecting an app owner to run a server agent on their laptop or an old phone plugged into a wall outlet in the corner is a low bar and provides a central authority. This agent could be used to run something like ShardingSphere and a reputation+diagnostic system for supernodes. This can give the app owner a dashboard to view and manage the health of the cluster. An admin could manually promote other peers to admin or supernode roles or add paid resources to the cluster.

Yes. And you need this low bar if you want adoption in the first place, because many apps today still need to run with a "server client" mindset with a central authority. What you describe later is basically a way for a developer with "server client" way of working to smoothly migrate to a "peer to peer" way of working

TheRook commented 1 year ago

Very interesting. Having, shared management is a great feature. I'll have to take a peek at your implantation.

There should be no single type of supernode. Some have a lot of bandwidth, or storage or uptime. Pairing a node with good uptime with one of a lot of bandwidth for example can make a better combined supernode. If a collection is heavily using pub/sub for big data then it may need more read replicas for example.

It would be great to set some kind of record, like 'searching the largest distributed dataset' something along those lines. I am working with Decentaralied Science (DeSci) which is apart of LabDAO on a distributed citizen sciences project on sharing biological data between citizen science projects.

String search is useful for bioinformatics, and being able to perform other searches - like regex searches would enable scientists to make use of a vast data set. Time series data and collection of sensor data from IoT devices is another use case.

The French project Constellation uses orbit, but I think peerbit would be better: https://github.com/reseau-constellation.

https://blog.nationalgeographic.org/2016/02/13/how-citizen-science-changed-the-way-fukushima-radiation-is-reported/

On Tue, Dec 20, 2022, 1:01 PM Marcus Pousette @.***> wrote:

That is actually super cool, didn't know that about early Blizzard (and Skype and KaZaa). It is actually really exciting to think about how cost efficient and performant things we can build considering how capable our hand held devices and cheap laptops are and all the improved networking capabilities that exist today.

Having a smaller set of semi-trusted nodes means there are fewer nodes to verify. If every response from the database is reputable - then at some predefined rate you could sample the same query against a collection of nodes - and if the responses are signed and there is a method of reputation gathering then clients can report sybil attacks.

Yes, this was exactly how I am imagine it. But I would maybe even go further a see trust as a continuous thing where we at one end have the "trustless" nodes governed through some PoS and on the other end have devices with unreliable network and storage capacity (mobile phones etc). Nodes that are actively malicious and nodes that performs badly will have kind of the same outcome, they will degrade the network, create data-loss and unresponsive services if people rely on them. These might be mitigated with different measures, but no matter what, you need to determine whether you are to trust a device or not, and the parameters that you have to consider is not just whether the device owner is trustworthy, but also if the device itself can be relied upon.

Regarding this, I have started working a little bit on a sharding algorithm that improves itself through mathematical optimization. You would have set of parameters that you map with a function (that you optimize) to a value that would define the probability that a peer would be elected as a leader for storing and maintaining a state. I imagine that you could condition this function on various things, like device capacity and how trusted this node is by direct peers etc.

Trust carries its own value and in this system it incurs a performance penalty or currency cost. Expecting an app owner to run a server agent on their laptop or an old phone plugged into a wall outlet in the corner is a low bar and provides a central authority. This agent could be used to run something like ShardingSphere and a reputation+diagnostic system for supernodes. This can give the app owner a dashboard to view and manage the health of the cluster. An admin could manually promote other peers to admin or supernode roles or add paid resources to the cluster.

Yes. And you need this low bar if you want adoption in the first place, because many apps today still need to run with a "server client" mindset with a central authority. What you describe later is basically a way for a developer with "server client" way of working to smoothly migrate to a "peer to peer" way of working

— Reply to this email directly, view it on GitHub https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1360237063, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MNYQGIXFBEY4QBRCMYLWOINCVANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.***>

marcus-pousette commented 1 year ago

Very interesting. Having, shared management is a great feature. I'll have to take a peek at your implantation.

I have not written the implementation for yet, but will let you know when I have anything meaningful

There should be no single type of supernode. Some have a lot of bandwidth, or storage or uptime. Pairing a node with good uptime with one of a lot of bandwidth for example can make a better combined supernode. If a collection is heavily using pub/sub for big data then it may need more read replicas for example.

Ah, yes, basically you would also then condition the function on the work that is to be performed also. That makes sense!

It would be great to set some kind of record, like 'searching the largest distributed dataset' something along those lines. I am working with Decentaralied Science (DeSci) which is apart of LabDAO on a distributed citizen sciences project on sharing biological data between citizen science projects.

String search is useful for bioinformatics, and being able to perform other searches - like regex searches would enable scientists to make use of a vast data set. Time series data and collection of sensor data from IoT devices is another use case.

Cool that you are involved with that! I am going to read more about it.

Yes, that is actually a real good use case for Peerbit since its is heavily optimized for creating sharded Document store that could distributed content and index it on a large amount of nodes, and then allow everything to be searchable at once. There is some work to improve the indexing parts of it (which we have discussed briefly before). Timeseries are also interesting, and its not hugely different from livestreaming and collecting video (which I am currently optimizing the project to be able to handle in terms of performance). Right now the video demo I am working on is actually storing Webm Matroska elements in independent "documents" in a Document store, just like how you would store text documents, so that you could shard the video segments onto multiple nodes. This could also be applied for arbitrary use-cases where if you consider the time-series events to be independent of each other. What is interesting is that for popular content, like video segments, or documents, music etc that a few people always are actively listen on, you would not need that many super nodes that replicate the content as a backup, if you trust your peers to "seed" then the content will be available somewhere.

TheRook commented 1 year ago

Maybe it would be best to focus on search related features and if you are interested I can have you meet with the LabDAO leadership to talk about distributed infrastructure for big data in the sciences.

We also have grants available for science tooling.

On Wed, Dec 21, 2022, 3:30 PM Marcus Pousette @.***> wrote:

Very interesting. Having, shared management is a great feature. I'll have to take a peek at your implantation.

I have not written the implementation for yet, but will let you know when I have anything meaningful

There should be no single type of supernode. Some have a lot of bandwidth, or storage or uptime. Pairing a node with good uptime with one of a lot of bandwidth for example can make a better combined supernode. If a collection is heavily using pub/sub for big data then it may need more read replicas for example.

Ah, yes, basically you would also then condition the function on the work that is to be performed also. That makes sense!

It would be great to set some kind of record, like 'searching the largest distributed dataset' something along those lines. I am working with Decentaralied Science (DeSci) which is apart of LabDAO on a distributed citizen sciences project on sharing biological data between citizen science projects.

String search is useful for bioinformatics, and being able to perform other searches - like regex searches would enable scientists to make use of a vast data set. Time series data and collection of sensor data from IoT devices is another use case.

Cool that you are involved with that!

Yes, that is actually a real good usecase for Peerbit since its is heavily optimized for creating sharded Document store that could be separated on a large amount of nodes, and allow everything to be searchable at once. There is some work to improve the indexing parts of it (which we have discussed briefly before). Timeseries are also interesting, and its not hugely different from livestreaming and collecting video (which I am currently optimizing the project to be able to handle in terms of performance)

— Reply to this email directly, view it on GitHub https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1362217760, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MNZAZYVWUW57HRFWKC3WOOHH5ANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.***>

marcus-pousette commented 1 year ago

Maybe it would be best to focus on search related features and if you are interested I can have you meet with the LabDAO leadership to talk about distributed infrastructure for big data in the sciences.

We also have grants available for science tooling.

Exploring different kinds of applications early on will help you understand what limitations you are building into the protocol, and will let you mitigate issues that are hard to change later. But I agree with you that it is important to focus on one thing, and do it really good if you want to bring value to the dev community at all.

I am very interested to have a meeting with the LabDAO leadership and discuss and learn about your problem domain more.

Grants are something that would help a lot at this stage!

How should we go on about this meeting? I just joined your Discord (username: marcus-pousette#1106)

TheRook commented 1 year ago

On LabDAO? Go ahead and write an introduction for yourself and post it, I'll send you a friend request and we can maybe set up a meeting sometime next week.

On Thu, Dec 22, 2022 at 3:01 PM Marcus Pousette @.***> wrote:

Maybe it would be best to focus on search related features and if you are interested I can have you meet with the LabDAO leadership to talk about distributed infrastructure for big data in the sciences.

We also have grants available for science tooling.

Exploring different kinds of applications early on will help you understand what limitations you are building into the protocol, and will let you mitigate issues that are hard to change later. But I agree with you that it is important to focus on one thing, and do it really good if you want to bring value to the dev community at all.

I am very interested to have a meeting with the LabDAO leadership and discuss and learn about your problem domain more.

Grants are something that would help a lot at this stage!

How should we go on about this meeting? I just joined your Discord

— Reply to this email directly, view it on GitHub https://github.com/dao-xyz/peerbit/issues/43#issuecomment-1363420672, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN6UPBNN5KDARCZCHHDWOTMUJANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.***>

marcus-pousette commented 1 year ago

Yes, sound good!