dglazkov / polymath

Figure out a way to host private content #26

Open jkomoros opened 1 year ago

jkomoros commented 1 year ago

Much of the content people will host is their public content, just in a more convenient format.

But some of the content will be private (e.g. unpublished notes or drafts). People will want to have that available, but only to a small list of people.

Figure out some way to protect that content.

jkomoros commented 1 year ago

Just thinking out loud about how this might work in a way that is simple and easy to get started... but might smoothly morph into something more complex in the future if people need it.

Each chunk in a library can have an access_tag. A missing or empty access_tag means it's public. Typically people have one private access tag, called private, but theoretically they could have multiple. Everything around access_tags defaults to using the tag private unless another one is provided.

When a library is opened, it can have an access_tag provided in the constructor, which automatically adds that access_tag to all chunks inside. Those access tags flow with the chunks if they are merged into a larger library. As a convenience, if the filename path a library is opened with includes access/FOO, then it will add access_tag=foo unless access_tag is explicitly passed to the constructor as ''.
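To make the constructor behavior concrete, here is a minimal sketch. Everything in it is hypothetical: the `Library` class shape, the `access_tag_from_path` helper, chunks as plain dicts, and `extend` as the merge method are illustration only, not the actual polymath API. Lowercasing the path segment is a guess based on the access/FOO to access_tag=foo example.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def access_tag_from_path(filename: str) -> Optional[str]:
    """Return the tag implied by an access/<TAG>/ directory in the path, if any."""
    parts = PurePosixPath(filename).parts
    # Stop before the last two parts so the tag is always a directory,
    # never the file name itself.
    for i in range(len(parts) - 2):
        if parts[i] == "access":
            return parts[i + 1].lower()  # assumption: tags are lowercased
    return None


class Library:
    """Hypothetical stand-in for a library: just a list of chunk dicts."""

    def __init__(self, chunks: Optional[List[Dict]] = None,
                 filename: Optional[str] = None,
                 access_tag: Optional[str] = None):
        # None means "not passed"; '' means "explicitly public".
        if access_tag is None and filename is not None:
            access_tag = access_tag_from_path(filename)
        self.chunks = [dict(c) for c in (chunks or [])]
        if access_tag:
            for chunk in self.chunks:
                chunk["access_tag"] = access_tag

    def extend(self, other: "Library") -> None:
        """Merge another library in; each chunk keeps its own access_tag."""
        self.chunks.extend(dict(c) for c in other.chunks)
```

With this sketch, `Library(chunks, filename="libraries/access/FOO/notes.json")` tags every chunk with `foo`, while passing `access_tag=''` alongside the same filename leaves the chunks public.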

There is an access.SECRET.json in the root directory, with a .gitignore that ignores any files matching *SECRET*. It is structured like:

{
  //This defaults to "private" if not explicitly set.
  "default_private_access_tag": "private",
  "tokens": {
    //user_vanity_id can be any user-understandable name, like "Alex" or "alex@komoroske.com"
    <user_vanity_id>: {
      "token": <access_token>,
      //The access_tags this token is allowed to access in this library. If it is omitted it defaults to `["private"]`
      "access_tags": ["private"]
    }
  }
}
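As a rough sketch of how the defaults in that structure could be resolved once the file is parsed: the function name `load_access_config` and the returned token-to-tags mapping are hypothetical. It also assumes the real file contains no `//` comments, since Python's `json` module rejects them, and that a missing `access_tags` list falls back to `default_private_access_tag`.

```python
import json
from typing import Dict, List


def load_access_config(raw_json: str) -> Dict[str, List[str]]:
    """Map each token string to the access_tags it is allowed to read."""
    config = json.loads(raw_json)
    # Falls back to "private" when the key is not set.
    default_tag = config.get("default_private_access_tag", "private")
    token_to_tags: Dict[str, List[str]] = {}
    for _user_vanity_id, entry in config.get("tokens", {}).items():
        # Assumption: an omitted access_tags list means [default_tag].
        token_to_tags[entry["token"]] = list(entry.get("access_tags", [default_tag]))
    return token_to_tags
```

Keying the result by token rather than by user keeps the lookup at query time to a single dictionary access.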

There is a convenience script to add items to this file: python3 -m access.main add <user_vanity_id> <access_tag_to_add_1> <access_tag_to_add_2> .... You can omit the access tags and it will default to the default private access tag. When you run the command it adds a cryptographically secure random string as a token for that user. It also prints the string to the console so it can be transmitted to the user, and reminds the owner of the library to redeploy for the changes to take effect. (There's also a script to revoke tokens for a user.)

That file is just uploaded as part of the App Engine app, and when the host boots it reads access.SECRET.json into memory to figure out which access_tokens allow access to libraries marked up with which access_tags. In the future maybe that access list is fetched from GFS or something so it doesn't have to be redeployed?
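The core of the add command could look something like the sketch below. The `add_token` function is hypothetical and works on an already-parsed config dict. It assumes that re-running the command for an existing user keeps their token and only appends tags, which the description above does not specify. The token comes from the standard library's `secrets` module.

```python
import secrets
from typing import Dict, List, Optional


def add_token(config: Dict, user_vanity_id: str,
              access_tags: Optional[List[str]] = None) -> str:
    """Grant access_tags to a user, creating a token for them if needed."""
    default_tag = config.get("default_private_access_tag", "private")
    tokens = config.setdefault("tokens", {})
    entry = tokens.get(user_vanity_id)
    if entry is None:
        # New user: generate a cryptographically secure random token.
        entry = {"token": secrets.token_urlsafe(32), "access_tags": []}
        tokens[user_vanity_id] = entry
    else:
        # Existing user with no explicit list implicitly had [default_tag].
        entry.setdefault("access_tags", [default_tag])
    for tag in (access_tags or [default_tag]):
        if tag not in entry["access_tags"]:
            entry["access_tags"].append(tag)
    return entry["token"]
```

A command-line wrapper would load access.SECRET.json, call this, write the file back, print the returned token, and remind the owner to redeploy.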

The library owner then provides that token to the user in some way. The user adds it to their client_config.SECRET.json, associated with that server endpoint. From then on their client will automatically send it to that server endpoint as the access_token parameter.

The library.query() includes an access_token parameter. The token is looked up in the access information to see which access_tags it grants access to, and any chunks tagged with an access_tag it doesn't have access to are filtered out.
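The filtering step might be sketched as follows. The names are hypothetical: `visible_chunks` stands in for whatever library.query() does internally, `token_to_tags` is an assumed in-memory mapping built from access.SECRET.json, and chunks are plain dicts. A missing or empty access_tag is treated as public, as described earlier in this thread.

```python
from typing import Dict, List, Optional


def visible_chunks(chunks: List[Dict],
                   token_to_tags: Dict[str, List[str]],
                   access_token: Optional[str] = None) -> List[Dict]:
    """Return only the chunks the given token is allowed to see."""
    # An unknown or missing token grants no tags, so only public chunks pass.
    allowed = set(token_to_tags.get(access_token, [])) if access_token else set()
    return [
        chunk for chunk in chunks
        if not chunk.get("access_tag") or chunk["access_tag"] in allowed
    ]
```

An unrecognized token is treated the same as no token here rather than raising an error. That is a design choice for the sketch, not something settled in this thread.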

There should also be a way for people querying the library endpoint to explicitly filter down to only content that includes a given access_tag, or that excludes all but a given access_tag.

Just thinking out loud, this is probably overly complicated (we could probably get away with private just being a bool if we wanted for now).

jkomoros commented 1 year ago

The purpose of the library hosts is to offer up chunks to be remixed by the completion AI, but not to be directly scraped and shown to users. We don't want them to be used as a convenient way to slurp up people's content for some other purpose, especially for content that is access restricted.

Ultimately there's no good way to fully do that (that I've come up with after thinking about it for a little bit) without having the remixing client be hosted somewhere trusted--but even then a query prompt injection could extract context.

So we can't make it impossible, but we can make it so that anyone who tries to just scrape private chunks would have to obviously know they were doing something unsupported by fighting the system, so no one can credibly claim "I had no idea I wasn't allowed."

I don't know how you'd do that either. Maybe something like having the code that accepts a library from a server and then does the remixing hash the Python file it booted from and check that it's a known hash before proceeding? Ultimately we're just trying to make it sufficiently hard to defeat that you have to obviously know you're doing something against the wishes of the content authors, but fundamentally it's not possible to guarantee anything in a federated system that has to see the cleartext.
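For what it's worth, the self-hash check could be as small as the sketch below. The helper names and the idea of a `known_hashes` set are hypothetical. As noted above, this is only a speed bump: anyone who edits the file can also edit or remove the check.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Union


def file_sha256(path: Union[str, Path]) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_known_build(path: Union[str, Path], known_hashes: Iterable[str]) -> bool:
    """True if the file at path matches one of the published hashes."""
    return file_sha256(path) in set(known_hashes)

# A remixing client might call is_known_build(__file__, KNOWN_HASHES) at
# startup and refuse to proceed when it returns False (KNOWN_HASHES is an
# assumed constant listing the hashes of released client versions).
```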

I'm really just exploring this idea in a somewhat delirious state; it might be actively terrible, or unworkable, or both. :-D

dglazkov commented 1 year ago

Please make it go and then we'll mess around with it and see how this particular bicycle rides!

jkomoros commented 1 year ago

f95cc4896b276a4219c36d0a4639643397ee4e0a, 68e13143fe468cb1c7666e80f0cc4b2e420c9e1a, d77a4ccb6d00fcea3bdc4245fa05fc9fae0c77ad, 98108cc8d16b8e3627ba23b31f7ab04424deb865 were erroneously marked as being part of #32 but are actually part of #26.