Figure out a way to host private content

jkomoros commented 1 year ago

Much of the content people will host is their public content, just in a more convenient format.

But some of the content will be private (e.g. unpublished notes or drafts). People will want to have that available, but only to a small list of people.

Figure out some way to protect that content.

jkomoros commented 1 year ago

Just thinking out loud about how this might work in a way that is simple and easy to get started... but might smoothly morph into something more complex in the future if people need it.

Each chunk in a library can have an access_tag. A missing or empty access_tag means it's public. Typically people have one private access tag, called private, but theoretically they could have multiple. Everything around access_tags defaults to using the tag private unless another one is provided.

When a library is opened, an access_tag can be provided in the constructor, which automatically adds that access_tag to all chunks inside. Those access tags will flow with each chunk if it is merged into a larger library. As a convenience, if the filename a library is opened with includes access/FOO in its path, then it will add access_tag=foo unless access_tag is explicitly passed to the constructor as ''.
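To make the constructor rules above concrete, here is a minimal sketch of how the tag might be resolved. The function names are hypothetical, not the library's real API, and the sketch keeps the directory name exactly as written (whether FOO should be lowercased to foo is not settled above). The `True` shorthand mirrors the checklist item later in this thread.

```python
from pathlib import PurePosixPath
from typing import Optional, Union

DEFAULT_PRIVATE_ACCESS_TAG = "private"


def access_tag_from_path(filename: str) -> Optional[str]:
    """Return the directory name that follows an 'access' path segment, if any."""
    parts = PurePosixPath(filename).parts
    for i, part in enumerate(parts[:-1]):
        # The segment after 'access' must be a directory, not the file itself.
        if part == "access" and i + 1 < len(parts) - 1:
            return parts[i + 1]
    return None


def resolve_access_tag(
    filename: str, access_tag: Union[str, bool, None] = None
) -> Optional[str]:
    """Decide which access_tag a library opened from `filename` should apply."""
    if access_tag == "":
        # An explicit empty string opts out of the filename convenience.
        return None
    if access_tag is True:
        # Shorthand for "use the default private tag".
        return DEFAULT_PRIVATE_ACCESS_TAG
    if access_tag:
        return access_tag
    return access_tag_from_path(filename)
```

Under these assumptions, `resolve_access_tag("libraries/access/unpublished/notes.json")` picks up the tag from the path, while passing `''` explicitly keeps the library public.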

There is an access.SECRET.json in the root directory, with a .gitignore that ignores any files matching *SECRET*. It is structured like:

{
  //This defaults to "private" if not explicitly set.
  "default_private_access_tag": "private",
  "tokens": {
    //user_vanity_id can be any user-understandable name, like "Alex" or "alex@komoroske.com"
    <user_vanity_id>: {
      "token": <access_token>,
      //The access_tags this token is allowed to access in this library. If it is omitted it defaults to `["private"]`
      "access_tags": ["private"]
    }
  }
}
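As a rough illustration of how a host might read this file at boot, the sketch below builds a token-to-tags lookup. The function names are hypothetical. It assumes the file on disk is strict JSON (the comments in the listing above would have to be stripped), that a missing file grants nothing, and that an omitted access_tags list falls back to default_private_access_tag.

```python
import json
from typing import Dict, Set

DEFAULT_PRIVATE_ACCESS_TAG = "private"


def tags_by_token(access_config: dict) -> Dict[str, Set[str]]:
    """Map each access token to the set of access_tags it may read."""
    default_tag = access_config.get(
        "default_private_access_tag", DEFAULT_PRIVATE_ACCESS_TAG
    )
    result: Dict[str, Set[str]] = {}
    for _vanity_id, entry in access_config.get("tokens", {}).items():
        # A missing access_tags list falls back to the default private tag.
        result[entry["token"]] = set(entry.get("access_tags", [default_tag]))
    return result


def load_access_file(path: str = "access.SECRET.json") -> Dict[str, Set[str]]:
    """Read the access file; a missing file grants no access (assumption)."""
    try:
        with open(path, encoding="utf-8") as f:
            return tags_by_token(json.load(f))
    except FileNotFoundError:
        return {}
```

Keying the lookup by token rather than by vanity id means a query only needs one dictionary lookup to find the tags it may see.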

There is a convenience script to add items to this file: python3 -m access.main add <user_vanity_id> <access_tag_to_add_1> <access_tag_to_add_2> .... You can omit the access tags, and it will default to the default private access tag. When you run the command it will generate a cryptographically secure random string as the token for that user. It will also print the string to the console for transmitting to the user, and remind the owner of the library to redeploy for the changes to take effect. (There's also a script to revoke tokens for a user.)

That file is just uploaded as part of the App Engine app, and when the host boots it reads access.SECRET.json into memory to figure out which access_tokens allow access to libraries marked up with which access_tags. In the future maybe that access list is fetched from GFS or something so it doesn't have to be redeployed?
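The core of the add and revoke scripts could look something like the sketch below. `grant` and `revoke` are hypothetical helper names operating on the already-parsed access file; reading and writing the file and the command-line parsing are left out. The token comes from Python's standard `secrets` module, which is intended for this kind of security-sensitive random string.

```python
import secrets
from typing import Iterable, Optional


def grant(
    access_config: dict,
    user_vanity_id: str,
    access_tags: Optional[Iterable[str]] = None,
) -> str:
    """Create a fresh token for the user and record it in the config dict."""
    default_tag = access_config.get("default_private_access_tag", "private")
    # 32 random bytes, URL-safe encoded, so the token can be pasted anywhere.
    token = secrets.token_urlsafe(32)
    tokens = access_config.setdefault("tokens", {})
    tokens[user_vanity_id] = {
        "token": token,
        "access_tags": list(access_tags) if access_tags else [default_tag],
    }
    return token


def revoke(access_config: dict, user_vanity_id: str) -> bool:
    """Remove the user's entry; return True if something was removed."""
    return access_config.get("tokens", {}).pop(user_vanity_id, None) is not None
```

In this sketch, granting again for the same user replaces the old token, which doubles as a simple rotation mechanism.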

The library owner provides that token to the user in some way. The user adds it to their client_config.SECRET.json, associated with that server endpoint. Now their client will automatically send it to that server endpoint as the access_token parameter.

library.query() includes an access_token parameter. The token is looked up in the access information to see which access_tags it grants access to, and the query filters out any chunks the token doesn't have access to.
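A minimal sketch of that filtering step, assuming chunks are plain dicts with an optional "access_tag" key and that the access information has already been turned into a token-to-tags mapping (both are assumptions for illustration, not the real library representation):

```python
from typing import Dict, Iterable, List, Optional, Set


def visible_chunks(
    chunks: Iterable[dict],
    tags_by_token: Dict[str, Set[str]],
    access_token: Optional[str] = None,
) -> List[dict]:
    """Keep public chunks plus any whose access_tag the token unlocks."""
    # An unknown or missing token unlocks nothing, so the filter fails closed.
    allowed = tags_by_token.get(access_token, set()) if access_token else set()
    return [
        chunk
        for chunk in chunks
        # A missing or empty access_tag means the chunk is public.
        if not chunk.get("access_tag") or chunk["access_tag"] in allowed
    ]
```

Note that an invalid token is treated the same as no token here: the caller still gets the public chunks, just nothing restricted.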

There should also be a way for people querying the library endpoint to explicitly filter down to only content that includes a given access_tag, or that excludes all but a given access_tag.

Just thinking out loud; this is probably overly complicated (we could probably get away with private just being a bool if we wanted, for now).

jkomoros commented 1 year ago

The purpose of the library hosts is to offer up chunks to be remixed by the completion AI, but not to be directly scraped and shown to users. We don't want them to be used as a convenient way to slurp up people's content for some other purpose, especially for content that is access restricted.

Ultimately there's no good way to fully do that (that I've come up with after thinking about it for a little bit) without having the remixing client be hosted somewhere trusted--but even then a query prompt injection could extract context.

So we can't make it impossible, but we can make it so that anyone who tries to just scrape private chunks would have to obviously know they were doing something unsupported by fighting the system, so no one can credibly claim "I had no idea I wasn't allowed."

I don't know how you'd do that either. Maybe something like having the code that accepts a library from a server and then does the remixing hashes its own python file it booted from and checks that it's a known hash before proceeding? Ultimately we're just trying to make it sufficiently hard to defeat that you have to obviously know you're doing something against the wishes of the content authors, but fundamentally it's not possible to guarantee anything in a federated system that has to see the cleartext.
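For what it's worth, the self-hash idea might look roughly like the sketch below. The function names and the notion of a set of known hashes are hypothetical, and as noted above this is only a speed bump: anyone who edits the file can also edit or remove the check.

```python
import hashlib
import sys
from typing import Optional, Set


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in blocks so large files don't need to fit in memory.
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def is_known_build(path: str, known_hashes: Set[str]) -> bool:
    """True if the file's hash is in the set of published hashes."""
    return file_sha256(path) in known_hashes


def running_known_build(
    known_hashes: Set[str], script_path: Optional[str] = None
) -> bool:
    """Check the script this process was started from (sys.argv[0] by default)."""
    return is_known_build(script_path or sys.argv[0], known_hashes)
```

A remixing client could call `running_known_build(...)` before accepting a library and refuse to proceed on a mismatch, which at least makes bypassing it a deliberate act.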

I'm really just exploring this idea in a somewhat delirious state; this might be an actively terrible idea or unworkable or any of the above. :-D

dglazkov commented 1 year ago

Please make it go and then we'll mess around it and see how this particular bicycle rides!

jkomoros commented 1 year ago

[x] Library constructor takes access_tag property.
[x] Access_tags are stripped out of the JSON before being serialized unless include_access_tag is True.
[x] Library() constructor has sugar for a filename that includes access/FOO/
[x] Library handles access_tag=True --> access_tag = DEFAULT_PRIVATE_ACCESS_TAG
[x] Create an access.SECRET.json (gitignore), and document format
[x] Load access.SECRET.json if it exists
[x] Allow specifying an access file different than access.SECRET.json
[x] Library.query() filters out any items that have an access_tag in them.
[x] Verify that access_token really works and filters out items that don't have access_tag but allows access to items that do
[x] Medium importer can be used to generate a draft output with unpublished access tag.
[x] Library.query() accepts an access_token and keeps any items that have it.
[x] Client should be able to be passed a different config file
[x] Have a client.SECRET.json file that can store access tokens
[x] Client sends the access_token if it exists.
[x] Create an access.host add script
[x] Create an access.host revoke script
[x] Add a way to pass multiple --access-tag to grant
[x] Move the default file to host.SECRET.json
[x] Move the tool to config.host access.grant
[x] Make the tool have sub commands like config.host access grant
[ ] Wire through config_file through every library constructor
[x] Add schemas for library file, client.json, host.json (new issue)
[ ] Add --mode {dev,prod*} which sets the default filename to host.dev.SECRET.json (prod omits any mode tag in filename)
[ ] Consider having the host.SECRET.json be passed in via a Config object that has getters, etc
[ ] access.host should be able to pass a different config file
[ ] Access token should be passed as an Authorization: Bearer {token} header (update the debug query form in the GET)
[ ] Add host.SECRET.json:restricted.message to the GET for the endpoint too
[x] The README should document the best practice of private content without all of the overhead about access tags
[x] There should be some way for a server to communicate that it has private content. E.g. an opt into details.counts.private_chunks and an access_message that is rendered in the response and in the GET endpoint
[x] Truncate items in personal production items
[ ] Allow access tokens to be skipped in development (easiest way is probably turning off Library constructors auto-setting access_tags).
[ ] Allow a way to specify that a given token in host.SECRET.json provides access to all tags. But make sure it has to be explicitly set, so things fail closed.
[ ] Allow an endpoint in host.SECRET.json that configures where the production endpoint is.
[x] Add a set endpoint command that sets the endpoint parameter
[x] Add a set restricted.message command
[x] Add a set restricted.count command
[ ] Validate that set endpoint doesn't include a trailing \
[ ] Add a command to give an example or documentation for each property
[x] The output of access grant should also print f"{endpoint}/?SECRET=sk-key-123" if endpoint is set (and otherwise print "Call set endpoint ... to set the endpoint to print out an easy access key")
[ ] (Read through the description above to see what else I'm missing in these TODOs)
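
For the open item about passing the token in a header instead of as a query parameter, a host-side helper might look like the sketch below. `bearer_token` is a hypothetical name; it assumes the standard HTTP Authorization header with the Bearer scheme, and it returns None for anything malformed so that callers fall back to public-only access.

```python
from typing import Optional


def bearer_token(header_value: Optional[str]) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' header value."""
    if not header_value:
        return None
    scheme, _, token = header_value.strip().partition(" ")
    token = token.strip()
    # The scheme name is case-insensitive; anything else is rejected.
    if scheme.lower() != "bearer" or not token:
        return None
    return token
```

The result can then be handed to the same lookup that the access_token parameter uses today, so the two transports share one code path.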

jkomoros commented 1 year ago

f95cc4896b276a4219c36d0a4639643397ee4e0a, 68e13143fe468cb1c7666e80f0cc4b2e420c9e1a, d77a4ccb6d00fcea3bdc4245fa05fc9fae0c77ad, 98108cc8d16b8e3627ba23b31f7ab04424deb865 were erroneously marked as being part of #32 but are actually part of #26.

dglazkov / polymath

Figure out a way to host private content #26