Open jkomoros opened 1 year ago
Just thinking out loud about how this might work in a way that is simple and easy to get started... but might smoothly morph into something more complex in the future if people need it.
Each chunk in a library can have an access_tag. A missing or empty access_tag means it's public. Typically people have one private access tag, called private, but theoretically they could have multiple. Everything around access_tags defaults to using the tag private unless another one is provided.
When a library is opened, it can have an access_tag provided in the constructor, and that automatically adds that access_tag to all chunks inside. Those access tags will flow with the chunks if they are merged into a larger library. As a convenience, if the filename a library is opened with includes access/FOO in its path, then it will add access_tag=foo unless access_tag is explicitly passed to the constructor as ''.
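A minimal sketch of that path convention, assuming the tag is the directory immediately after a directory named access, and that None means "not passed". The helper name infer_access_tag is hypothetical and not part of the library:

```python
from pathlib import PurePosixPath
from typing import Optional


def infer_access_tag(filename: str, access_tag: Optional[str] = None) -> str:
    """Return the access tag for a library file; '' means public.

    Hypothetical helper: the name and signature are assumptions.
    """
    # An explicit value, including '', always wins over the path convention.
    if access_tag is not None:
        return access_tag
    parts = PurePosixPath(filename).parts
    # Look for a directory named "access" followed by another directory,
    # and use that following directory (lowercased) as the tag.
    for i, part in enumerate(parts[:-1]):
        if part == "access" and i + 1 < len(parts) - 1:
            return parts[i + 1].lower()
    return ""
```

Passing '' explicitly opts a file out of the convention even when it lives under access/FOO/.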
There is an access.SECRET.json in the root directory, with a .gitignore that ignores any files matching *SECRET*. It is structured like:
{
  //This defaults to "private" if not explicitly set.
  "default_private_access_tag": "private",
  "tokens": {
    //user_vanity_id can be any user-understandable name, like "Alex" or "alex@komoroske.com"
    <user_vanity_id>: {
      "token": <access_token>,
      //The access_tags this token is allowed to access in this library. If it is omitted it defaults to `["private"]`
      "access_tags": ["private"]
    }
  }
}
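As a rough sketch of how a host might turn that structure into a lookup table, assuming the file has already been parsed into a dict (comments removed) and that an omitted access_tags falls back to the configured default private tag. The function name tags_by_token is hypothetical:

```python
import json
from typing import Dict, Set

DEFAULT_PRIVATE_TAG = "private"


def tags_by_token(access_data: dict) -> Dict[str, Set[str]]:
    """Map each access token to the set of access tags it may read."""
    # Assumption: the default tag comes from the file, else "private".
    default_tag = access_data.get("default_private_access_tag", DEFAULT_PRIVATE_TAG)
    result: Dict[str, Set[str]] = {}
    for _vanity_id, entry in access_data.get("tokens", {}).items():
        # An omitted access_tags list grants only the default private tag.
        result[entry["token"]] = set(entry.get("access_tags", [default_tag]))
    return result


# Illustrative data only; the tokens and names are made up.
example = json.loads("""
{
  "tokens": {
    "Alex": {"token": "tok-alex"},
    "Sam": {"token": "tok-sam", "access_tags": ["private", "drafts"]}
  }
}
""")
lookup = tags_by_token(example)
```

Keying the table by token rather than by vanity ID makes the per-query check a single dictionary lookup.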
There is a convenience script to add items to this file. A command of python3 -m access.main add <user_vanity_id> <access_tag_to_add_1> <access_tag_to_add_2> ...
. You can omit the access.SECRET.json
into memory to figure out which access_tokens allow access to libraries marked up with which access_tags. In the future maybe that access list is fetched from GFS or something so it doesn't have to be redeployed?
The library owner provides that token to the user in some way. The user adds it in their client_config.SECRET.json, affiliated with their server endpoint. Now their client will automatically send it to that server endpoint as the access_token parameter.
library.query() includes an access_token parameter. The token is looked up in the access information to see which access_tags it grants access to, and any chunks it doesn't have access to are filtered out.
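The filtering step could look something like the sketch below. It assumes chunks are dicts with an optional access_tag key and that the token lookup is a plain dict from token to allowed tags; visible_chunks is a hypothetical name, not the library's actual API:

```python
from typing import Dict, Iterable, List, Optional, Set


def visible_chunks(
    chunks: Iterable[dict],
    access_token: Optional[str],
    token_tags: Dict[str, Set[str]],
) -> List[dict]:
    """Keep public chunks plus those whose tag the token is allowed to read."""
    # Unknown or missing tokens grant nothing, so the filter fails closed.
    allowed = token_tags.get(access_token, set()) if access_token else set()
    return [
        chunk
        for chunk in chunks
        # A missing or empty access_tag means the chunk is public.
        if not chunk.get("access_tag") or chunk["access_tag"] in allowed
    ]


# Illustrative data only.
chunks = [
    {"text": "public note"},
    {"text": "draft", "access_tag": "private"},
    {"text": "team memo", "access_tag": "team"},
]
token_tags = {"tok-alex": {"private"}}
```

With this data, a query carrying tok-alex sees the public note and the draft, while a query with no token or an unrecognized one sees only the public note.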
There should also be a way for people querying the library endpoint to explicitly filter down to only content that includes a given access_tag, or that excludes all but a given access_tag.
Just thinking out loud, this is probably overly complicated (we could probably get away with private just being a bool if we wanted for now).
The purpose of the library hosts is to offer up chunks to be remixed by the completion AI, but not to be directly scraped and shown to users. We don't want them to be used as a convenient way to slurp up people's content for some other purpose, especially for content that is access restricted.
Ultimately there's no good way to fully do that (that I've come up with after thinking about it for a little bit) without having the remixing client be hosted somewhere trusted--but even then a query prompt injection could extract context.
So we can't make it impossible, but we can make it so that anyone who tries to just scrape private chunks would have to obviously know they were doing something unsupported by fighting the system, so no one can credibly claim "I had no idea I wasn't allowed."
I don't know how you'd do that either. Maybe something like having the code that accepts a library from a server and then does the remixing hashes its own python file it booted from and checks that it's a known hash before proceeding? Ultimately we're just trying to make it sufficiently hard to defeat that you have to obviously know you're doing something against the wishes of the content authors, but fundamentally it's not possible to guarantee anything in a federated system that has to see the cleartext.
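For what it's worth, the self-hash idea might be sketched like this. The function names and the known-hash list are assumptions, and anyone who edits the client can also edit the check, so it only raises the bar rather than guaranteeing anything:

```python
import hashlib
from pathlib import Path
from typing import Iterable


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of the file at path."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_known_build(path: str, known_hashes: Iterable[str]) -> bool:
    """True if the file's digest is in the list of approved digests."""
    # Hypothetical check: a remixing client could call this on the script
    # it booted from (for example its own __file__) before proceeding.
    return file_sha256(path) in set(known_hashes)
```

A client would refuse to process restricted chunks when is_known_build returns False, which makes tampering a deliberate act rather than an accident.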
This idea I'm really just exploring in a somewhat delirious state, this might be an actively terrible idea or unworkable or any of the above. :-D
Please make it go and then we'll mess around with it and see how this particular bicycle rides!
access/FOO/
access.SECRET.json
unpublished access tag.
access.host add script
access.host revoke script
--access-tag to grant
host.SECRET.json config.
host access.grant config.
host access grant
config_file through every library constructor
--mode {dev,prod*} which sets the default filename to host.dev.SECRET.json (prod omits any mode tag in filename)
host.SECRET.json be passed in via a Config object that has getters, etc
access.host should be able to pass a different config file
Authentication: Bearer {token} (update the debug query form in the GET)
details.counts.private_chunks and an access_message that is rendered in the response and in the GET endpoint
host.SECRET.json provides access to all tags. But make sure it has to be explicitly set, so things fail closed.
endpoint in host.SECRET.json that configures where the production endpoint is.
set endpoint command that sets the endpoint parameter
set restricted.message command
set restricted.count command
set endpoint doesn't include a trailing \
access grant should also print to output f"{endpoint}/?SECRET=sk-key-123" if endpoint is set (and Call set endpoint ... to set the endpoint to print out an easy access key
f95cc4896b276a4219c36d0a4639643397ee4e0a, 68e13143fe468cb1c7666e80f0cc4b2e420c9e1a, d77a4ccb6d00fcea3bdc4245fa05fc9fae0c77ad, 98108cc8d16b8e3627ba23b31f7ab04424deb865 were erroneously marked as being part of #32 but are actually part of #26.
Much of the content people will host is their public content, just in a more convenient format.
But some of the content will be private (e.g. unpublished notes or drafts). People will want to have that available, but only to a small list of people.
Figure out some way to protect that content.