Allow access_tag on content to be configured to return everything but text

jkomoros commented 1 year ago

This would let a client sense, via embeddings, that a library has content similar to what they're looking for, without seeing the content itself.

But someone with the access_token should get the full thing.

Another way of looking at this: some content has an access_tag that is configured to strip text unless the access_token is provided.

jkomoros commented 1 year ago

It would be weird if a library returned a mix of bits, some with embeddings and/or text and some without, so it's probably better for either all text to be stripped or none.

One way to do this is to have a restricted.embedding setting in host.SECRET.json. If it is set to true and omit includes text, then even restricted items will return bits, just with the text elided. The fact that this is possible would have to be advertised somehow, so that a client that hits the endpoint could realize "If I sent omit='text' I'd be able to remotely sense even restricted content".
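A minimal sketch of that filtering rule, to make the behavior concrete. Only restricted.embedding, host.SECRET.json, omit and access_tag come from this thread; the function name `visible_bits`, the `has_token` flag, and the dict shapes are assumptions for illustration, not how the host is actually implemented.

```python
# Hypothetical slice of host.SECRET.json; only restricted.embedding comes from
# the proposal above, the surrounding shape is assumed.
HOST_CONFIG = {"restricted": {"embedding": True}}


def visible_bits(bits, omit=(), has_token=False, config=HOST_CONFIG):
    """Return the bits a caller may see, minus any omitted fields."""
    if isinstance(omit, str):
        omit = [omit]  # accept omit='text' as well as a list of field names
    omit = set(omit)
    sensing_enabled = bool(config.get("restricted", {}).get("embedding", False))

    result = []
    for bit in bits:
        restricted = bit.get("access_tag") is not None
        if restricted and not has_token:
            # Without the token, a restricted bit is only returned when the
            # host opted in and the caller asked for text to be left out.
            if not (sensing_enabled and "text" in omit):
                continue
        # Dropping omitted fields from every bit keeps the response uniform:
        # either all bits carry text or none do.
        result.append({k: v for k, v in bit.items() if k not in omit})
    return result
```

With this shape, a caller without the token who sends omit='text' gets restricted bits back without their text, while a caller who asks for text gets only the unrestricted bits.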

Another thing I'm realizing while thinking through this: I'm not sure that today we validate that every bit in a library served by a host has text, embedding and token_count. We should drop bits that are missing any of those and not return them.
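The validation could be as small as the following sketch. It assumes bits are plain dicts and that the check runs on the host's stored library before any omit handling; `drop_incomplete_bits` is a hypothetical helper name.

```python
# The three fields named above; a bit missing any of them is not served.
REQUIRED_FIELDS = ("text", "embedding", "token_count")


def drop_incomplete_bits(bits):
    """Keep only bits that carry every required field (None counts as missing)."""
    return [
        bit
        for bit in bits
        if all(bit.get(field) is not None for field in REQUIRED_FIELDS)
    ]
```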

Another thing I'm realizing: in the future a host might want different settings for different access_tags. For example, maybe people with no access tag can't sense embeddings at all, those with a basic access_tag can sense embeddings, and those with all_access can get all content. Or perhaps people without an access tag get a max_count of 10, but people with all_access get a max_count of -1 (infinite).
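One possible shape for per-tag settings, sketched as a lookup table. The tag names basic and all_access and the max_count values of 10 and -1 come from the examples above; the `AccessPolicy` fields, the helper names, and the max_count chosen for basic are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    sense_embeddings: bool  # may the caller receive embeddings?
    include_text: bool      # may the caller receive text?
    max_count: int          # -1 means no limit


# Keyed by access_tag; None stands for callers with no access tag.
POLICIES = {
    None: AccessPolicy(sense_embeddings=False, include_text=False, max_count=10),
    # max_count for "basic" is not specified in the thread; 10 is a placeholder.
    "basic": AccessPolicy(sense_embeddings=True, include_text=False, max_count=10),
    "all_access": AccessPolicy(sense_embeddings=True, include_text=True, max_count=-1),
}


def policy_for(access_tag):
    """Unknown tags fall back to the most restrictive (no tag) policy."""
    return POLICIES.get(access_tag, POLICIES[None])


def clamp_count(policy, requested):
    """Apply the policy's max_count to a requested result count."""
    if policy.max_count == -1:
        return requested
    return min(requested, policy.max_count)
```

Falling back to the no-tag policy for unrecognized tags is a deliberate choice here, so that a typo in a tag never grants more access than intended.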

jkomoros commented 1 year ago

One other thing to consider: embeddings are not one-way hashes, and we should make sure we're not applying that metaphor and having it lead us to do something unsafe. One-way hashes reveal nothing about the content itself, but embeddings are rich with semantics, albeit illegible and arcane semantics. There are absolutely human-legible semantics to squeeze out of an embedding if you know how to squeeze it right. So revealing an embedding but not text might still reveal quite a bit of semantics.
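A toy illustration of the concern: anyone holding a leaked embedding can compare it against embeddings of probe phrases they choose and learn roughly what the hidden text is about. The vectors used with this sketch would be whatever the attacker computes; the function names are hypothetical and nothing here reflects a real embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def guess_topic(leaked_embedding, probes):
    """Return the probe label whose embedding is closest to the leaked one.

    `probes` maps a label to the embedding of a phrase the attacker picked.
    """
    return max(probes, key=lambda label: cosine(leaked_embedding, probes[label]))
```

A hash of the same text would give an attacker nothing to compare against, which is exactly where the metaphor breaks down.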

dglazkov / polymath

Allow access_tag on content to be configured to return everything but text #107