Figure out ways to discourage scraping

jkomoros commented 1 year ago

Originally explored a bit in #26 but we might want to do something more.

The point of a Polymath endpoint is not "scrape up all my bits of content" it's "select relevant bits of content for the purpose of doing a polymath query." There's no way to fully prevent the former, but we can make it very, very clear that that is Not What You Should Be Doing, by documenting clearly that that is not what a Polymath host is opting into, using things like robots.txt, and just in general requiring would-be-scrapers to crawl through some broken glass to scrape, so they might question whether maybe they just shouldn't.

There are a number of things we might do, and I might explore them in this issue at some point.

But a few no brainers:

[ ] Document clearly in the README that scraping is not the intention of these endpoints
[ ] Make it so hosts by default cap the amount or results so they can't be used to slurp in the entire library. (Although there is still a use case of 'omit=*' with all content to see how big a library is)
[ ] Add a robots.txt or similar

jkomoros commented 1 year ago

OK, developing a wild, likely over-complex idea that might not be workable. Building on an idea already sketched out a bit in #26.

Conceptually, the architecture of all things that fit into the polymath universe look like:

ui - the interface that is exposed to the end-user for a polymath mixer. Sometimes a CLI, sometimes a simple webapp, sometimes a complex webapp.
mixer (currently called client). Its job is to take a query, compute its embedding, reach out to 1..n endpoints to fetch bits of content, then combine the bits and the query into a completion prompt to OpenAI to get the answer to then pass back to the UI. It needs access to an OPENAI_API_KEY because it needs to compute an embedding live, and also do the completion with the prompt that includes the content bits. Sometimes it is literally implemented in the UI layer, e.g. the simple client-side webapp served up at a polymath endpoint for GETs, and sometimes it is running on a server. Its secret information is its OPENAI_API_KEY.
host The frontend endpoint that serves up bits of content that mixers might select from and mix into an answer. Some of its content might be private or protected via access_tokens, and it might want to discourage wholesale scraping. Its secret information is access information and configuration of the endpoint--which might be stored in host.SECRET.json or Firestore. Each host is connected to 1 to n backing stores. If the host has n backing stores (e.g. a multi-tenant situation) it is responsible for selecting the right backing store to reach out to based on the API request (e.g. switching based on the hostname).
store Where the bits of content are actually stored. It needed to have access to an OPEN_AI_API at content ingestion time but doesn't need to do anything special at serving time. The store might just be a library JSON file in a GAE instance as part of a host, or e.g. Pinecone.

The simplest polymath is a simple CLI that works with a mixer locally that talks to a host that just loads up all of the library files in libraries/, but it can get considerably more complex, where each part is on a different computer.

OK, let's layer on top of this a scheme that makes scraping hard. (Remember: a clever user can always extract bits of content with a cleverly constructed query to do prompt injection. Our goal is just to make accidental or faux-accidental scraping harder.)

First, note that what a mixer does is actually pretty simple. It takes a query from the UI, a secret OpenAI_API_Key it knows or is provided from the end-user by the ui, reaches out to a number of hosts to fetch bits of content, and then formulates a completion to pass to OpenAI. It's desirable that the mixer not actually pass the direct bits of content back up to the UI, since the ui might be a scraper. The bits of content comes in from the host, gets remixed into a full completion, goes to OpenAI, and the completion--but not the raw bits of content--is piped back to the UI.

Typically we imagine that whoever hosts the ui is also hosting the mixer (and sometimes the mixer is literally running clientside in a webapp UI). But imagine a headless mixer being operated by a third party service as a generic piece of infrastructured at a known location., e.g https://mixer.polymath.community. It is totally stateless. With each request it takes a user's query, an OpenAI key to use, a list of endpoints to reach out to (possibly with access tokens). Mixers have to be trusted by both hosts and uis. Hosts have to trust that it will not pass the unencrypted bits of content to anyone but OpenAI directly, and UIs have to trust that the mixer won't steal or store the user's OpenAI key or access tokens.

A mixer would have a public key, published in a known location relative to its endpoint, and keep its private key secret.

Let's imagine that certain hosts might insist on encrypting their bits of content that they return to one of the public keys of a set of mixers they enumerate, so that untrusted clients reaching out (e.g. a UI reaching out directly) will not be able to read the content. Only trusted mixers will be able to decrypt the content.

If the ui reaches out directly to a host that insists on encryption, it would get back a response indicating that the host requires using a pre-enumerated mixer. The ui would then choose which of the mixers it also trusted (remember that both the host and the ui need to trust the mixer in different ways) and reach out to it and ask it to reach out to the host on its behalf. The mixer would then reach out to the host and request that the content the host returns be encrypted by its public-key. The host would see that the mixer/public-key pair is one of its pre-enumerated trusted ones and encrypt the content. The mixer can then use its private key to decrypt the content, mix it in with content from other hosts, pass the whole prompt to OpenAI, and return the result to the UI.

This scheme would require uis to enumerate mixers they trust and are willing to work with, and hosts to also enumerate mixers they are willing to work with, and there would have to be an overlap of mutually trusted mixers. (Hosts and uis should never just blindly trust a mixer the other party suggested, because the host and ui both need to trust the mixer for different reasons). Mixers wouldn't need to do very much compute so would be cheap to operate; I imagine that there would only ever need to be a handful in operation (to have redundancy and a bit of decentralization) that everyone tended to trust.

In a world where there's some kind of remote attestation scheme that was widely trusted that could certify that a given mixer was running a VM associated with a certain SHA of a public git commit, then hosts and uis could trust mixers they hadn't pre-enumerated, as long as they trusted the remote attestation scheme (which would reduce down to "I trust the GCP, Azure, GCP, and Cloudflare remote attestation schemes")

All of this is just thinking out loud. It's probably wildly overcomplicated, completely unnecessary, totally naive, or some combination of all three.

dglazkov commented 1 year ago

The concept of mixer is very important. Thank you, Alex.

dalmaer commented 1 year ago

Yah, so much of the magic is:

getting the right context...
and when there are multiple shared brains to mix in...
and eventually working out which brains to bring in.

I'm seeing this with things like: helping my build something that used Polaris, Remix, and Preact when I have 3 shared brains to work with.

dglazkov / polymath

Figure out ways to discourage scraping #94