dglazkov / polymath

MIT License

Rationalize the server POST parameters and return type #18

Open dglazkov opened 1 year ago

dglazkov commented 1 year ago

To switch https://github.com/dglazkov/wanderer over to polymath, the server needs a way to return a random set of text chunks, reproducing the functionality here:

https://github.com/dglazkov/wanderer/blob/main/ask_embeddings.py#L119

dglazkov commented 1 year ago

Thinking maybe the endpoint API needs to have several modes:

dglazkov commented 1 year ago

Also, the API should always try to return data in the library format -- right?

dglazkov commented 1 year ago

@jkomoros penny for your thoughts

jkomoros commented 1 year ago

Yeah I was thinking about this yesterday too. I think that the data should always be returned in the library format (or a subset of it) just for simplicity about format versioning.

The date filtering use case above implies that the info property needs to also grow a timestamp field (how should it be formatted?)

If we use the library format then we'd also need a query_similarity on each chunk that is usually omitted but required when responding to a similarity query.

We'd also need a way to define that certain properties can be omitted in certain cases. For example, in a similarity_query mode, embedding can/should be dropped because the embeddings are large. And timestamp should almost always be dropped (it's only useful for date-restricted queries).

I wonder if the library format has a top-level mode enum that is one of a handful of known types, where each one has a set of required properties, making it easier to validate that your host is compliant (e.g. "I issued a query_similarity command but you returned a mode='full' response" or "Your library defined a mode of 'query' but didn't include query_similarity on chunk with id abcdef")

One final thought: if one of the modes is just top-level summary of the library (even omitting all of the actual chunks) then we should probably have a chunk_count top level field in the library format.
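
To make the mode idea concrete, here is a minimal sketch of the kind of compliance check it would allow. The mode names (`full`, `query`, `summary`), the per-mode required fields, and the `text` chunk field are all assumptions for illustration; nothing in this thread fixes them yet.

```python
# Hypothetical mapping from mode to the fields every chunk must carry.
# These names are illustrative assumptions, not part of any agreed format.
REQUIRED_CHUNK_FIELDS = {
    "full": {"text", "embedding", "info"},
    "query": {"text", "info", "query_similarity"},
    "summary": set(),
}


def validate_library(library, expected_mode):
    """Return a list of human-readable problems (empty if compliant)."""
    mode = library.get("mode")
    if mode not in REQUIRED_CHUNK_FIELDS:
        return [f"unknown mode {mode!r}"]

    errors = []
    if mode != expected_mode:
        errors.append(f"expected mode {expected_mode!r} but host returned {mode!r}")
    if mode == "summary" and "chunk_count" not in library:
        errors.append("summary response is missing 'chunk_count'")

    # Chunks are assumed to be a dict keyed by chunk id.
    for chunk_id, chunk in library.get("chunks", {}).items():
        missing = REQUIRED_CHUNK_FIELDS[mode] - set(chunk)
        for field in sorted(missing):
            errors.append(f"chunk {chunk_id!r} is missing {field!r}")
    return errors
```

A client could run this on every response and surface the messages directly, which matches the "you returned a mode='full' response" style of error described above.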

jkomoros commented 1 year ago

Playing around with how to represent the use cases. POSTs can have the following arguments:

count: An integer representing the max number of items (token or chunk, depending on 'count_type') to retrieve. Defaults to 10 if not provided. A negative number means 'all'. The endpoint might return fewer than this count.
count_type: {'token', 'chunk'} Defaults to 'chunk'. If 'token', then we will return as many chunks as fit without their total token count exceeding `count`.
query: a base64-encoded embedding of the query. May be omitted.
query_embedding_model: defaults to 'text-embedding-ada-002'
filter_before: filters out chunks whose timestamp is before this date, in ISO 8601 format. Optional.
filter_after: filters out chunks whose timestamp is after this date, in ISO 8601 format. Optional.
omit: A comma-separated list of the following keys. Any key that is included will be omitted from chunks. Defaults to 'embedding'. 
  - '*' - Exclude all info from chunks, which means the chunks dict will be empty; however, `chunk_count` in the library will be set to the number of chunks that would have been returned.
  - '' - If the value is fully empty, then all content is included.
  - 'embedding' - The embedding field is omitted
  - 'info' - The info field is omitted
sort:
  - 'similarity' - (default) sorted by descending similarity to query. If no query is provided, it's equivalent to 'any'.
  - 'any' - No particular order.
  - 'random' - A random sort seeded by `seed`
  - 'timestamp' - Descending ordered by timestamp
sort_reversed: '1' if the sort should be ascending instead of descending.
seed: A string to seed the random sort in order to get deterministic results. If not provided, defaults to a seed derived from Date.now()
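
A rough host-side sketch of how `count`, `count_type`, `sort`, `sort_reversed`, `seed` and `omit` could combine. It assumes chunks arrive as a dict keyed by id, that each chunk has a hypothetical `token_count` field, and that timestamps live at `info.timestamp` as ISO 8601 strings in one consistent format. The 'similarity' sort is left out here and simply behaves like 'any'.

```python
import random
import time


def select_chunks(chunks, count=10, count_type="chunk", sort="any",
                  sort_reversed=False, seed=None, omit="embedding"):
    """Order, truncate and strip chunks according to the POST arguments."""
    ids = sorted(chunks)  # fixed starting order so a given seed is reproducible
    if sort == "random":
        # Fall back to a time-derived seed when none is supplied.
        rng = random.Random(seed if seed is not None else str(time.time()))
        rng.shuffle(ids)
    elif sort == "timestamp":
        # Same-format UTC ISO 8601 strings sort chronologically as text.
        ids.sort(key=lambda i: chunks[i]["info"]["timestamp"], reverse=True)
    if sort_reversed:
        ids.reverse()

    # Take chunks in order until the chunk or token budget would be exceeded.
    selected, used = [], 0
    for chunk_id in ids:
        cost = chunks[chunk_id]["token_count"] if count_type == "token" else 1
        if count >= 0 and used + cost > count:
            break
        selected.append(chunk_id)
        used += cost

    omitted = {key for key in omit.split(",") if key}
    result = {"chunk_count": len(selected), "chunks": {}}
    if "*" not in omitted:
        for chunk_id in selected:
            result["chunks"][chunk_id] = {
                k: v for k, v in chunks[chunk_id].items() if k not in omitted
            }
    return result
```

With `omit='*'` the result carries only `chunk_count`, which is the "top-level summary" case; a negative `count` disables the budget entirely.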

ChatGPT tells me a best practice for timestamps is to use ISO 8601 formatting, e.g. YYYY-MM-DDTHH:mm:ss.sssZ
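
For reference, a small sketch of producing and parsing that format with the Python standard library. The helper names are made up for this example. Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so the sketch rewrites it to `+00:00` first.

```python
from datetime import datetime, timezone


def to_iso8601(dt):
    """Format an aware datetime as YYYY-MM-DDTHH:mm:ss.sssZ in UTC.

    A naive datetime would be interpreted as local time by astimezone().
    """
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def from_iso8601(value):
    """Parse a timestamp written by to_iso8601 back into an aware datetime."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)
```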

dglazkov commented 1 year ago

Are we building a query language 😆 ?

jkomoros commented 1 year ago

Maybe?

Two more thoughts:

1) Do I have the semantics of filter_before and filter_after backwards? Imagine that in the future we add e.g. filter_url, which only includes chunks whose info.url matches a URL pattern. The way I wrote the before/after semantics above, they filter OUT things before or after the given date instead of filtering IN.

2) Should query actually be query_embedding? In the future there might be cases where the raw text of the query is included and the embedding is calculated on the host. That would allow a library with a different embedding_model to interoperate seamlessly, at the cost of pushing API usage to the hosts and opening up a line of DoS-style attacks to exhaust their API key. If we do that, the most obvious key for the raw-text query is just query.
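
To illustrate the "filter IN" reading from point 1, a minimal sketch follows. It assumes each chunk stores an ISO 8601 string at `info.timestamp` and that the comparisons are strict; both are guesses, since neither is settled in this thread.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def filter_chunks(chunks, filter_before=None, filter_after=None):
    """Keep only chunks strictly before filter_before and after filter_after."""
    before = parse_timestamp(filter_before) if filter_before else None
    after = parse_timestamp(filter_after) if filter_after else None

    kept = {}
    for chunk_id, chunk in chunks.items():
        timestamp = parse_timestamp(chunk["info"]["timestamp"])
        if before is not None and not timestamp < before:
            continue
        if after is not None and not timestamp > after:
            continue
        kept[chunk_id] = chunk
    return kept
```

Under this reading the two filters compose into a date window, and a later filter_url could follow the same include-only pattern.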

jkomoros commented 1 year ago

OK, updating based on above:

version: required. the version number the client speaks. The host should return a library with a version number at or below this number. If it cannot, it should return an error.
count: An integer representing the max number of items (token or chunk, depending on 'count_type') to retrieve. Defaults to 10 if not provided. A negative number means 'all'. The endpoint might return fewer than this count.
count_type: {'token', 'chunk'} Defaults to 'chunk'. If 'token', then we will return as many chunks as fit without their total token count exceeding `count`.
query_embedding: a base64-encoded embedding of the query. May be omitted. If provided, each chunk returned will also include a 'similarity' property, a float representing the similarity to the query, the dot product of the query_embedding and each chunk's embedding.
query_embedding_model: The embedding type used for `query_embedding`. Required if `query_embedding` is set. Currently the value must be 'text-embedding-ada-002' if it's set.
filter_before: includes only chunks whose timestamp is before this date, in ISO 8601 format. Optional.
filter_after: includes only chunks whose timestamp is after this date, in ISO 8601 format. Optional.
omit: A comma-separated list of the following keys. Any key that is included will be omitted from chunks. Defaults to 'embedding'. 
  - '*' - Exclude all info from chunks, which means the chunks dict will be empty; however, `chunk_count` in the library will be set to the number of chunks that would have been returned.
  - '' - If the value is fully empty, then all content is included.
  - 'embedding' - The embedding field is omitted
  - 'info' - The info field is omitted
sort:
  - 'similarity' - (default) sorted by descending similarity to query. If no query is provided, it's equivalent to 'any'.
  - 'any' - No particular order.
  - 'random' - A random sort seeded by `seed`
  - 'timestamp' - Descending ordered by timestamp
sort_reversed: '1' if the sort should be ascending instead of descending.
seed: A string to seed the random sort in order to get deterministic results. If not provided, defaults to a seed derived from Date.now()
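
As a sketch of the client side of this proposal, the snippet below builds the POST arguments and computes the dot-product `similarity`. The wire format for the base64 embedding (little-endian float32) is an assumption, since the thread does not specify it, and the function names are invented for the example.

```python
import base64
import struct


def encode_embedding(vector):
    """Pack floats as little-endian float32 and base64-encode (assumed format)."""
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(packed).decode("ascii")


def decode_embedding(encoded):
    """Inverse of encode_embedding: base64 text back to a list of floats."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def similarity(query_embedding, chunk_embedding):
    """Dot product of the query embedding and a chunk's embedding."""
    return sum(q * c for q, c in zip(query_embedding, chunk_embedding))


def build_request(version, query_vector=None, **options):
    """Assemble POST arguments; the model name is required with an embedding."""
    args = {"version": version, **options}
    if query_vector is not None:
        args["query_embedding"] = encode_embedding(query_vector)
        args["query_embedding_model"] = "text-embedding-ada-002"
    return args
```

For example, `build_request(1, [0.5, -0.25], count=5, omit="embedding")` yields a dict ready to be form-encoded, while omitting the vector leaves both query fields out.
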
jkomoros commented 1 year ago