dglazkov opened this issue 1 year ago
Thinking maybe the endpoint API needs to have several modes:
Also, API should always try to return data in the library format -- right?
@jkomoros penny for your thoughts
Yeah I was thinking about this yesterday too. I think that the data should always be returned in the library format (or a subset of it) just for simplicity about format versioning.
The date filtering use case above implies that the `info` property needs to also grow a `timestamp` field (how should it be formatted?)
If we use the library format then we'd also need a `query_similarity` on each chunk that is usually omitted but required when responding to a similarity query.
We'd also need a way to define that certain properties can be omitted in certain cases. For example, in a `query_similarity` mode, `embedding` can/should be dropped because the embeddings are large. And `timestamp` should almost always be dropped (it's only useful for date-restricted queries).
I wonder if the library format should have a top-level `mode` enum that is one of a handful of known types, where each one has a set of required properties, making it easier to validate that your host is compliant (e.g. "I issued a `query_similarity` command but you returned a `mode='full'` response" or "Your library defined a mode of 'query' but didn't include `query_similarity` on the chunk with id `abcdef`").
One final thought: if one of the modes is just a top-level summary of the library (even omitting all of the actual chunks) then we should probably have a `chunk_count` top-level field in the library format.
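To make the `mode` idea concrete, here's a minimal sketch of a compliance validator. The mode names (`full`, `query`, `summary`) and their required-property sets are assumptions for illustration, not a settled spec:

```python
# Hypothetical sketch: check that a library response satisfies the
# required properties implied by its declared top-level mode.
REQUIRED_BY_MODE = {
    "full": {"embedding", "info"},
    "query": {"query_similarity"},
    "summary": set(),  # chunks may be omitted; chunk_count is required instead
}

def validate_library(library: dict) -> list[str]:
    """Return a list of compliance errors (empty means the library is valid)."""
    errors = []
    mode = library.get("mode")
    if mode not in REQUIRED_BY_MODE:
        return [f"unknown mode: {mode!r}"]
    if mode == "summary" and "chunk_count" not in library:
        errors.append("mode='summary' requires a top-level chunk_count")
    required = REQUIRED_BY_MODE[mode]
    for chunk_id, chunk in library.get("chunks", {}).items():
        missing = required - chunk.keys()
        if missing:
            errors.append(f"chunk {chunk_id} missing {sorted(missing)}")
    return errors
```

This would let a client produce exactly the kind of error messages above, e.g. flagging a `mode='query'` library whose chunk `abcdef` lacks `query_similarity`.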
Playing around with how to represent the use cases. POSTs can have the following arguments:
count: An integer representing the max number of items (token or chunk, depending on 'count_type') to retrieve. Defaults to 10 if not provided. A negative number means 'all'. The endpoint might return fewer than this count.
count_type: {'token', 'chunk'} Defaults to 'chunk'. If 'token', then the endpoint returns as many whole chunks as fit without exceeding `count` tokens in total.
query: a base64-encoded embedding of the query. May be omitted.
query_embedding_model: defaults to 'text-embedding-ada-002'
filter_before: filters out chunks whose timestamp is before this date, in ISO 8601 format. Optional.
filter_after: filters out chunks whose timestamp is after this date, in ISO 8601 format. Optional.
omit: A comma-separated list of the following keys. Any key that is included will be omitted from chunks. Defaults to 'embedding'.
- '*' - Exclude all fields from chunks, meaning the chunks dict will be entirely empty; however, `chunk_count` in the library will be set to how many chunks would have been returned.
- '' - If the value is fully empty, then all content is included.
- 'embedding' - The embedding field is omitted
- 'info' - The info field is omitted
sort:
- 'similarity' - (default) sorted by descending similarity to query. If no query is provided, it's equivalent to 'any'.
- 'any' - No particular order.
- 'random' - A random sort seeded by `seed`
- 'timestamp' - Descending ordered by timestamp
sort_reversed: '1' if the sort should be ascending instead of descending.
seed: A string to seed the random sort in order to get deterministic results. If not provided, defaults to a seed derived from Date.now()
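As a concrete illustration, here's a sketch of what a client request body might look like with these arguments. The endpoint and values are placeholders; only the parameter names come from the list above:

```python
# Sketch: encode the proposed POST arguments as a form body.
from urllib.parse import urlencode

params = {
    "count": 5,
    "count_type": "chunk",
    "filter_after": "2023-01-01T00:00:00.000Z",
    "omit": "embedding,info",
    "sort": "timestamp",
    "sort_reversed": "1",
}
# POST this as application/x-www-form-urlencoded to the (hypothetical) endpoint.
body = urlencode(params)
```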
ChatGPT tells me a best practice for timestamps is to use ISO 8601 formatting, e.g. YYYY-MM-DDTHH:mm:ss.sssZ
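For what it's worth, producing exactly that `YYYY-MM-DDTHH:mm:ss.sssZ` shape in Python takes a small tweak, since `isoformat()` emits `+00:00` rather than `Z`:

```python
# Format a datetime as ISO 8601 with millisecond precision and a 'Z' suffix.
from datetime import datetime, timezone

def iso_timestamp(dt: datetime) -> str:
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
```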
Are we building a query language 😆 ?
Maybe?
Two more thoughts:
1) Do I have the semantics of `filter_before` and `filter_after` backwards? E.g. imagine in the future we add a `filter_url` that only includes chunks whose info.url matches a URL pattern. But the way I wrote the before/after filtering semantics above, they filter OUT things before or after the given date instead of filtering IN.
2) Should `query` actually be `query_embedding`? In the future there might be cases where the raw text of the query is included and the embedding is calculated on the host (that would allow a library with a different embedding_model to interoperate seamlessly, at the cost of pushing API usage to the hosts and opening up a line of DoS-style attacks to exhaust their API key). If we do that, the most obvious key for the raw-text query is just `query`.
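Either way, the embedding has to be packed into the base64 argument somehow. A sketch of one possible wire encoding (little-endian float32; the exact encoding is an assumption, not part of the proposal yet):

```python
# Sketch: pack a float list into the base64 query embedding argument,
# and unpack it on the host side. Assumes little-endian float32.
import base64
import struct

def encode_embedding(embedding: list[float]) -> str:
    raw = struct.pack(f"<{len(embedding)}f", *embedding)
    return base64.b64encode(raw).decode("ascii")

def decode_embedding(data: str) -> list[float]:
    raw = base64.b64decode(data)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```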
OK, updating based on above:
version: required. the version number the client speaks. The host should return a library with a version number at or below this number. If it cannot, it should return an error.
count: An integer representing the max number of items (token or chunk, depending on 'count_type') to retrieve. Defaults to 10 if not provided. A negative number means 'all'. The endpoint might return fewer than this count.
count_type: {'token', 'chunk'} Defaults to 'chunk'. If 'token', then the endpoint returns as many whole chunks as fit without exceeding `count` tokens in total.
query_embedding: a base64-encoded embedding of the query. May be omitted. If provided, each chunk returned will also include a 'similarity' property, a float representing the similarity to the query, the dot product of the query_embedding and each chunk's embedding.
query_embedding_model: The embedding type used for `query_embedding`. Required if `query_embedding` is set. Currently the value must be 'text-embedding-ada-002' if it's set.
filter_before: includes only chunks whose timestamp is before this date, in ISO 8601 format. Optional.
filter_after: includes only chunks whose timestamp is after this date, in ISO 8601 format. Optional.
omit: A comma-separated list of the following keys. Any key that is included will be omitted from chunks. Defaults to 'embedding'.
- '*' - Exclude all fields from chunks, meaning the chunks dict will be entirely empty; however, `chunk_count` in the library will be set to how many chunks would have been returned.
- '' - If the value is fully empty, then all content is included.
- 'embedding' - The embedding field is omitted
- 'info' - The info field is omitted
sort:
- 'similarity' - (default) sorted by descending similarity to query. If no query is provided, it's equivalent to 'any'.
- 'any' - No particular order.
- 'random' - A random sort seeded by `seed`
- 'timestamp' - Descending ordered by timestamp
sort_reversed: '1' if the sort should be ascending instead of descending.
seed: A string to seed the random sort in order to get deterministic results. If not provided, defaults to a seed derived from Date.now()
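The similarity computation described for `query_embedding` could look something like this (a sketch; it assumes embeddings are unit-normalized, as text-embedding-ada-002 vectors are, so the dot product is the cosine similarity):

```python
# Sketch: score each chunk by dot product against the query embedding,
# attach the 'similarity' property, and return the top `count` chunks.
def similarity(query_embedding, chunk_embedding):
    return sum(q * c for q, c in zip(query_embedding, chunk_embedding))

def top_chunks(chunks, query_embedding, count=10):
    scored = [
        {**chunk, "similarity": similarity(query_embedding, chunk["embedding"])}
        for chunk in chunks
    ]
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    return scored if count < 0 else scored[:count]
```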
- `chunk_count` to library
- `query_embedding` isn't provided
- `count`
- `count_type`
- `omit`
- `random` sort
- `any` sort
- `sort_reversed`
- `seed`
- `info` fields (including to the converters)
- `filter_before` and `filter_after`
To switch https://github.com/dglazkov/wanderer to rely on polymath, it needs a way to query a random set of text chunks, to reproduce this functionality here:
https://github.com/dglazkov/wanderer/blob/main/ask_embeddings.py#L119
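The `sort=random` + `seed` combination above should cover that use case. A minimal sketch of the seeded random selection (the fallback seed mirrors the Date.now()-derived default described earlier):

```python
# Sketch: deterministic random chunk selection driven by the `seed` argument.
import random
import time

def random_chunks(chunks, count=10, seed=None):
    if seed is None:
        seed = str(time.time())  # analogous to a Date.now()-derived seed
    rng = random.Random(seed)
    shuffled = list(chunks)
    rng.shuffle(shuffled)
    return shuffled if count < 0 else shuffled[:count]
```

The same `seed` always yields the same order, so clients can page through a stable random sample.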