dglazkov / polymath


More savvy ways of selecting chunks for the context #14

Open jkomoros opened 1 year ago

jkomoros commented 1 year ago

Now we will be selecting chunks from across a number of endpoints.

The simplest way to select chunks for the context is to split the available context tokens evenly across the number of sources, and then fill up that much context from each source. But some sources will have more similar chunks than others.
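A minimal sketch of that even-split strategy, assuming each source exposes chunks as `(similarity, token_count, text)` tuples already sorted by similarity (the names `sources` and `context_token_budget` are illustrative, not the actual polymath API):

```python
def even_split_context(sources, context_token_budget):
    """Give every source an equal slice of the token budget."""
    per_source_budget = context_token_budget // max(len(sources), 1)
    selected = []
    for chunks in sources.values():
        used = 0
        for similarity, token_count, text in chunks:
            if used + token_count > per_source_budget:
                break
            selected.append((similarity, text))
            used += token_count
    return selected
```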

Another approach is to fetch enough chunks from each endpoint that any one endpoint's chunks could fill the whole context on its own. Then merge them all together and take the most similar, walking down the list until all of the context is filled.
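A sketch of that merge-then-rank strategy under the same assumed chunk shape: pool a full context's worth of candidates from every endpoint and walk down the global similarity ranking until the budget is spent.

```python
def merged_context(sources, context_token_budget):
    """Pool all candidates and take the globally most similar chunks."""
    pooled = [chunk for chunks in sources.values() for chunk in chunks]
    pooled.sort(key=lambda chunk: chunk[0], reverse=True)  # most similar first
    selected, used = [], 0
    for similarity, token_count, text in pooled:
        if used + token_count > context_token_budget:
            continue  # skip chunks that don't fit, keep looking for ones that do
        selected.append((similarity, text))
        used += token_count
    return selected
```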

But this has two additional problems: 1) it might lead to one very chatty endpoint dominating all of the context, and 2) one relevant but verbose piece of context might take all the space.

For that reason, the ranking of chunks should probably take the overall similarity and divide it by the number of tokens, a bang-for-buck score per token. Second, we should twiddle the similarity based on the log of the number of tokens already selected from that source (or some other fall-off function). That means we'll try to select at least one chunk from each source unless its chunks really aren't good.
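A sketch of that "bang for buck plus fall-off" ranking. The exact penalty shape is an assumption: here the score is `similarity / token_count`, damped by the log of the tokens already taken from the same source, which nudges selection toward covering every source at least once.

```python
import math

def falloff_context(sources, context_token_budget):
    """Greedily pick the best-scoring chunk that still fits, penalizing
    sources that have already contributed a lot of tokens."""
    tokens_per_source = {name: 0 for name in sources}
    remaining = {name: list(chunks) for name, chunks in sources.items()}
    selected, used = [], 0
    while used < context_token_budget:
        best = None
        for name, chunks in remaining.items():
            for i, (similarity, token_count, text) in enumerate(chunks):
                if used + token_count > context_token_budget:
                    continue
                # Bang-for-buck, damped by how much this source has given already.
                score = (similarity / token_count) / (
                    1 + math.log1p(tokens_per_source[name]))
                if best is None or score > best[0]:
                    best = (score, name, i)
        if best is None:
            break  # nothing left that fits
        _, name, i = best
        similarity, token_count, text = remaining[name].pop(i)
        selected.append((similarity, text))
        tokens_per_source[name] += token_count
        used += token_count
    return selected
```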

Perhaps there should be multiple strategies that are all configurable.

Perhaps the runner of the final client should also be able to configure boosts for endpoints, based on how much they want each endpoint to contribute to the final result.
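A hypothetical configuration sketch for such boosts; the URLs and key names are illustrative, not the actual polymath config format. A boost would simply multiply whatever chunk score the selected strategy produces.

```python
# Hypothetical per-endpoint boosts set by the client runner.
ENDPOINT_BOOSTS = {
    "https://polymath.example.com/": 1.0,
    "https://another-host.example.com/": 0.5,  # contribute less to the result
}

def boosted_score(endpoint, score):
    """Scale a chunk's score by its endpoint's configured boost."""
    return score * ENDPOINT_BOOSTS.get(endpoint, 1.0)
```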

Related to #8.

jkomoros commented 1 year ago

Hmmm, investigating get_chunk_infos_for_library and how to handle sorted things, I realize that I think I based the design on something ChatGPT hallucinated: that dict keys in Python are in order.

That implies that the first order of business is to make libraries actually maintain their intended sort order, in a way that survives serialization and deserialization.
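One possible sketch of making the order explicit so it survives a JSON round trip, assuming chunks live in a dict keyed by chunk id (the "sort"/"content" field names are assumptions, not the actual library format): store the ordered ids alongside the chunk map rather than relying on key order.

```python
import json

def serialize_library(chunks_by_id, sorted_ids):
    """Write the intended order explicitly instead of trusting dict key order."""
    return json.dumps({
        "sort": sorted_ids,        # explicit chunk-id order
        "content": chunks_by_id,   # chunk id -> chunk data
    })

def deserialize_library(raw):
    """Rebuild the chunk map in the intended order on load."""
    data = json.loads(raw)
    return {chunk_id: data["content"][chunk_id] for chunk_id in data["sort"]}
```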