dust-tt / dust

Amplify your team's potential with customizable and secure AI assistants.
https://dust.tt
MIT License

Expose a tokenizer function in code blocks #419

Open happysalada opened 1 year ago

happysalada commented 1 year ago

The main idea would be to split text into windows of tokens so that each window fits into an LLM's context window. Example: take these answers, group them into chunks of 4,000 tokens, summarize each chunk, then group and summarize recursively until you have a single 4,000-token chunk that can be used to answer the original question.
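The recursive approach described above can be sketched roughly as follows. This is a minimal illustration, not Dust's implementation: `count_tokens` is a word-count stand-in for a real tokenizer (e.g. tiktoken), and `summarize` is a hypothetical placeholder for an LLM call.

```python
def count_tokens(text: str) -> int:
    # Stand-in: word count as a rough proxy for token count.
    # A real implementation would use a tokenizer like tiktoken.
    return len(text.split())


def chunk_by_tokens(texts: list[str], max_tokens: int = 4000) -> list[str]:
    """Group texts into chunks whose approximate token count stays under max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for t in texts:
        n = count_tokens(t)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(t)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize(text: str) -> str:
    # Hypothetical placeholder: a real version would call an LLM.
    return text[: max(1, len(text) // 2)]


def recursive_summarize(texts: list[str], max_tokens: int = 4000) -> str:
    """Chunk, summarize each chunk, and repeat until one chunk remains."""
    chunks = chunk_by_tokens(texts, max_tokens)
    while len(chunks) > 1:
        summaries = [summarize(c) for c in chunks]
        chunks = chunk_by_tokens(summaries, max_tokens)
    return chunks[0]
```

The loop converges because each pass replaces chunks with shorter summaries, so the total token count shrinks until everything fits in a single chunk.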

spolu commented 1 year ago

Yes, this is definitely on our radar. It is likely that we will expose these functions as part of code blocks in the near future :+1: Will keep this issue open to track progress.

cmirdesouza commented 1 year ago

In Dust.tt, you can use JSONL for your datasets. I use tokenizer functions to write my JSONL files with X tokens per line, and then Dust.tt does the rest. Regarding tokenizers, here are some library suggestions: gpt-tokenizer, tiktoken, and gpt-3-encoder. If you'd like code examples for any of these, please leave a comment and I will provide more details.
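A minimal sketch of the JSONL-writing step described above. The schema (`text`/`tokens` fields) and the `count_tokens` word-count stand-in are illustrative assumptions, not Dust's dataset format; a real version would count tokens with one of the libraries mentioned.

```python
import json


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer such as tiktoken or gpt-tokenizer.
    return len(text.split())


def write_jsonl(texts: list[str], path: str, max_tokens: int = 4000) -> None:
    """Write one JSON object per line, splitting any text that exceeds max_tokens."""
    with open(path, "w") as f:
        for text in texts:
            words = text.split()
            # Split on word boundaries so no line exceeds the token budget.
            for i in range(0, len(words), max_tokens):
                piece = " ".join(words[i : i + max_tokens])
                record = {"text": piece, "tokens": count_tokens(piece)}
                f.write(json.dumps(record) + "\n")
```

Each output line is then small enough to feed into a single summarization call.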

I recommend the following article, which gives an in-depth explanation of how to achieve effective recursive summarization. Although it's slightly different from what you asked for, it's a good starting point: 'In summary, our results show that combining recursive task decomposition with learning from human feedback can be a practical approach to scalable oversight for difficult long-document NLP tasks.' (Recursively Summarizing Books with Human Feedback)