dgarnitz / vectorflow

VectorFlow is a high-volume vector embedding pipeline that ingests raw data, transforms it into vectors, and writes it to a vector DB of your choice.
https://www.getvectorflow.com/
Apache License 2.0

Research whether `extract_for_token_limit` needs to be updated to support 1106 models. #98

Closed 6 months ago

leohpark commented 10 months ago

Hello from LinkedIn!

I noticed that the following function appears to assume the user will use one of `gpt-4`, `gpt-4-32k`, or `gpt-3.5-turbo-16k`, makes some assumptions about model context limits otherwise (defaulting to gpt-4's 8192), and returns a portion of `document` as a function of `remaining_tokens`. Assuming people are using the 1106 models, is this function still doing what was intended?

    def extract_for_token_limit(self, document, questions):
        encoding = tiktoken.encoding_for_model(self.model)
        question_string = ",".join(questions)
        questions_count = len(encoding.encode(question_string))
        user_prompt_count = len(encoding.encode(self.usecase_enhancement_user_prompt))
        system_prompt_count = len(encoding.encode(self.usecase_enhancement_system_prompt))
        extra_count = len(encoding.encode("'role', 'system', 'content', 'role', 'user', 'content'"))
        token_limit = 8192

        if "16k" in self.model:
            token_limit = 16384
        elif "32k" in self.model:
            token_limit = 32768

        remaining_tokens = token_limit - (questions_count + user_prompt_count + system_prompt_count + extra_count)
        document_encoding = encoding.encode(document)
        if len(document_encoding) <= remaining_tokens:
            return document

        #return encoding.decode(document_encoding[:remaining_tokens])
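        # falls back to slicing by characters, assuming roughly 3 characters per token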
        return document[:remaining_tokens*3]
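
For context, the 1106 models have different context windows (per OpenAI's published limits, gpt-4-1106-preview is 128,000 tokens and gpt-3.5-turbo-1106 is 16,385), so neither the "16k" nor the "32k" substring check matches them and both fall through to the 8192 default. A minimal sketch of an explicit per-model lookup (`MODEL_TOKEN_LIMITS` and `token_limit_for` are illustrative names, not part of vectorflow):

    # Sketch only: an explicit per-model lookup instead of substring checks.
    # Existing entries mirror the limits in extract_for_token_limit above;
    # the 1106 entries use OpenAI's published context windows.
    MODEL_TOKEN_LIMITS = {
        "gpt-4": 8192,
        "gpt-4-32k": 32768,
        "gpt-3.5-turbo-16k": 16384,
        "gpt-4-1106-preview": 128000,  # 128k context window
        "gpt-3.5-turbo-1106": 16385,   # 16k context window
    }

    def token_limit_for(model: str, default: int = 8192) -> int:
        # Unknown models fall back to the gpt-4 limit, matching the
        # current default behavior of extract_for_token_limit.
        return MODEL_TOKEN_LIMITS.get(model, default)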

Additionally, I would suggest using a proper text-splitting tool to return a more precise slice of the document based on actual token count. There are examples at both extremes where the characters-per-token ratio is nowhere near the assumed values of 3 and 4; see the sketch below.

[attached image: examples of extreme token-to-character ratios]
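
To make that concrete, here is a rough sketch of slicing on real token boundaries (assuming tiktoken is available; `truncate_to_tokens` is a hypothetical helper, and decoding a token prefix is essentially the commented-out line in the function above):

    import tiktoken

    def truncate_to_tokens(text: str, model: str, max_tokens: int) -> str:
        # Slice on actual token boundaries instead of assuming ~3 chars/token.
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        return encoding.decode(tokens[:max_tokens])

    # The characters-per-token ratio varies widely across inputs, so a
    # fixed multiplier can badly over- or under-shoot the real limit:
    enc = tiktoken.encoding_for_model("gpt-4")
    english = "the quick brown fox jumps over the lazy dog"
    cjk = "向量嵌入流水线将原始数据转换为向量"
    print(len(english) / len(enc.encode(english)))  # roughly 4-5 chars per token
    print(len(cjk) / len(enc.encode(cjk)))          # often around 1 char per token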

dgarnitz commented 10 months ago

Hey, thanks for commenting. To answer a few of your concerns:

leohpark commented 10 months ago

Thanks for the response, that makes sense. I'm pretty sure I've run into tiktoken bugs too, and they nearly drove me mad.

dgarnitz commented 6 months ago

It seems like there isn't any pressing need on this, so I am closing it out.