aurelio-labs / semantic-router

Superfast AI decision making and intelligent processing of multi-modal data.
https://www.aurelio.ai/semantic-router
MIT License
1.83k stars · 185 forks

encoding issues #351

Closed NaveenVinayakS closed 1 month ago

NaveenVinayakS commented 1 month ago

Hi all,

I am new to this semantic router. I am using the code below:

```python
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

encoder = OpenAIEncoder(name="text-embedding-3-small")

from semantic_router.splitters import RollingWindowSplitter
from semantic_router.utils.logger import logger

logger.setLevel("WARNING")  # reduce logs from splitter

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,
    min_split_tokens=100,
    max_split_tokens=500,
    window_size=2,
    plot_splits=True,        # set this to True to visualize chunking
    enable_statistics=True,  # to print chunking stats
)
```

I was under the impression that I am using OpenAI models for encoding, but then I explored the code of `tiktoken_length()`:

```python
def tiktoken_length(text: str) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)
```

Is it using OpenAI models or cl100k_base for encoding?
jamescalam commented 1 month ago

hi @NaveenVinayakS, tiktoken's cl100k_base is the tokenizer used by OpenAI's embedding models; we use it only to count the number of tokens going into a single embedding. The embeddings themselves are still produced by the OpenAI model you configured.

I would recommend swapping the RollingWindowSplitter here for the StatisticalChunker from the semantic-chunkers library; it is a much more optimized version of RollingWindowSplitter, particularly when used with async. You can find examples for it here:
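For intuition on what a rolling-window splitter does, here is a toy sketch (not the library's implementation): embed a sliding window of preceding sentences, and start a new chunk wherever the cosine similarity between that window and the next sentence dips below a threshold. The `fake_embed` function is a hypothetical stand-in for a real encoder.

```python
import math

def fake_embed(text: str) -> list[float]:
    # Toy stand-in for a real encoder: a character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rolling_window_split(sentences: list[str], window_size: int = 2,
                         threshold: float = 0.8) -> list[list[str]]:
    # Compare each sentence against the window of preceding sentences;
    # a similarity drop below the threshold marks a topic boundary.
    splits, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        window = " ".join(sentences[max(0, i - window_size):i])
        if cosine(fake_embed(window), fake_embed(sentences[i])) < threshold:
            splits.append(current)
            current = []
        current.append(sentences[i])
    splits.append(current)
    return splits
```

The real splitter additionally derives the threshold dynamically (`dynamic_threshold=True`) and enforces `min_split_tokens`/`max_split_tokens` per chunk, which this sketch omits.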

NaveenVinayakS commented 1 month ago

I understood how it's working, thanks for the response.