aurelio-labs / semantic-router

Superfast AI decision making and intelligent processing of multi-modal data.
https://www.aurelio.ai/semantic-router
MIT License
1.83k stars · 185 forks

encoding issues #351

Closed NaveenVinayakS closed 1 month ago

NaveenVinayakS commented 1 month ago

Hi all,

I am new to this semantic router. I am using the code below:

```python
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

encoder = OpenAIEncoder(name="text-embedding-3-small")

from semantic_router.splitters import RollingWindowSplitter
from semantic_router.utils.logger import logger

logger.setLevel("WARNING")  # reduce logs from splitter

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,
    min_split_tokens=100,
    max_split_tokens=500,
    window_size=2,
    plot_splits=True,        # set this to True to visualize chunking
    enable_statistics=True,  # to print chunking stats
)
```

I was under the impression that I am using OpenAI models for encoding, but then I explored the code of `tiktoken_length()`:

```python
def tiktoken_length(text: str) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)
```

Is it using OpenAI models or cl100k_base for encoding?
jamescalam commented 1 month ago

hi @NaveenVinayakS, tiktoken's cl100k_base is the tokenizer used by OpenAI's embedding models; we use it only to count the number of tokens going into a single embedding. The embeddings themselves are still produced by the OpenAI model you configured.

I would recommend swapping the RollingWindowSplitter here for the StatisticalChunker from the semantic-chunkers library; it is a much more optimized version of RollingWindowSplitter, particularly when used with async. You can find examples for it here:
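For intuition on what a rolling-window splitter does, here is a toy sketch (not the library's implementation): embed a sliding window of preceding sentences, and start a new chunk wherever the cosine similarity between that window and the next sentence dips below a threshold. The `fake_embed` function is a hypothetical stand-in for a real encoder.

```python
import math

def fake_embed(text: str) -> list[float]:
    # Toy stand-in for a real encoder: a character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rolling_window_split(sentences: list[str], window_size: int = 2,
                         threshold: float = 0.8) -> list[list[str]]:
    # Compare each sentence against the window of preceding sentences;
    # a similarity drop below the threshold marks a topic boundary.
    splits, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        window = " ".join(sentences[max(0, i - window_size):i])
        if cosine(fake_embed(window), fake_embed(sentences[i])) < threshold:
            splits.append(current)
            current = []
        current.append(sentences[i])
    splits.append(current)
    return splits
```

The real splitter additionally derives the threshold dynamically (`dynamic_threshold=True`) and enforces `min_split_tokens`/`max_split_tokens` per chunk, which this sketch omits.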

NaveenVinayakS commented 1 month ago

I understood how it's working, thanks for the response.