Azure / gpt-rag-ingestion

MIT License

Fix truncating #79

Closed: placerda closed this 2 months ago

placerda commented 2 months ago

Text processing improvement:

text_embedder.py: Modified the clean_text function to change how text is truncated when it exceeds the token limit. Instead of removing one character per iteration, the function now removes a variable number of characters: the step starts at one, doubles every five iterations, and is capped at 100 characters per iteration. This should make truncation faster for long texts, since far fewer token-count checks are needed to get under the limit.
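
The truncation schedule described above can be sketched roughly as follows. This is a minimal illustration, not the code from text_embedder.py: the function name truncate_to_token_limit, its parameters, and the whitespace-based count_tokens_stub are all hypothetical, and the real clean_text presumably counts tokens with an actual tokenizer.

```python
from typing import Callable


def count_tokens_stub(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer:
    # counts one token per whitespace-separated word.
    return len(text.split())


def truncate_to_token_limit(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_stub,
    max_step: int = 100,    # cap on characters removed per iteration
    double_every: int = 5,  # iterations between step doublings
) -> str:
    """Trim characters from the end of text until it fits within max_tokens."""
    step = 1
    iterations = 0
    while text and count_tokens(text) > max_tokens:
        # Drop the last `step` characters (the whole string if it is shorter).
        text = text[:-step]
        iterations += 1
        # Every `double_every` iterations, double the step up to the cap.
        if iterations % double_every == 0:
            step = min(step * 2, max_step)
    return text


if __name__ == "__main__":
    long_text = "word " * 1000
    shortened = truncate_to_token_limit(long_text, max_tokens=10)
    print(count_tokens_stub(shortened) <= 10)
```

One trade-off of this kind of schedule is that a larger step can remove slightly more text than strictly necessary, by at most one step's worth of characters, in exchange for fewer tokenizer calls.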