Closed: HanClinto closed this issue 11 months ago
I think the tokenizer in the embedding model processes input at the word level, not at the character level. We can validate by comparing the total word count in the DB against the source text.
Oooh, that's a really great point. Thank you!
PR https://github.com/dssjon/biblos/pull/14/files adds rough logic to compare word counts from the source text against the embedded DB, confirming that no text was truncated.
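The comparison in that PR amounts to something like the following sketch (function and variable names here are hypothetical, not the actual names used in the PR):

```python
# Hypothetical sketch: verify no truncation by comparing total word counts
# between the source text chunks and the chunks stored in the embedded DB.
def word_count(text: str) -> int:
    # Simple whitespace split; matches a "word level" count, not tokens.
    return len(text.split())

def no_truncation(source_chunks, stored_chunks) -> bool:
    # If any stored chunk was cut short, the DB total will be lower.
    src_total = sum(word_count(c) for c in source_chunks)
    db_total = sum(word_count(c) for c in stored_chunks)
    return db_total == src_total

print(no_truncation(["in the beginning"], ["in the beginning"]))  # True
print(no_truncation(["in the beginning"], ["in the"]))            # False
```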
Looks like any text inputs longer than 512 tokens are truncated:
https://github.com/xlang-ai/instructor-embedding/issues/72
We're currently passing 1000 to our text chunker, so we may need to lower this or risk truncating text.
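A cheap pre-check (just a sketch, all names hypothetical) is to flag any chunk whose whitespace word count alone already exceeds 512. Since tokenizers split words into subwords, the true token count is at least the word count, so this is only a conservative lower bound, and a chunk passing this check could still exceed 512 tokens:

```python
# instructor-embedding's default max_seq_length, per the linked issue.
MAX_SEQ_LENGTH = 512

def chunks_at_risk(chunks, limit=MAX_SEQ_LENGTH):
    # Flag chunk indices whose word count alone exceeds the token limit.
    # (Word count <= token count, so these are guaranteed to be truncated.)
    return [i for i, c in enumerate(chunks) if len(c.split()) > limit]

long_chunk = " ".join(["word"] * 1000)   # mirrors our current 1000 setting
short_chunk = " ".join(["word"] * 300)
print(chunks_at_risk([long_chunk, short_chunk]))  # -> [0]
```

For an exact check we'd need to count with the model's own tokenizer rather than whitespace words.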