dssjon / biblos

www.biblos.app

Max INSTRUCTOR length appears to be 512 #12

Closed: HanClinto closed this issue 11 months ago

HanClinto commented 11 months ago

Looks like any text input longer than 512 is truncated:

https://github.com/xlang-ai/instructor-embedding/issues/72

We're currently passing 1000 to our text chunker, so we may need to lower this or risk truncating text.

dssjon commented 11 months ago

I think the tokenizer in the embedding model processes input at the word level, not at the character level, so the 512 limit would not be counted in characters. We can compare the total word count in the DB against the source text to validate.
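As a rough illustration of the difference between the two units, here is a minimal sketch that splits text into 1000-character chunks and reports the largest word count in any chunk. The helper names (`chunk_by_chars`, `max_words_per_chunk`) and the sample text are hypothetical and not taken from the biblos code, and whitespace-separated words are only a proxy for what the model's tokenizer actually counts.

```python
MAX_SEQ_LENGTH = 512  # limit reported in the linked INSTRUCTOR issue


def chunk_by_chars(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def max_words_per_chunk(text: str, chunk_size: int = 1000) -> int:
    """Return the largest whitespace-separated word count among the chunks."""
    chunks = chunk_by_chars(text, chunk_size)
    return max((len(chunk.split()) for chunk in chunks), default=0)


# Hypothetical sample text, repeated to span several chunks.
sample = "In the beginning God created the heaven and the earth. " * 100
worst = max_words_per_chunk(sample, chunk_size=1000)
print(worst, worst <= MAX_SEQ_LENGTH)
```

A 1000-character chunk can hold at most 500 whitespace-separated words, so it stays under 512 by this measure. A subword tokenizer may produce more tokens than words, so this is a lower bound rather than a guarantee.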

HanClinto commented 11 months ago

Oooh, that's a really great point. Thank you!

dssjon commented 11 months ago

PR https://github.com/dssjon/biblos/pull/14/files adds rough logic to compare word counts from the source text to those in the embedded DB. The comparison shows no text truncation.
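For readers who want to try a similar check, the following is a minimal sketch of a word-count comparison. It is not the code from the PR: `count_words`, `compare_word_counts`, and the sample chunks are hypothetical, and it assumes the stored chunks are available as plain strings that do not overlap.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())


def compare_word_counts(source_text: str, stored_chunks: list[str]) -> tuple[int, int, int]:
    """Return (source total, stored total, missing words).

    Assumes non-overlapping chunks split on word boundaries; with
    overlapping chunks the stored total would exceed the source total.
    """
    source_total = count_words(source_text)
    stored_total = sum(count_words(chunk) for chunk in stored_chunks)
    return source_total, stored_total, source_total - stored_total


source = "one two three four five six"
intact = ["one two three", "four five six"]
truncated = ["one two", "four five six"]  # simulated loss of one word

print(compare_word_counts(source, intact))
print(compare_word_counts(source, truncated))
```

A difference of zero suggests that no words were lost between the source and the stored text. A positive difference points to truncation somewhere in the pipeline.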