
Embedding length calculation errors #108

Open uglyrobot opened 1 year ago

uglyrobot commented 1 year ago

I'm getting this when converting a nakedlibrary (split on newlines):

openai.Embedding.create error: This model's maximum context length is 8191 tokens, however you requested 9501 tokens (9501 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
Retrying in 20 seconds ... (many times)
Fetching token_count for 2b66e37ecae038884e0f3e825fe13349c7ba16f03a50fb428015667cac0cdb6c
Token indices sequence length is longer than the specified maximum sequence length for this model (9956 > 1024). Running this sequence through the model will result in indexing errors
Saving prematurely due to crash:  2b66e37ecae038884e0f3e825fe13349c7ba16f03a50fb428015667cac0cdb6c had the wrong length of embedding, expected 1536
Traceback (most recent call last):
  File "/Users/aaron/Documents/polymath/convert/main.py", line 157, in <module>
    result.insert_bit(bit)
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 722, in insert_bit
    bit._set_library(self)
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 216, in _set_library
    self.validate()
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 179, in validate
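For context on the numbers in that log: the embedding model caps input at 8191 tokens, and the failing chunk tokenized to 9501. A minimal sketch of pre-checking a chunk's token count before calling the API, assuming the tiktoken library (illustrative only, not polymath's actual validation code):

```python
import tiktoken

MAX_TOKENS = 8191  # context limit reported by the API error above

def fits_context(chunk: str, model: str = "text-embedding-ada-002") -> bool:
    # text-embedding-ada-002 is an assumption here; it matches the
    # 8191-token limit and 1536-dim embedding mentioned in the log.
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(chunk)) <= MAX_TOKENS
```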
dglazkov commented 1 year ago

Yikes! Yeah, we probably need to talk about sizes and limits for the nakedlibrary. Or maybe nakedlibrary import needs to use chunker?

As a way out of this particular situation, I would highly recommend running the content through chunker first to get the right-sized chunks.

The example usage is here: https://github.com/dglazkov/polymath/blob/main/convert/markdown.py#L59
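For readers without the repo handy, a schematic version of that chunk-then-embed flow (polymath's real chunker is in the file linked above; this naive stand-in just slices the token stream into fixed windows, assuming tiktoken and the pre-1.0 openai client shown in the log):

```python
import openai
import tiktoken

MODEL = "text-embedding-ada-002"  # assumed; matches the 8191-token /
                                  # 1536-dim figures in the log above
MAX_TOKENS = 8191

def chunk_text(text: str) -> list[str]:
    # Naive stand-in for polymath's chunker: slice the token stream
    # into windows that each fit the model's context limit.
    encoding = tiktoken.encoding_for_model(MODEL)
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[i:i + MAX_TOKENS])
            for i in range(0, len(tokens), MAX_TOKENS)]

def embed_chunks(text: str) -> list[list[float]]:
    # Every chunk now fits the context window, so the
    # openai.Embedding.create call no longer raises.
    return [
        openai.Embedding.create(model=MODEL, input=chunk)["data"][0]["embedding"]
        for chunk in chunk_text(text)
    ]
```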

jkomoros commented 1 year ago

Hmmm, the nakedlibrary importer does run the text through generate_chunks.

@uglyrobot I'm guessing that one of your chunks of text is a single line that is extraordinarily long? Can you confirm?

@dglazkov that implies to me that generate_chunks should forcibly break up content that is very long into multiple chunks: perhaps breaking at sentence boundaries first, failing that at a word boundary, and failing that just hard-breaking it in the middle of a run of characters?
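A minimal sketch of that fallback strategy (not polymath's generate_chunks; max_len is measured in characters for simplicity, where a real version would count tokens):

```python
import re

def force_split(text: str, max_len: int) -> list[str]:
    """Split text so every chunk is at most max_len characters."""
    if len(text) <= max_len:
        return [text]
    # 1. Prefer sentence boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    if len(sentences) > 1:
        return _pack(sentences, max_len)
    # 2. Fall back to word boundaries.
    words = text.split(" ")
    if len(words) > 1:
        return _pack(words, max_len)
    # 3. Last resort: hard break mid-run of characters.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def _pack(pieces: list[str], max_len: int) -> list[str]:
    # Greedily rejoin pieces into chunks (with single spaces), recursing
    # on any piece that is itself longer than one chunk.
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(force_split(piece, max_len))
        elif not current:
            current = piece
        elif len(current) + 1 + len(piece) <= max_len:
            current += " " + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```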

uglyrobot commented 1 year ago

Yes, I broke it up on newlines. It was a long one, but more importantly, it didn't fail gracefully.

dglazkov commented 1 year ago

I'd love to see the input. Would you be up for sharing?