NuiMrme opened 1 week ago
Same issue here. Upon further investigation, it appears the issue is related to empty strings present in the input array.
As stated in the API Reference under embeddings-create-input:

> Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002), cannot be an empty string, and any array must be 2048 dimensions or less.
After manually removing the empty strings from the payload generated by open-parse, everything functioned as expected:
This should probably be included in the inner validations of open-parse.
@NuiMrme in the meantime, I recommend sanitizing your input array to ensure that no empty nodes are sent:
```python
# annotate the document with similarity scores
sanitized_nodes = [node for node in parsed_content.nodes if node.text.strip() != ""]  # remove empty nodes
annotations = get_node_similarities(sanitized_nodes)
doc.display_with_bboxes(
    sanitized_nodes, annotations=annotations
)
```
I think it is even more than empty strings. I remember trying this sanitization, but it didn't solve the issue. My text contains a whole lot of different symbols. I solved the problem by getting rid of the batching mechanism entirely; sometimes the cut-off results in a bad string. To be honest, I don't know the purpose of batching here if we already have our chunks. If it serves no purpose and causes problems, maybe it should be considered for deletion?
That's strange. This should not be happening, because we have a step called RemoveNodesBelowNTokens that filters out Nodes with fewer than 50 tokens. Does someone have a sample pdf?
As I mentioned earlier, I don't think it is about empty strings but about bad strings produced when the batching cuts the string at a position with a special symbol or something. Removing the batches solved the issue completely. The question is: why do we need batches to begin with?
Without batching you'll end up with a bunch of small, fragmented nodes. So a heading will be separated from the paragraph, bulleted lists will be split apart, etc.
With that said, you can disable all transforms by passing an empty pipeline:

```python
parser = openparse.DocumentParser(processing_pipeline=None)
```
I'm talking about this batching here, the fixed-size 256 batch; it doesn't help the case, and bulleted lists will be divided too:
```python
def embed_many(self, texts: List[str]) -> List[List[float]]:
    """
    Generate embeddings for a list of texts in batches.

    Batch size is taken from self.batch_size.

    Args:
        texts (List[str]): The list of texts to embed.

    Returns:
        List[List[float]]: A list of embeddings.
    """
    res = []
    for i in range(0, len(texts), self.batch_size):
        batch_texts = texts[i : i + self.batch_size]
        api_resp = self.client.embeddings.create(
            input=batch_texts, model=self.model
        )
        batch_res = [val.embedding for val in api_resp.data]
        res.extend(batch_res)
    return res
```
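For reference, the loop above only slices the *list* of texts into sublists of `batch_size`; it never cuts into the individual strings themselves. A minimal sketch of the same slicing, with no API calls (the helper name is mine, for illustration only):

```python
from typing import List


def make_batches(texts: List[str], batch_size: int) -> List[List[str]]:
    """Slice a list of texts into consecutive batches,
    mirroring the indexing in embed_many above."""
    return [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
```

Flattening the batches back out yields the original strings unchanged, so any "bad string" must already be present in the input list rather than be created by the batching step.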
Batching speeds up requesting embeddings roughly 250x. This issue is upstream: if you give the OpenAI API null strings, it will error out. This function just passes data to OpenAI; it's actually from llama-index.
Same error with langchain on `parsed_document = parser.parse(file_path, ocr=True)`.
This is an OpenParse problem. The thing is, it worked at first, then all of a sudden the error is reproduced.
Initial Checks
Description
This is basically almost the same as the one in the examples; it comes back with an OpenAI error:

```
BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
```
Example Code
No response
Python, open-parse & OS Version