Filimoa / open-parse

Improved file parsing for LLMs
https://filimoa.github.io/open-parse/
MIT License

BadRequestError: Error code: 400 #72

NuiMrme opened this issue 1 week ago

NuiMrme commented 1 week ago

Initial Checks

Description

```python
from openparse import DocumentParser, processing

semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=openai_api_key,
    model="text-embedding-3-large",
    min_tokens=2,
    max_tokens=4000,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",
    },
)
```

This is basically the same as the one in the examples, yet it comes back with an OpenAI error:

```
BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
```

Example Code

No response

Python, open-parse & OS Version

```
python_version: 3.12.3
operating_system: Linux
os_version: 6.8.0-45-generic
open-parse version: 0.6.0
install path: /home/develop/Desktop/RAG_BSS/.bss_iln_venv/lib/python3.12/site-packages/openparse
python version: 3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0]
platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.39
related packages: torchvision-0.19.1 tokenizers-0.20.0 PyMuPDF-1.24.11 pydantic-2.9.2 torch-2.4.1 transformers-4.45.2
```
leonardobaggio commented 1 week ago

Same issue here. Upon further investigation, it appears the issue is related to empty strings present in the input array.


As stated by the API reference (embeddings-create-input):

> Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002), cannot be an empty string, and any array must be 2048 dimensions or less.
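
For reference, here is a minimal standalone repro against the embeddings endpoint (a sketch; the API key is a placeholder) that triggers the same 400 without open-parse in the loop:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # placeholder key

# An empty string anywhere in the input array triggers the 400:
client.embeddings.create(
    input=["some text", ""],
    model="text-embedding-3-large",
)
# openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. ..."}}
```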

After manually removing the empty strings from the payload generated by open-parse, everything functioned as expected.
This should probably be handled by open-parse's internal validations.

@NuiMrme in the meantime, I recommend sanitizing your input array to ensure that no empty nodes are sent:

```python
# remove empty nodes before annotating the document with similarity scores
sanitized_nodes = [node for node in parsed_content.nodes if node.text.strip() != ""]
annotations = get_node_similarities(sanitized_nodes)
doc.display_with_bboxes(
    sanitized_nodes, annotations=annotations
)
```
NuiMrme commented 1 week ago

I think it is more than just empty strings. I remember trying this sanitization, and it didn't solve the issue; my text contains a whole lot of different symbols and such. I solved the problem by getting rid of the batching mechanism, since the cut-off sometimes results in a bad string. To be honest, I don't know the purpose of batching here if we already have our chunks. If it serves nothing and causes problems, maybe it should be considered for deletion?

Filimoa commented 1 week ago

That's strange. This shouldn't be happening, because we have a step called RemoveNodesBelowNTokens that filters out nodes with fewer than 50 tokens. Does someone have a sample PDF?

NuiMrme commented 1 week ago

As I mentioned earlier, I don't think it is about empty strings but about bad strings, produced when the batching cuts the text at a position with a special symbol or something. Removing the batches solved the issue completely. The question is why we need batches to begin with?

Filimoa commented 1 week ago

Without batching you'll end up with a bunch of small, fragmented nodes: a heading will be separated from its paragraph, bulleted lists will be split apart, etc.

That said, you can disable all transforms by passing an empty pipeline:

```python
parser = openparse.DocumentParser(processing_pipeline=None)
```
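
From there, one possible workaround (a sketch only; the file path and API key are placeholders) is to sanitize the raw nodes yourself and request embeddings directly:

```python
import openparse
from openai import OpenAI

# Parse with all transforms disabled, then sanitize and embed manually.
parser = openparse.DocumentParser(processing_pipeline=None)
doc = parser.parse("sample.pdf")  # placeholder path

# Drop empty or whitespace-only nodes before embedding.
texts = [node.text for node in doc.nodes if node.text.strip()]

# Note: the API caps a single request at 2048 inputs, so very large
# documents would still need to be split into multiple requests.
client = OpenAI(api_key="sk-...")  # placeholder key
resp = client.embeddings.create(input=texts, model="text-embedding-3-large")
embeddings = [item.embedding for item in resp.data]
```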
NuiMrme commented 1 week ago

I'm talking about this batching here, the fixed-size batch of 256; it doesn't help this case, and bulleted lists will be divided too:

```python
def embed_many(self, texts: List[str]) -> List[List[float]]:
    """
    Generate embeddings for a list of texts in batches.

    Args:
        texts (list[str]): The list of texts to embed.
        batch_size (int): The number of texts to process in each batch.

    Returns:
        List[List[float]]: A list of embeddings.
    """
    res = []
    for i in range(0, len(texts), self.batch_size):
        batch_texts = texts[i : i + self.batch_size]
        api_resp = self.client.embeddings.create(
            input=batch_texts, model=self.model
        )
        batch_res = [val.embedding for val in api_resp.data]
        res.extend(batch_res)

    return res
```
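
A defensive variant of this function (a sketch, not the library's actual code) would drop the inputs the endpoint rejects before each request:

```python
def embed_many_safe(self, texts: List[str]) -> List[List[float]]:
    """Like embed_many, but drops empty/whitespace-only strings,
    which the embeddings endpoint rejects with a 400.

    Note: dropping inputs breaks positional alignment with `texts`;
    callers that rely on it would need to map results back by index.
    """
    clean = [t for t in texts if t and t.strip()]
    res: List[List[float]] = []
    for i in range(0, len(clean), self.batch_size):
        batch_texts = clean[i : i + self.batch_size]
        api_resp = self.client.embeddings.create(
            input=batch_texts, model=self.model
        )
        res.extend(val.embedding for val in api_resp.data)
    return res
```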
Filimoa commented 1 week ago

Batching speeds up requesting embeddings roughly 250x. This issue is upstream: if you give the OpenAI API empty strings, it will error out. This function just passes data to OpenAI; it's actually from llama-index.

NuiMrme commented 55 minutes ago

Same error with langchain on `parsed_document = parser.parse(file_path, ocr=True)`, so this is an OpenParse problem. The thing is, it worked at first, then all of a sudden the error came back.
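
Since the failure seems to depend on document content, a minimal way to pinpoint the offending batch (a sketch; the key and model are placeholders) is to wrap the embeddings call and log suspect inputs from the failing request:

```python
from openai import BadRequestError, OpenAI

client = OpenAI(api_key="sk-...")  # placeholder key

def embed_batch_debug(batch_texts, model="text-embedding-3-large"):
    """Forward a batch to the embeddings endpoint, logging suspect
    inputs (empty or whitespace-only strings) if the request fails."""
    try:
        return client.embeddings.create(input=batch_texts, model=model)
    except BadRequestError:
        suspects = [repr(t) for t in batch_texts if not t or not t.strip()]
        print(f"Failing batch of {len(batch_texts)}; suspect inputs: {suspects}")
        raise
```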