Unstructured-IO / unstructured-ingest

Apache License 2.0
16 stars 19 forks source link

bug/bedrock-embedding-encoder-empty-text #55

Open jaredmcqueen opened 3 months ago

jaredmcqueen commented 3 months ago

Describe the bug bedrock embed_documents will raise an error if you pass it an element with empty text.

Call stack:
  File "<string>", line 1, in <module>
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/jared/Projects/fireflower/queue/.venv/lib/python3.12/site-packages/unstructured/ingest/pipeline/reformat/embedding.py", line 61, in run
    logger.error(f"failed to embed content from file {elements_json}, {e}", exc_info=True)
Message: 'failed to embed content from file /Users/jared/.cache/unstructured/ingest/pipeline/chunked/b68b507cdffefba9c4c7e1ab19240c72.json, Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: expected minLength: 1, actual: 0, please reformat your input and try again.'

Expected behavior bedrock embedding will skip over elements with empty data

Fix

This class method needs to add a check for an empty element.text field

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        embeddings = self.bedrock_client.embed_documents([str(e) for e in elements])
        elements_with_embeddings = self._add_embeddings_to_elements(elements, embeddings)
        return elements_with_embeddings

to something like this:

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        # Filter out elements with empty text
        non_empty_elements = [e for e in elements if e.text.strip()]

        # Only embed non-empty elements
        embeddings = self.bedrock_client.embed_documents(
            [str(e) for e in non_empty_elements]
        )
        elements_with_embeddings = self._add_embeddings_to_elements(
            non_empty_elements, embeddings
        )

        # Combine elements with embeddings and those without (empty text)
        result = elements_with_embeddings + [e for e in elements if not e.text.strip()]
        return result
scanny commented 3 months ago

@jaredmcqueen what elements are you getting that have no text?

A PageBreak element will routinely have no text, but if you're seeing others or elements that contain only whitespace we'll want to fix those upstream.