deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0

unstructured: metadata get mixed up #331

Closed lambda-science closed 7 months ago

lambda-science commented 7 months ago

Describe the bug

When indexing files with the Unstructured component, metadata gets mixed up between files. For example, with this data:

    files = ["test/samples/sample1.pdf", "test/samples/sample2.pdf", "test/samples/sample3.pdf"]
    meta = [
        {"meta1": "value1", "source_test": "pytest_api"},
        {"meta2": "value2", "source_test": "pytest_api"},
        {"meta3": "value3", "source_test": "pytest_api"},
    ]

This data is then converted to JSON to call my API, and I get results like this:

    {
        "id": "b89ce3cc7ed9839459d1606018cf6beb720df0424515cd4cc9442b51970f72b8",
        "content": "blablablalba",
        "meta": {
            "meta2": "value2",
            "source_test": "pytest_api",
            "filename": "5f27_sample1.pdf",
            "s3_key": "bc6c_sample2.pdf",
            "file_path": "C:\\Users\\cmeyer\\code-project\\llm-ale-chatbot\\haystack_api\\rest_api\\file-upload\\5f27_sample1.pdf",
            "languages": [
                "fra"
            ],
            "page_number": 1,
            "filetype": "application/pdf",
            "category": "UncategorizedText"
        },
        "score": 0.0
    },

Here the content corresponds to the right file_path and filename (generated by Unstructured), BUT my CUSTOM metadata, which is not generated by Unstructured processing, is mixed up: meta2 and s3_key belong to sample2.pdf, not sample1.pdf!
Sorry for the not very reproducible example. I'm just writing to ask whether anyone has already had a similar issue. I'll write a detailed report this weekend. This doesn't happen with PyPDF, so it's weird.


lambda-science commented 7 months ago

This might be related to an error in my implementation of the metadata field here: https://github.com/deepset-ai/haystack-core-integrations/pull/242 I wonder where it could come from...

anakin87 commented 7 months ago

Waiting for your detailed report to have a proper look. Thanks!

lambda-science commented 7 months ago

> Waiting for your detailed report to have a proper look. Thanks!

Coming back with news, @anakin87.
I identified the bug. It is in:

    @component.output_types(documents=List[Document])
    def run(
        self,
        paths: Union[List[str], List[os.PathLike]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Convert files to Haystack Documents using the Unstructured API (hosted or running locally).

        :param paths: List of paths to convert. Paths can be files or directories.
            If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
        :param meta: Optional metadata to attach to the Documents.
          This value can be either a list of dictionaries or a single dictionary.
          If it's a single dictionary, its content is added to the metadata of all produced Documents.
          If it's a list, the length of the list must match the number of paths, because the two lists will be zipped.
          Please note that if the paths contain directories, the length of the meta list must match
          the actual number of files contained.
          Defaults to `None`.
        """
        unique_paths = {Path(path) for path in paths}
        filepaths = {path for path in unique_paths if path.is_file()}
        filepaths_in_directories = {
            filepath for path in unique_paths if path.is_dir() for filepath in path.glob("*.*") if filepath.is_file()
        }

        all_filepaths = filepaths.union(filepaths_in_directories)
        # currently, the files are converted sequentially to gently handle API failures
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(all_filepaths))

We use a set here, unique_paths = {Path(path) for path in paths}, and in Python sets are not ordered. After converting the file paths to a set, the order of the file paths no longer matches the order of the metadata list, so metadata gets attached to the wrong files. Maybe we should change the logic here to not use a set? I will try to think of a solution.
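To illustrate the mismatch, here is a minimal standalone sketch (not the converter code itself). It reuses the sample paths from the report with simplified metadata, and contrasts the set-based approach with an order-preserving deduplication via dict.fromkeys, which is just one possible alternative and an assumption on my side:

```python
from pathlib import Path

paths = ["test/samples/sample1.pdf", "test/samples/sample2.pdf", "test/samples/sample3.pdf"]
meta = [{"meta1": "value1"}, {"meta2": "value2"}, {"meta3": "value3"}]

# Set-based approach: iteration order of a set is not guaranteed,
# so zipping it with `meta` may pair a file with another file's metadata.
unordered = {Path(p) for p in paths}
maybe_misaligned = list(zip(unordered, meta))

# Order-preserving alternative: dict keys keep insertion order (Python 3.7+),
# so duplicates are dropped while the caller's order is kept.
ordered = list(dict.fromkeys(Path(p) for p in paths))
aligned = list(zip(ordered, meta))

for path, path_meta in aligned:
    print(path.name, path_meta)
```

With the ordered list, each path is always paired with the metadata given at the same index. With the set, the pairing can change from one run to the next.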

lambda-science commented 7 months ago

And actually, I'm not sure why we need set logic here to make file paths unique. I feel like it's up to the user to provide unique paths. For example, what happens if a user provides 10 paths and 10 metadata dictionaries, but some paths are duplicated, so we end up with 8 file paths and 10 metadata entries? I guess normalize_metadata will raise an error.
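A small sketch of that scenario. Note that normalize_metadata_sketch below is a simplified stand-in I wrote for illustration, assuming the real helper rejects a list whose length differs from the number of sources; it is not the library's implementation, and the file names are made up:

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


def normalize_metadata_sketch(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]], sources_count: int
) -> List[Dict[str, Any]]:
    """Simplified stand-in (assumption): one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        return [meta] * sources_count
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return meta


# 10 paths, 2 of which are duplicates, and 10 metadata dictionaries.
paths = [f"docs/file{i}.pdf" for i in range(8)] + ["docs/file0.pdf", "docs/file1.pdf"]
meta = [{"index": i} for i in range(len(paths))]

unique_paths = {Path(p) for p in paths}  # only 8 entries survive

try:
    normalize_metadata_sketch(meta, sources_count=len(unique_paths))
    error = None
except ValueError as exc:
    error = str(exc)

print(len(paths), len(unique_paths), error)
```

So under that assumption, the user passes lists of matching length and still gets a length error, which is confusing.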

What I think is that:

  1. We can support directories as paths, BUT then the metadata should have a length of at most 1 (the same metadata for all files in the directory), because it's not clear how path.glob() orders files, which leads to confusion about which metadata goes with which file.
  2. If direct paths to files are provided, don't make them unique with sets.
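
The two points above could be sketched roughly as follows. collect_filepaths is a hypothetical helper name, and sorting the directory contents is an extra assumption of mine to make the order reproducible; this is not necessarily how the fix was implemented:

```python
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Union


def collect_filepaths(
    paths: Sequence[Union[str, os.PathLike]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Path]:
    """Hypothetical helper: keep the caller's order and do not deduplicate."""
    resolved = [Path(p) for p in paths]
    # Point 1: with directories, only a single dict (or None) is unambiguous.
    if isinstance(meta, list) and any(p.is_dir() for p in resolved):
        raise ValueError("When a directory is passed, meta must be a single dictionary or None.")
    filepaths: List[Path] = []
    for path in resolved:
        if path.is_file():
            # Point 2: direct file paths are kept as given, duplicates included.
            filepaths.append(path)
        elif path.is_dir():
            # Sorted so the order is reproducible (assumption, not from the thread).
            filepaths.extend(sorted(f for f in path.glob("*.*") if f.is_file()))
    return filepaths


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.pdf", "a.pdf"):
        (root / name).write_text("placeholder")

    direct = [p.name for p in collect_filepaths([root / "b.pdf", root / "a.pdf", root / "b.pdf"])]
    from_dir = [p.name for p in collect_filepaths([root], meta={"source_test": "pytest_api"})]
    try:
        collect_filepaths([root], meta=[{"meta1": "value1"}, {"meta2": "value2"}])
        rejected = False
    except ValueError:
        rejected = True

print(direct, from_dir, rejected)
```

Keeping the input order means a list of metadata can be zipped with the file paths safely, at the cost of processing a file twice if the user passes it twice.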

anakin87 commented 7 months ago

Released a new version with the bugfix: https://pypi.org/project/unstructured-fileconverter-haystack/0.3.1/