Closed lambda-science closed 7 months ago
Might be related to an error in my implementation of the metadata field here: https://github.com/deepset-ai/haystack-core-integrations/pull/242 Where could it come from ...
Waiting for your detailed report to have a proper look. Thanks!
Coming back with news @anakin87
Identification of the bug, in:
@component.output_types(documents=List[Document])
def run(
    self,
    paths: Union[List[str], List[os.PathLike]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
):
    """
    Convert files to Haystack Documents using the Unstructured API (hosted or running locally).

    :param paths: List of paths to convert. Paths can be files or directories.
        If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
    :param meta: Optional metadata to attach to the Documents.
        This value can be either a list of dictionaries or a single dictionary.
        If it's a single dictionary, its content is added to the metadata of all produced Documents.
        If it's a list, the length of the list must match the number of paths, because the two lists will be zipped.
        Please note that if the paths contain directories, the length of the meta list must match
        the actual number of files contained.
        Defaults to `None`.
    """
    unique_paths = {Path(path) for path in paths}
    filepaths = {path for path in unique_paths if path.is_file()}
    filepaths_in_directories = {
        filepath for path in unique_paths if path.is_dir() for filepath in path.glob("*.*") if filepath.is_file()
    }
    all_filepaths = filepaths.union(filepaths_in_directories)
    # currently, the files are converted sequentially to gently handle API failures
    documents = []
    meta_list = normalize_metadata(meta, sources_count=len(all_filepaths))
We use a set here (unique_paths = {Path(path) for path in paths}), and Python sets are not ordered. After converting our file paths to a set, the metadata order no longer corresponds to the file path order. This leads to metadata being attributed to the wrong file paths.
Maybe we should modify the logic here so that it does not use a set? I will try to think of a solution.
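One possible direction, as a minimal sketch only: expand the paths into a list in the caller's order, sort each directory listing so it is deterministic, and drop duplicates with dict.fromkeys, which keeps first-seen order. The helper name collect_filepaths_in_order is hypothetical and this is not necessarily how the fix was implemented in the integration.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List, Union


def collect_filepaths_in_order(paths: Iterable[Union[str, os.PathLike]]) -> List[Path]:
    """Hypothetical helper: expand paths into files while keeping the caller's order."""
    ordered: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            ordered.append(path)
        elif path.is_dir():
            # Sort the directory listing so the result does not depend on
            # whatever order glob() happens to yield entries in.
            ordered.extend(sorted(p for p in path.glob("*.*") if p.is_file()))
    # dict.fromkeys removes duplicates but preserves first-seen order, unlike set().
    return list(dict.fromkeys(ordered))


# Small demonstration: metadata stays paired with the path it was given for.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first, second = root / "b.txt", root / "a.txt"
    first.write_text("b")
    second.write_text("a")
    meta = [{"s3_key": "b"}, {"s3_key": "a"}]
    ordered = collect_filepaths_in_order([first, second, first])
    pairs = [(p.name, m["s3_key"]) for p, m in zip(ordered, meta)]
```

With a list instead of a set, zipping the paths with the metadata list pairs each file with the entry the caller supplied at the same position.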
And actually, I'm not sure why we need set logic here to make the file paths unique. I feel like it's up to the user to provide unique paths? For example, what happens if a user provides 10 paths and 10 metadata entries, but some file paths are duplicated, so we end up with 8 file paths and 10 metadata entries? It will raise an error from normalize_metadata, I guess.
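The count mismatch in that scenario can be sketched as follows. The helper meta_count_mismatch and the file names are hypothetical, and the snippet only counts entries; it does not call normalize_metadata or assume how it reacts.

```python
from pathlib import Path
from typing import Any, Dict, List


def meta_count_mismatch(paths: List[str], meta: List[Dict[str, Any]]) -> int:
    """Hypothetical check: metadata entries left over after set-based deduplication."""
    unique_paths = {Path(p) for p in paths}
    return len(meta) - len(unique_paths)


# 10 paths supplied by the user, but only 8 of them are distinct.
paths = [f"doc_{i}.pdf" for i in range(8)] + ["doc_0.pdf", "doc_1.pdf"]
# One metadata dictionary per supplied path, as the docstring asks for.
meta = [{"meta2": i} for i in range(len(paths))]
mismatch = meta_count_mismatch(paths, meta)
```

Here the user followed the documented contract (one metadata entry per path), yet deduplication leaves two metadata entries without a matching file.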
What I think is that path.glob() orders files (leading to metadata attribution confusion).
Released a new version with the bugfix: https://pypi.org/project/unstructured-fileconverter-haystack/0.3.1/
Describe the bug
When indexing files with the Unstructured component, metadata get mixed up (?). For example, with these data:
Then these data are converted to JSON to make a call to my API, and I get these results:
Here the content corresponds to the right file_path and filename (generated by Unstructured), BUT my CUSTOM metadata, which are not generated by Unstructured processing, are mixed up (meta2 and s3_key are wrong)!
Sorry for the not very reproducible example. I'm just writing to find out whether someone has already had a similar issue. I'll make a detailed report this weekend. This doesn't happen with PyPDF, so it's weird.
Describe your environment (please complete the following information):