deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Add meta field to `FileTypeRouter` #8465

Open bilgeyucel opened 1 day ago

bilgeyucel commented 1 day ago

Is your feature request related to a problem? Please describe.
When a preprocessing pipeline starts with `FileTypeRouter`, which is usually the case when we use multiple converters, it's not possible to provide meta information for the files.

Describe the solution you'd like
Let's add a `meta` input to `FileTypeRouter`. The component can then use the `ByteStream` dataclass to pass this information on to the converters.

Describe alternatives you've considered
Having a separate metadata output for each file type, for example `router.text/plain_meta`.

Additional context
The same issue was opened a year ago: #6392

cc: @silvanocerza

silvanocerza commented 1 day ago

Some additional context.

As of now, `FileTypeRouter` gives users no explicit way to pass additional metadata to the converters that receive the routed sources.

The `FileTypeRouter` type annotations are also wrong right now: it states that all its outputs are of type `List[Path]`, when they should actually be `List[Union[Path, ByteStream]]`. That is basically the type of its `sources` input, which also accepts `str`; internally, `str` values are converted to `Path` and returned that way.
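To make the type mismatch concrete, here is a minimal sketch of the normalization described above. It does not use Haystack itself: the `ByteStream` class is a simplified stand-in, and `normalize_sources`, `Source` and `Routed` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


@dataclass
class ByteStream:
    """Simplified stand-in for Haystack's ByteStream (assumption, not the real class)."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)
    mime_type: Optional[str] = None


# What the router accepts vs. what it can actually hand to converters.
Source = Union[str, Path, ByteStream]
Routed = Union[Path, ByteStream]


def normalize_sources(sources: List[Source]) -> List[Routed]:
    """Convert str entries to Path; Path and ByteStream pass through unchanged."""
    return [Path(s) if isinstance(s, str) else s for s in sources]


routed = normalize_sources(["report.pdf", Path("notes.txt"), ByteStream(b"hi")])
```

No `str` survives normalization, but a `ByteStream` does, so an annotation of `List[Path]` understates what a downstream converter may receive.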

I propose we fix the output type so that it correctly reflects the actual output. Additionally, we change `FileTypeRouter` to convert all input sources to `ByteStream` if the user sends any meta. That way we can route the files together with their meta without adding new outputs to the component. We must convert to `ByteStream` only if meta is received.
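The proposal could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual Haystack implementation: `ByteStream` is a simplified stand-in, `route_sources` is a hypothetical function, MIME types are guessed with the standard library's `mimetypes`, and the `"unclassified"` bucket name and the "one dict for all sources or one dict per source" shape of `meta` are assumptions.

```python
import mimetypes
import tempfile
from collections import defaultdict
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


@dataclass
class ByteStream:
    """Simplified stand-in for Haystack's ByteStream (assumption, not the real class)."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)
    mime_type: Optional[str] = None


Meta = Union[Dict[str, Any], List[Dict[str, Any]]]


def route_sources(
    sources: List[Union[str, Path, ByteStream]],
    meta: Optional[Meta] = None,
) -> Dict[str, List[Union[Path, ByteStream]]]:
    """Group sources by MIME type; wrap them in ByteStream only when meta is given."""
    if isinstance(meta, dict):
        # Assumption: a single dict applies to every source.
        meta = [meta] * len(sources)
    if meta is not None and len(meta) != len(sources):
        raise ValueError("meta must be a single dict or one dict per source")

    routed: Dict[str, List[Union[Path, ByteStream]]] = defaultdict(list)
    for index, source in enumerate(sources):
        if isinstance(source, str):
            source = Path(source)  # str is always normalized to Path

        if isinstance(source, ByteStream):
            mime_type = source.mime_type
        else:
            mime_type = mimetypes.guess_type(source.name)[0]

        if meta is not None:
            # Convert only when meta was received, so existing pipelines
            # that pass no meta keep getting Path objects.
            if isinstance(source, Path):
                source = ByteStream(source.read_bytes(), dict(meta[index]), mime_type)
            else:
                merged = {**source.meta, **meta[index]}
                source = ByteStream(source.data, merged, source.mime_type)

        routed[mime_type or "unclassified"].append(source)
    return dict(routed)


with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "notes.txt"
    txt.write_text("hello")
    without_meta = route_sources([str(txt)])
    with_meta = route_sources([str(txt)], meta={"author": "alice"})
```

Without `meta`, the `text/plain` output holds a `Path`, exactly as before. With `meta`, the same output holds a `ByteStream` carrying both the file content and the metadata, so no extra per-type meta outputs are needed.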