deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
16.81k stars, 1.84k forks

`convert_files_to_docs` for a list of filepaths not `dir_path` #5616

Closed by DanShatford 11 months ago

DanShatford commented 1 year ago

Is your feature request related to a problem? Please describe.

I have a DocumentSearchPipeline built from files. At query time, I would like to process files in the same way convert_files_to_docs does at index time, though usually far fewer of them.

The convert_files_to_docs function only allows me to input a dir_path. I would like to be able to input a list of file paths with different extensions.

Describe the solution you'd like

A utility function or generic FileConverter that would allow me to process documents in the same way both when indexing and when querying.
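To illustrate the kind of utility meant here, a rough sketch follows. It is not Haystack code: `convert_paths`, `read_text_file`, and the dict-shaped "document" are all hypothetical names made up for this example. The idea is only that one function, given explicit paths and a suffix-to-converter mapping, could be shared between indexing and querying.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical converter type: takes a path, returns a plain dict "document".
Converter = Callable[[Path], dict]


def read_text_file(path: Path) -> dict:
    """Stand-in converter for illustration; a real one would parse PDF, DOCX, etc."""
    return {"content": path.read_text(encoding="utf-8"), "meta": {"name": path.name}}


def convert_paths(paths: Iterable[Path], converters: Dict[str, Converter]) -> List[dict]:
    """Convert an explicit list of files, picking a converter by file suffix."""
    docs = []
    for path in map(Path, paths):
        converter = converters.get(path.suffix.lower())
        if converter is None:
            continue  # no converter registered for this suffix, so skip the file
        docs.append(converter(path))
    return docs
```

The same `converters` mapping could then be passed at index time and at query time, so both steps process files identically.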

Describe alternatives you've considered

Additional context

I can make a PR if this feature would be accepted, but I'm interested in what the preferred API would look like.

masci commented 11 months ago

Hi @DanShatford sorry for the late reply!

I would add an optional parameter to convert_files_to_docs taking a List[Path] and simply append those to the files that were found in dir_path (if any).
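A minimal sketch of how the path gathering could work, assuming a new optional `file_paths` parameter. The helper name `collect_file_paths`, the parameter name, and the suffix list are hypothetical and not taken from the Haystack codebase; the sketch only collects and groups paths and does not call any converter.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

# Suffixes chosen for illustration only.
ALLOWED_SUFFIXES: Tuple[str, ...] = (".pdf", ".txt", ".docx")


def collect_file_paths(
    dir_path: Optional[Path] = None,
    file_paths: Optional[Iterable[Path]] = None,  # hypothetical new parameter
    allowed_suffixes: Tuple[str, ...] = ALLOWED_SUFFIXES,
) -> Dict[str, List[Path]]:
    """Gather files from dir_path (if given), append the explicit file_paths,
    and group the supported ones by lower-cased suffix."""
    candidates: List[Path] = []
    if dir_path is not None:
        # Walk the directory recursively; sort for a deterministic order.
        candidates.extend(sorted(p for p in Path(dir_path).glob("**/*") if p.is_file()))
    if file_paths is not None:
        # Explicit paths are appended after whatever was found in dir_path.
        candidates.extend(Path(p) for p in file_paths)

    by_suffix: Dict[str, List[Path]] = defaultdict(list)
    for path in candidates:
        suffix = path.suffix.lower()
        if suffix in allowed_suffixes:
            by_suffix[suffix].append(path)
    return dict(by_suffix)
```

With both arguments optional, existing callers that pass only `dir_path` would be unaffected, and a query-time caller could pass only `file_paths`.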

DanShatford commented 11 months ago

I'll make a PR for this.