deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Build a CSVToDocument Component #8036

Closed tradicio closed 1 day ago

tradicio commented 1 month ago

Is your feature request related to a problem? Please describe. Haystack currently supports conversion of several file formats such as .txt, .pdf, and Markdown. It would be very useful to also have a component that converts a CSV file into a list of Document objects.

Describe the solution you'd like I would like to implement a CSVToDocument component that loads CSV files into a sequence of Document objects, with each row of the CSV file translated into one Document. This seems like the most natural mapping, since each row of a CSV file usually represents a distinct data record.

Each row could be converted into a key:value pair so that the Document output could be the following:

Document(id=XXX, content='column1: value1\ncolumn2: value2\ncolumn3: value3', meta={'row': 0, 'source': './example.csv'})
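The row-to-Document mapping described above could be sketched roughly as follows. Note this is a hypothetical illustration, not the actual component: the `Document` dataclass here is a minimal stand-in (the real one would come from `haystack`), and `csv_to_documents` is an invented helper name.

```python
import csv
import io
from dataclasses import dataclass, field


# Minimal stand-in for Haystack's Document, only to keep this sketch
# self-contained; a real component would use haystack's Document.
@dataclass
class Document:
    content: str
    meta: dict = field(default_factory=dict)


def csv_to_documents(csv_text: str, source: str) -> list:
    """Turn each CSV row into one Document whose content is
    'column: value' pairs joined by newlines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{col}: {val}" for col, val in row.items())
        docs.append(Document(content=content, meta={"row": i, "source": source}))
    return docs


docs = csv_to_documents("name,age\nAda,36\nAlan,41\n", "./example.csv")
print(docs[0].content)
# name: Ada
# age: 36
```

Using `csv.DictReader` keeps the column names attached to each value, which makes the `'column: value'` formatting a one-liner per row.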

Describe alternatives you've considered Unstructured is already integrated in Haystack, so there is already a way to convert .csv files. Nevertheless, I think it is useful to have a component specifically designed for this purpose that does not require generating an API key to communicate with an external service.

anakin87 commented 1 month ago

Hey, @tradicio! Just a quick clarification: Unstructured can also run locally (without API keys) using Docker, as specified here. (I understand it is not properly documented and I'll open an issue to improve docs...)

CarlosFerLo commented 1 month ago

@tradicio Thank you for your feature request and the detailed description. I understand that you are looking to implement a CSVToDocument component that converts each row of a CSV file into a separate Document object, with key-value pairs representing the column data.

However, I'd like to propose an alternative approach that could offer more flexibility and generality. The Document data class in Haystack has a dataframe property intended to hold tabular data as a pandas DataFrame. This suggests that an implementation of a CSV loader could initially load the entire CSV into this dataframe property, providing a comprehensive representation of the tabular data.

To address your specific use case of converting each CSV row into individual Document objects, we could introduce a document processor component. This component could be designed to transform the loaded document into the desired format. Specifically, it could accept a Callable[[List[Document]], List[Document]] at initialization, which would be applied in the run method to process documents as needed. This approach not only meets your requirement but also lays the foundation for expanding document preprocessing to be more multimodal-friendly.
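The processor idea above could look something like this. Everything here is a hypothetical sketch of the proposal, not an existing Haystack API: `DocumentProcessor` and `split_rows` are invented names, and `Document` is again a minimal stand-in dataclass.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# Stand-in for haystack's Document, to keep the sketch runnable.
@dataclass
class Document:
    content: str
    meta: dict = field(default_factory=dict)


class DocumentProcessor:
    """Hypothetical component: applies a user-supplied
    Callable[[List[Document]], List[Document]] in its run method."""

    def __init__(self, processing_fn: Callable[[List[Document]], List[Document]]):
        self.processing_fn = processing_fn

    def run(self, documents: List[Document]) -> dict:
        return {"documents": self.processing_fn(documents)}


def split_rows(docs: List[Document]) -> List[Document]:
    """Example processing function: split one tabular document into
    one Document per line, carrying over the original metadata."""
    out = []
    for doc in docs:
        for i, line in enumerate(doc.content.splitlines()):
            out.append(Document(content=line, meta={**doc.meta, "row": i}))
    return out


processor = DocumentProcessor(processing_fn=split_rows)
result = processor.run([Document(content="a,1\nb,2", meta={"source": "t.csv"})])
# result["documents"] now holds two per-row Documents
```

Because the transformation is just a plain callable, the same processor could cover row splitting, column filtering, or any other reshaping without new component classes.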

This solution combines the best of both worlds: leveraging the existing capability to handle tabular data and providing the flexibility to process and convert documents into any desired format.

I hope this alternative is helpful and look forward to your thoughts on this approach.

tradicio commented 1 month ago

Thanks @CarlosFerLo for your response!

So far, I have not made much use of the Document property that holds a pandas DataFrame. Personally, I have found it of little use in real-world applications, for example when inserting a CSV file into a DocumentStore and then embedding the information it contains.

For this reason, I thought of an alternative solution. Given a CSV file as input, a CSVToDocument component could output a list of Documents, one per CSV row. In my opinion, those rows carry most of the information contained in the CSV file.

Anyhow, I completely agree that your solution could be more versatile and include more use cases.

s-a commented 1 month ago

is this related to #7784 ?

CarlosFerLo commented 1 month ago

@s-a I believe that although they are both new converters, each of them handles different file types with different data encoded in them.

CarlosFerLo commented 1 month ago

@tradicio I have never used this capability of Documents before, and I don't see a direct use for it right now. I believe these new components are a first step toward building support for this type of data.

Regarding your formatting of the document content from single rows, I believe it would help to add a prefix with a little context about what the data describes. Also, ignoring fields such as ids might give you better performance. You could store the real row inside the dataframe field instead.
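The two suggestions above (a context prefix, and skipping opaque columns such as ids) could be sketched like this. `row_to_content` is a hypothetical helper name, and the prefix text and column names are made up for illustration.

```python
def row_to_content(row: dict, prefix: str, exclude: set) -> str:
    """Format one CSV row as text: a context prefix followed by
    'column: value' lines, skipping excluded columns (e.g. ids)."""
    body = "\n".join(f"{col}: {val}" for col, val in row.items() if col not in exclude)
    return f"{prefix}\n{body}"


row = {"id": "42", "name": "Ada", "role": "engineer"}
print(row_to_content(row, prefix="Employee record:", exclude={"id"}))
```

The excluded raw row (ids included) could still be preserved elsewhere, e.g. in the Document's metadata or dataframe field, so no information is lost for downstream use.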

I encourage you to open a PR on the matter; if you need anything, just contact me or mention me in a comment.

srini047 commented 5 days ago

@tradicio @CarlosFerLo I have made the draft PR https://github.com/deepset-ai/haystack/pull/8307. Looking forward to your support and feedback on the same.