[x] I have looked for existing issues (including closed) about this
Feature Request: Implement CSV Loader for Document Processing
Motivation
Rig users often need to work with structured data stored in CSV files, so we need an easy way to load and process CSV documents for use in RAG systems and other document processing tasks. A CSV loader would let users bring tabular data into their NLP pipelines, making Rig more useful for tasks such as data analysis, information retrieval, and content summarization.
Proposal
Implement a CsvLoader struct that implements the DocumentLoader trait. The loader should:
Accept a file path to a CSV file.
Parse the CSV file using the csv crate.
Convert the CSV data into a format suitable for embedding and further processing within Rig.
Handle potential errors such as file not found, parsing errors, or invalid CSV structures.
Provide options for customization, such as specifying delimiters or handling headers.
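To make the error handling and customization points above concrete, here is a minimal sketch of what the supporting types might look like. Everything here is hypothetical: `CsvLoaderOptions`, `CsvLoaderError`, and their fields are illustrative names, not part of Rig or the csv crate, and the sketch deliberately does not guess at the `DocumentLoader` trait's signature.

```rust
use std::fmt;
use std::io;

/// Hypothetical options for a CSV loader; names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct CsvLoaderOptions {
    /// Field delimiter as a single byte, e.g. b',' or b'\t'.
    pub delimiter: u8,
    /// Whether the first record holds column names.
    pub has_headers: bool,
}

impl Default for CsvLoaderOptions {
    fn default() -> Self {
        Self { delimiter: b',', has_headers: true }
    }
}

/// Hypothetical error type covering the failure modes listed above.
#[derive(Debug)]
pub enum CsvLoaderError {
    /// The file could not be opened or read (includes "file not found").
    Io(io::Error),
    /// A record could not be parsed.
    Parse { line: u64, message: String },
    /// A record's field count does not match the header count.
    FieldCount { line: u64, expected: usize, found: usize },
}

impl fmt::Display for CsvLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Io(e) => write!(f, "I/O error: {e}"),
            Self::Parse { line, message } => {
                write!(f, "parse error on line {line}: {message}")
            }
            Self::FieldCount { line, expected, found } => {
                write!(f, "line {line}: expected {expected} fields, found {found}")
            }
        }
    }
}

impl std::error::Error for CsvLoaderError {}

// Lets `?` convert I/O failures into the loader's error type.
impl From<io::Error> for CsvLoaderError {
    fn from(e: io::Error) -> Self {
        Self::Io(e)
    }
}
```

A tab-separated file could then be requested with `CsvLoaderOptions { delimiter: b'\t', ..Default::default() }`. These two options correspond to settings the csv crate already exposes on its reader builder, so a loader could pass them through rather than reimplement them.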
The implementation should focus on converting CSV data into a single document for embedding, with each row formatted as "header: value" pairs, separated by newlines.
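As a rough illustration of the output format described above, the sketch below covers only the formatting step. It assumes the headers and rows have already been parsed (for example by the csv crate) and uses only the standard library. The function names are hypothetical, and separating rows with a blank line is an assumption; the proposal only states that pairs are separated by newlines.

```rust
/// Hypothetical helper: turns one parsed row into "header: value" lines.
/// If the row and header lengths differ, `zip` stops at the shorter one.
fn format_row(headers: &[&str], row: &[&str]) -> String {
    headers
        .iter()
        .zip(row.iter())
        .map(|(h, v)| format!("{h}: {v}"))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Hypothetical helper: builds a single document from all rows.
/// Assumption: rows are separated by a blank line so they stay distinguishable.
fn rows_to_document(headers: &[&str], rows: &[Vec<&str>]) -> String {
    rows.iter()
        .map(|row| format_row(headers, row))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

With headers `name` and `age` and a row `Ada, 36`, the row would be rendered as the two lines `name: Ada` and `age: 36`.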
Alternatives
Row-based Embedding: Instead of creating a single document, we could create separate embeddings for each row. This would allow for more granular retrieval but might increase processing time and storage requirements.
Drawbacks: Increased complexity in implementation and potential performance impact for large CSV files.
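For comparison, the row-based alternative could look roughly like the sketch below: the same "header: value" formatting, but returning one document per row so each can be embedded separately. The function name is hypothetical, and the rows are assumed to be parsed already.

```rust
/// Hypothetical helper: one document per row, each made of "header: value" lines.
/// Each returned string would be embedded on its own.
fn rows_to_documents(headers: &[&str], rows: &[Vec<&str>]) -> Vec<String> {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter())
                .map(|(h, v)| format!("{h}: {v}"))
                .collect::<Vec<_>>()
                .join("\n")
        })
        .collect()
}
```

The number of documents, and therefore embeddings, grows linearly with the row count, which is where the extra processing time and storage would come from.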
Using a pandas-like library: We could use a fuller data processing library such as polars to handle CSV files, which would offer more advanced features for data manipulation.
Drawbacks: Introduces a heavier dependency, which might not be necessary for simple CSV processing.
Custom parsing without csv crate: We could implement CSV parsing without relying on the csv crate, giving us more control over the parsing process.
Drawbacks: Reinventing the wheel, potentially introducing bugs, and increasing maintenance burden.
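One small example of the kind of bug hand-rolled parsing invites: splitting on commas looks sufficient until a quoted field contains a comma. The function below is deliberately naive and purely illustrative.

```rust
/// Deliberately naive parser: splits on every comma and ignores quoting.
fn naive_split(line: &str) -> Vec<&str> {
    line.split(',').collect()
}
```

Given the line `1,"Smith, John",42`, this returns four pieces instead of three fields, because the comma inside the quotes is treated as a delimiter. Quoting, escaped quotes, and embedded newlines are cases a dedicated CSV parser such as the csv crate already handles.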
The proposed solution was chosen because it offers a good balance between simplicity, performance, and flexibility. It leverages the well-tested csv crate for parsing while allowing for future enhancements if more advanced features are needed.