0xPlaygrounds / rig

⚙️🦀 Build portable, modular & lightweight Fullstack Agents
https://rig.rs
MIT License

feat: Add CSV Loader to Document Loaders in Rig #29

Open Tachikoma000 opened 1 month ago

Feature Request: Implement CSV Loader for Document Processing

Motivation

Rig users often work with structured data stored in CSV files, so we need an easy way to load and process CSV documents for use in RAG systems and other document processing tasks. A CSV loader would let users bring tabular data into their NLP pipelines, making Rig more useful for use cases such as data analysis, information retrieval, and content summarization.

Proposal

Implement a CsvLoader struct that implements the DocumentLoader trait. The loader should:

  1. Accept a file path to a CSV file.
  2. Parse the CSV file using the csv crate.
  3. Convert the CSV data into a format suitable for embedding and further processing within Rig.
  4. Handle potential errors such as file not found, parsing errors, or invalid CSV structures.
  5. Provide options for customization, such as specifying delimiters or handling headers.
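
As a rough illustration of points 4 and 5, the configuration and error surface of such a loader might look like the sketch below. Everything here is hypothetical: the `CsvLoaderError` variants, the fields, and the `with_delimiter` / `with_headers` builder methods are illustrative names, not part of Rig. The `DocumentLoader` trait is left out because its signature is not given in this issue. The two options are chosen because they map directly onto the `delimiter` and `has_headers` settings of the csv crate's `ReaderBuilder`.

```rust
use std::fmt;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type covering the failure cases listed in the proposal.
#[derive(Debug)]
pub enum CsvLoaderError {
    /// The file could not be opened or read (for example, file not found).
    Io(io::Error),
    /// The parser rejected the input; the message format is illustrative.
    Parse(String),
    /// A row had a different number of fields than the header row.
    InvalidStructure { row: usize, expected: usize, found: usize },
}

impl fmt::Display for CsvLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Io(e) => write!(f, "I/O error: {e}"),
            Self::Parse(msg) => write!(f, "CSV parse error: {msg}"),
            Self::InvalidStructure { row, expected, found } => {
                write!(f, "row {row}: expected {expected} fields, found {found}")
            }
        }
    }
}

impl std::error::Error for CsvLoaderError {}

impl From<io::Error> for CsvLoaderError {
    fn from(e: io::Error) -> Self {
        Self::Io(e)
    }
}

/// Hypothetical loader configuration (names are assumptions, not Rig's API).
#[derive(Debug, Clone)]
pub struct CsvLoader {
    pub path: PathBuf,
    pub delimiter: u8,
    pub has_headers: bool,
}

impl CsvLoader {
    /// Defaults to comma-delimited input with a header row.
    pub fn new(path: impl AsRef<Path>) -> Self {
        Self {
            path: path.as_ref().to_path_buf(),
            delimiter: b',',
            has_headers: true,
        }
    }

    /// Overrides the field delimiter (a single byte, e.g. `b';'`).
    pub fn with_delimiter(mut self, delimiter: u8) -> Self {
        self.delimiter = delimiter;
        self
    }

    /// Sets whether the first row should be treated as headers.
    pub fn with_headers(mut self, has_headers: bool) -> Self {
        self.has_headers = has_headers;
        self
    }
}
```

A caller would then write something like `CsvLoader::new("data.csv").with_delimiter(b';')` and hand the stored settings to the parser when loading.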

The implementation should focus on converting CSV data into a single document for embedding, with each row formatted as "header: value" pairs, separated by newlines.
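
To make the target format concrete, here is a minimal sketch of the formatting step alone. It assumes the headers and rows have already been parsed (by the csv crate, per the proposal) into plain string slices. The function name `rows_to_document` is hypothetical, and separating rows with a blank line is one possible reading of "separated by newlines", not a settled decision.

```rust
/// Renders parsed CSV data as a single document.
/// Each row becomes a block of "header: value" lines; rows are separated
/// by a blank line (an assumption about the intended layout).
/// If a row has fewer or more fields than there are headers, `zip`
/// silently stops at the shorter of the two.
pub fn rows_to_document(headers: &[&str], rows: &[Vec<&str>]) -> String {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter())
                .map(|(header, value)| format!("{header}: {value}"))
                .collect::<Vec<_>>()
                .join("\n")
        })
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

For headers `name, age` and the rows `Ada, 36` and `Linus, 54`, this yields two blocks, `name: Ada` / `age: 36` and `name: Linus` / `age: 54`, with a blank line between them. A real loader would probably report mismatched row widths as an error rather than truncating.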

Alternatives

  1. Row-based Embedding: Instead of creating a single document, we could create separate embeddings for each row. This would allow for more granular retrieval but might increase processing time and storage requirements.

    Drawbacks: Increased complexity in implementation and potential performance impact for large CSV files.

  2. Using a pandas-like library: We could use a more full-featured data processing library such as polars to handle CSV files, which would offer more advanced data manipulation features.

    Drawbacks: Introduces a heavier dependency, which might not be necessary for simple CSV processing.

  3. Custom parsing without the csv crate: We could implement CSV parsing ourselves instead of relying on the csv crate, giving us more control over the parsing process.

    Drawbacks: Reinventing the wheel, potentially introducing bugs, and increasing maintenance burden.

The proposed solution was chosen because it offers a good balance between simplicity, performance, and flexibility. It leverages the well-tested csv crate for parsing while allowing for future enhancements if more advanced features are needed.
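
For comparison, the row-based alternative (option 1 above) differs from the proposed format mainly in what gets returned: one string per row instead of one string for the whole file. The sketch below assumes already-parsed headers and rows; `rows_to_documents` is a hypothetical name, and each returned string would be embedded separately.

```rust
/// Renders each CSV row as its own document of "header: value" lines,
/// so every row can be embedded and retrieved independently.
/// As with any `zip`, extra headers or extra fields are ignored.
pub fn rows_to_documents(headers: &[&str], rows: &[Vec<&str>]) -> Vec<String> {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter())
                .map(|(header, value)| format!("{header}: {value}"))
                .collect::<Vec<_>>()
                .join("\n")
        })
        .collect()
}
```

The number of embeddings then grows with the number of rows, which is the storage and processing cost noted in the drawbacks.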