IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Capability to chunk text for RAG systems #447

Closed by Bytes-Explorer 2 weeks ago

Bytes-Explorer commented 3 months ago

Search before asking

Component

Transforms/Other

Feature

The goal is to add a new transform that takes extracted text and splits it into chunks. The input will be parquet files in which each document is stored in one row. The output will be chunks, with each chunk stored in its own row. Chunk size should be a parameter exposed to the user.

This new transform should be added alongside the other language modules here: https://github.com/IBM/data-prep-kit/tree/dev/transforms/language
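
To make the requested behavior concrete, here is a minimal sketch of the row-per-document to row-per-chunk expansion. It is not the data-prep-kit implementation: rows are modeled as plain Python dicts rather than parquet tables, the column names `document_id`, `contents`, and `chunk_index` are hypothetical, and measuring `chunk_size` in characters is an assumption (a real transform might count tokens or split on document structure instead).

```python
from typing import Iterator


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Yield consecutive slices of `text`, each at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def chunk_rows(rows: list[dict], chunk_size: int) -> list[dict]:
    """Expand one-row-per-document input into one-row-per-chunk output.

    Column names here are hypothetical, chosen only for illustration.
    """
    out = []
    for row in rows:
        for index, chunk in enumerate(chunk_text(row["contents"], chunk_size)):
            out.append({
                "document_id": row["document_id"],  # link back to the source document
                "chunk_index": index,               # position of the chunk in the document
                "contents": chunk,
            })
    return out


# Two documents, one row each, as they might arrive from the input files.
documents = [
    {"document_id": "doc-1", "contents": "abcdefghij"},
    {"document_id": "doc-2", "contents": "xyz"},
]
chunks = chunk_rows(documents, chunk_size=4)
```

Keeping `document_id` and `chunk_index` on every output row lets downstream RAG steps trace a retrieved chunk back to its source document. Note that in this sketch an empty document produces no output rows.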

Are you willing to submit a PR?

dolfim-ibm commented 3 months ago

Done in https://github.com/IBM/data-prep-kit/pull/461