IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0
171 stars, 111 forks

doc_chunk updates and new parameters #591

Closed: dolfim-ibm closed this 2 weeks ago

dolfim-ibm commented 2 weeks ago

Why are these changes needed?

Updating doc_chunk to:

  1. Use the upstream docling-core library
  2. Expose the min_chunk_len parameter. See also issue https://github.com/IBM/data-prep-kit/issues/590
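
The description does not say how min_chunk_len is applied. As a rough illustration only, assuming it acts as a minimum character count below which a chunk is discarded, the idea could be sketched as follows. The function name `filter_short_chunks` and the sample data are hypothetical and are not taken from doc_chunk or docling-core.

```python
from typing import Iterable, List


def filter_short_chunks(chunks: Iterable[str], min_chunk_len: int) -> List[str]:
    """Hypothetical helper: keep chunks with at least min_chunk_len characters.

    Assumption: length is measured in characters after stripping surrounding
    whitespace. The real transform may measure or handle short chunks differently.
    """
    if min_chunk_len < 0:
        raise ValueError("min_chunk_len must be non-negative")
    return [c for c in chunks if len(c.strip()) >= min_chunk_len]


# Made-up sample chunks, for illustration only.
chunks = ["Title", "A longer paragraph that carries real content.", "ok"]
kept = filter_short_chunks(chunks, min_chunk_len=10)
print(kept)
```

Under this assumption, a larger threshold removes heading-only or fragmentary chunks at the cost of possibly losing short but meaningful text, which is one reason to make the value configurable.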

Related issue number (if any).

Refs https://github.com/IBM/data-prep-kit/issues/590