IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0
307 stars 134 forks

[Feature] Modify pdf2parquet to accept a parquet file with the payload in the content column #792

Open touma-I opened 1 week ago

touma-I commented 1 week ago

Search before asking

Component

Other

Feature

Extend the pdf2parquet transform to accept a Parquet table as input, with each row's document payload (PDF, HTML, etc.) stored in the content column.
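
To make the request concrete, here is a minimal sketch of how rows from such a table might be walked and their payload type guessed before conversion. It is not the transform's actual code: the `filename` column, the helper names `sniff_kind` and `iter_payloads`, and the magic-byte heuristic are all assumptions made for illustration. The rows are plain dictionaries standing in for a Parquet table (for example, what pyarrow's `Table.to_pylist()` would return).

```python
from __future__ import annotations

from typing import Iterable, Iterator

# Hypothetical column names; the issue itself only names the "content" column.
CONTENT_COLUMN = "content"
NAME_COLUMN = "filename"


def sniff_kind(payload: bytes) -> str:
    """Guess the payload type from its leading bytes (a heuristic, not a full detector)."""
    head = payload[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


def iter_payloads(rows: Iterable[dict]) -> Iterator[tuple[str, str, bytes]]:
    """Yield (name, kind, payload) for every row that carries a binary payload."""
    for index, row in enumerate(rows):
        payload = row.get(CONTENT_COLUMN)
        if not isinstance(payload, (bytes, bytearray)):
            continue  # skip rows with a missing or non-binary payload
        name = row.get(NAME_COLUMN) or f"row-{index}"
        yield name, sniff_kind(bytes(payload)), bytes(payload)


# Stand-in for rows read from a Parquet file.
rows = [
    {"filename": "a.pdf", "content": b"%PDF-1.7\n..."},
    {"filename": "b.html", "content": b"<!DOCTYPE html><html></html>"},
    {"filename": "c.bin", "content": None},
]
results = [(name, kind) for name, kind, _ in iter_payloads(rows)]
print(results)
```

Each yielded payload could then be handed to whatever conversion path the transform already uses for files on disk. Whether the type should be sniffed from the bytes or read from an explicit column is a design choice left open here.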

Are you willing to submit a PR?