IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

Create new transform to ingest markdown (.md) files and convert to parquet format #364

Open bogdanscode opened 3 months ago

bogdanscode commented 3 months ago

Why are these changes needed?

Convert .md files to Parquet files so that they can be processed by the data prep pipeline. This is the preferred input for InstructLab.
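
For readers unfamiliar with the idea, here is a minimal sketch of what such a conversion could look like. It is not the code in this PR: the function names (`md_files_to_rows`, `write_parquet`) and the column names (`title`, `document`, `contents`, `size`, `hash`) are assumptions chosen for illustration, and the Parquet step assumes `pyarrow` is installed.

```python
import hashlib
from pathlib import Path


def md_files_to_rows(root: Path) -> list[dict]:
    """Read every .md file under root into one dict per document.

    The column names are hypothetical, not the transform's actual schema.
    """
    rows = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        raw = text.encode("utf-8")
        rows.append(
            {
                "title": path.name,
                "document": str(path.relative_to(root)),
                "contents": text,
                "size": len(raw),  # size in bytes, not characters
                "hash": hashlib.sha256(raw).hexdigest(),
            }
        )
    return rows


def write_parquet(rows: list[dict], out_file: Path) -> None:
    """Write the rows as one Parquet file (one row per markdown file)."""
    # Imported here so the row-building step works without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(rows)
    pq.write_table(table, str(out_file))
```

Usage would be along the lines of `write_parquet(md_files_to_rows(Path("docs")), Path("docs.parquet"))`, where both paths are placeholders. Keeping the file reading separate from the Parquet write makes the first step easy to test on its own.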

Related issue number (if any).

#178

daw3rd commented 3 months ago

Also, in the future can you sign your commits?

daw3rd commented 3 months ago

Oh, and the code2parquet transform is in transforms/code/code2parquet.