IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0
53 stars 26 forks

[Feature] Build a transform to remove headers from code files #63

Open Bytes-Explorer opened 2 months ago

Bytes-Explorer commented 2 months ago

Component

Other

Feature

Code files often have headers. These headers do not contain information relevant to LLMs, and may also contain PII. We want to build a new transform to remove this header information from code files. The transform should be built so that it can work across 300+ programming languages. One possible approach is for the transform to take as input a configuration file listing programming language names and the characters used for commenting in each language. It should then identify the header information in files written in the languages specified in the configuration file and edit those files to remove it.
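To make the configuration-driven idea concrete, here is a minimal sketch in Python. It is not the implementation from the PR. The configuration schema (`line` prefixes and `block` delimiter pairs per language), the name `remove_header`, and the rule that a header is the leading run of comments before the first line of code are all assumptions made for illustration.

```python
import json

# Hypothetical configuration: language name -> comment syntax.
# "line" lists line-comment prefixes; "block" lists [open, close] pairs.
# This schema is an assumption, not something defined by the project.
CONFIG_JSON = """
{
  "python": {"line": ["#"],  "block": []},
  "c":      {"line": ["//"], "block": [["/*", "*/"]]},
  "html":   {"line": [],     "block": [["<!--", "-->"]]}
}
"""
config = json.loads(CONFIG_JSON)


def remove_header(source: str, language: str, config: dict) -> str:
    """Strip the leading run of comments (assumed to be the header).

    Files in languages missing from the config, files without a leading
    comment, and files with an unterminated block comment are returned
    unchanged.
    """
    syntax = config.get(language.lower())
    if syntax is None:
        return source

    lines = source.splitlines(keepends=True)
    kept = []
    i = 0
    # Keep a shebang line; it looks like a comment but is not a header.
    if lines and lines[0].startswith("#!"):
        kept.append(lines[0])
        i = 1

    seen_comment = False
    while i < len(lines):
        stripped = lines[i].strip()
        if not stripped:  # blank line inside or right after the header
            i += 1
            continue
        if any(stripped.startswith(p) for p in syntax["line"]):
            seen_comment = True
            i += 1
            continue
        block = next(
            ((o, c) for o, c in syntax["block"] if stripped.startswith(o)),
            None,
        )
        if block is None:
            break  # first line of real code
        open_delim, close_delim = block
        rest = stripped[len(open_delim):]
        j = i
        while close_delim not in rest:
            j += 1
            if j >= len(lines):
                return source  # unterminated block comment
            rest = lines[j]
        if rest.split(close_delim, 1)[1].strip():
            break  # code shares a line with the comment; stop here
        seen_comment = True
        i = j + 1

    if not seen_comment:
        return source
    return "".join(kept) + "".join(lines[i:])


c_src = (
    "/*\n * Copyright 2024 Example Corp.\n * Author: jane@example.com\n */\n"
    "\n#include <stdio.h>\nint main(void) { return 0; }\n"
)
cleaned = remove_header(c_src, "c", config)
```

In this sketch, supporting another language only requires a new configuration entry, which is the property the proposal asks for. A real transform would also need to handle cases this sketch ignores, such as Python docstrings used as headers or encoding declarations that must be preserved.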

Bytes-Explorer commented 2 weeks ago

@Param-S Can this be closed?

Param-S commented 2 weeks ago

PR is in review. We will close this issue once the PR gets merged.