IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0
195 stars 114 forks

Develop a new module that identifies and masks large variable initializations in code files. #385

Open sapthasurendran opened 3 months ago

sapthasurendran commented 3 months ago

Component

Transforms/Other

Feature

Create a new module that identifies and masks large variable initializations in code files, with the goal of improving the quality of the code data. The new module should:

  1. Identify Large Initializations: Detect variable initializations that exceed a predefined threshold of lines or characters.
  2. Mask Identified Sections: Replace the detected large initializations with a placeholder.
  3. Provide Configuration Options: Allow customization of the threshold for what constitutes a "large" initialization and of the format of the masking placeholder.
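
The three requirements above could be sketched roughly as follows. This is only an illustration, not an existing data-prep-kit transform: the function name `mask_large_initializations`, the parameters `max_chars` and `max_lines`, their default values, and the placeholder text are all assumptions. The sketch uses a simple, language-agnostic heuristic (an `=` followed by a bracketed literal) and does not handle brackets inside string literals or comments.

```python
import re

# Hypothetical names and defaults, chosen only for illustration.
CLOSERS = {"[": "]", "{": "}", "(": ")"}
# An "=" that is not part of ==, !=, <=, >=, followed by an opening bracket.
ASSIGN_RE = re.compile(r"(?<![=!<>])=\s*([\[{(])")


def find_matching_close(text, start):
    """Return the index of the bracket that closes text[start], or -1."""
    opener = text[start]
    closer = CLOSERS[opener]
    depth = 0
    for i in range(start, len(text)):
        if text[i] == opener:
            depth += 1
        elif text[i] == closer:
            depth -= 1
            if depth == 0:
                return i
    return -1  # unbalanced brackets


def mask_large_initializations(source, max_chars=200, max_lines=10,
                               placeholder="<MASKED_INITIALIZER>"):
    """Replace the body of oversized bracketed initializers with a placeholder."""
    out = []
    pos = 0
    while True:
        match = ASSIGN_RE.search(source, pos)
        if match is None:
            break
        open_idx = match.start(1)
        close_idx = find_matching_close(source, open_idx)
        if close_idx == -1:
            # Cannot find the end of this initializer; leave it untouched.
            out.append(source[pos:open_idx + 1])
            pos = open_idx + 1
            continue
        body = source[open_idx + 1:close_idx]
        too_large = len(body) > max_chars or body.count("\n") + 1 > max_lines
        if too_large:
            # Keep the assignment and the brackets, mask only the contents.
            out.append(source[pos:open_idx + 1])
            out.append(placeholder)
            out.append(source[close_idx])
        else:
            out.append(source[pos:close_idx + 1])
        pos = close_idx + 1
    out.append(source[pos:])
    return "".join(out)


if __name__ == "__main__":
    values = ", ".join(str(i) for i in range(500))
    code = "small = [1, 2, 3]\ntable = [" + values + "]\nprint(table)\n"
    print(mask_large_initializations(code))
```

Keeping the variable name and the surrounding brackets leaves the masked file structurally similar to the original, while both thresholds and the placeholder remain configurable. A production version would more likely rely on a real parser for each target language than on this regular-expression heuristic.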

sapthasurendran commented 3 months ago

Some of the code files with large initializations:

040aa9db46704349a9cb21ba6567ec97_0000007b264352-8a0d-11ee-9b9b-dae94206066d1413.2222222222222.txt
38da3361a6d74ed5a05420f1c6e1893a_000000b80e44a6-8a0b-11ee-a06e-baafe7c9df4d57398.28571428572.txt