IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Add license filtering for code modules #139

Open Bytes-Explorer opened 5 months ago

Bytes-Explorer commented 5 months ago

Component

Transforms/code/code_quality

Feature

Capability to filter by permissive licenses for any new code data as a new module.
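As a rough illustration of the requested capability, the sketch below filters rows against a permissive-license allowlist. It uses plain Python dicts rather than the toolkit's own table types, and the column name `license`, the allowlist contents, and the function name `filter_permissive` are all hypothetical, not taken from the project.

```python
# Hypothetical allowlist of permissive license identifiers (lowercased);
# a real module would presumably take this as configuration.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}


def filter_permissive(rows, license_column="license", allowed=PERMISSIVE_LICENSES):
    """Keep only rows whose license value is in the allowlist.

    Rows with a missing or empty license are dropped, since their
    license cannot be confirmed as permissive.
    """
    kept = []
    for row in rows:
        value = row.get(license_column)
        # Normalize case and whitespace before the membership check.
        if value and value.strip().lower() in allowed:
            kept.append(row)
    return kept


rows = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.py", "license": "GPL-3.0"},
    {"path": "c.py", "license": None},
    {"path": "d.py", "license": "Apache-2.0"},
]
permissive_rows = filter_permissive(rows)
```

Dropping rows with an unknown license is a conservative choice made for this sketch; a real module might instead keep them and flag them for review.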

Bytes-Explorer commented 4 months ago

Needed for release 1

daw3rd commented 4 months ago

@Bytes-Explorer I caved on proglang_select, but now we're repeating this pattern for a different column name? We can't have a new transform for every attribute we want to annotate or filter on. I recommend one of the following:

  1. Extend the filter transform so that, instead of removing rows, it can optionally keep all rows and add a new column that is set to True for rows that "would have been filtered". cc: @cmadam
  2. Create a new transform that generalizes what proglang_select does for an arbitrary column and set of values.

I would strongly recommend one of these alternatives to creating a new special purpose transform. We're a toolkit after all.
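To make the two alternatives concrete, here is a minimal sketch that covers both with one helper operating on plain Python dicts. The names `select_by_values`, `annotate_only`, and `would_be_filtered` are hypothetical and do not reflect the actual filter or proglang_select transform APIs.

```python
def select_by_values(rows, column, allowed_values, annotate_only=False,
                     flag_column="would_be_filtered"):
    """Filter or annotate rows by an arbitrary column and set of values.

    annotate_only=False: drop rows whose value is not allowed (alternative 2).
    annotate_only=True: keep every row and add a boolean column that is
    True for rows that would have been dropped (alternative 1).
    """
    allowed = set(allowed_values)
    if annotate_only:
        # Copy each row so the caller's input is not mutated.
        return [{**row, flag_column: row.get(column) not in allowed} for row in rows]
    return [row for row in rows if row.get(column) in allowed]


rows = [
    {"path": "a.py", "language": "Python", "license": "MIT"},
    {"path": "b.c", "language": "C", "license": "GPL-3.0"},
]

# The same helper serves both the language and the license use case.
python_only = select_by_values(rows, "language", {"Python"})
license_flags = select_by_values(rows, "license", {"MIT", "Apache-2.0"},
                                 annotate_only=True)
```

Because the column name and the value set are parameters, no new special-purpose transform is needed per attribute; whether that trade-off suits end users is the point debated in this thread.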

Bytes-Explorer commented 4 months ago

These are two separate functionalities, and we need to think about how an end user will use them. Many transforms may end with a filtering step. However, coupling all of these functionalities together may make the toolkit hard for an end user to use, especially as the number of functionalities grows.

daw3rd commented 1 month ago

This is in progress in PR #257