IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

Develop the testing strategy for new NLP modules being added to the kit #167

Open shahrokhDaijavad opened 3 months ago

shahrokhDaijavad commented 3 months ago

Component

Transforms/Other

Feature

We are adding two NLP modules (document quality and spoken language ID) and new code modules for HAP, license filtering, and PII to the kit. These need testing similar to (or better than!) what was done for the initial set of code modules.

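shahrokhDaijavad commented 2 months ago

The two new NLP modules, lang_id and doc_quality, are being merged. I have already unit-tested lang_id (3 test files on a local Mac). Both transforms are currently tested regularly on a large cluster by the inner repo team as part of the Pipelines testing, so we do not need a cluster testing strategy. For local testing (and inclusion in a new corresponding Notebook example), it would make sense to identify a small set of input files for which these transforms produce meaningfully observable output. I will work with Hamid and Dhiraj to identify such a set.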
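
To make the local-testing idea concrete, here is a minimal sketch of what a small-input test could look like. It does not use the kit's actual lang_id transform or API: `toy_lang_id`, `annotate`, the `contents` and `lang` column names, and the sample rows are all hypothetical stand-ins chosen for illustration. The point is only the shape of the test, where each hand-picked input has an expected output that is obvious by inspection.

```python
# Hypothetical stand-in for a language-id transform; NOT the data-prep-kit API.
# It guesses a language by counting hits against a tiny stopword list.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "de": {"der", "und", "ist", "das"},
    "fr": {"le", "et", "est", "les"},
}


def toy_lang_id(text: str) -> str:
    """Return the language whose stopwords occur most often, or 'unknown'."""
    words = text.lower().split()
    counts = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


def annotate(rows):
    """Return new rows with an added 'lang' column; input rows are not mutated."""
    return [{**row, "lang": toy_lang_id(row["contents"])} for row in rows]


# Small, hand-picked inputs whose expected output is easy to verify by eye.
SAMPLE_ROWS = [
    {"id": 1, "contents": "The cat is on the mat and the dog is outside"},
    {"id": 2, "contents": "Der Hund ist im Garten und das Wetter ist gut"},
    {"id": 3, "contents": "Le chat est sur le tapis et les enfants jouent"},
    {"id": 4, "contents": "12345 67890"},
]
EXPECTED_LANGS = ["en", "de", "fr", "unknown"]


def test_annotate_sample_rows():
    result = annotate(SAMPLE_ROWS)
    # Observable output: one language label per input row.
    assert [r["lang"] for r in result] == EXPECTED_LANGS
    # The transform should only add a column, not drop or reorder rows.
    assert [r["id"] for r in result] == [1, 2, 3, 4]
    # The original rows must be left untouched.
    assert all("lang" not in r for r in SAMPLE_ROWS)


if __name__ == "__main__":
    test_annotate_sample_rows()
    print("all sample rows matched")
```

The same test function can be picked up by a test runner such as pytest or reused in a Notebook cell. For the real transforms, the sample rows would be replaced by the small input files still to be identified, and the expected values by their reviewed outputs.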

The two new NLP modules; lang_id and doc_quality are being merged. I have already tested lang_id as a unit test (3 test files on a local Mac). Both these transforms are currently being tested regularly on a large cluster in the Pipelines testing by the inner repo team and we do not need a cluster testing strategy. For local testing (and inclusion in a new corresponding Notebook example), it would make sense to identify a small set of input files for which these transforms create meaningfully observable output. I will work with Hamid and Dhiraj in identifying such a set.