Develop the testing strategy for new NLP modules being added to the kit

IBM / data-prep-kit

Open source project for data preparation of LLM application builders

Apache License 2.0

120 stars 103 forks

Two new NLP modules, lang_id and doc_quality, are being merged. I have already unit-tested lang_id (3 test files on a local Mac). Both transforms are already tested regularly on a large cluster by the inner repo team as part of the Pipelines testing, so we do not need a separate cluster testing strategy. For local testing (and for inclusion in a new corresponding Notebook example), it would make sense to identify a small set of input files for which these transforms produce meaningfully observable output. I will work with Hamid and Dhiraj to identify such a set.
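
To make "meaningfully observable output" concrete, here is a rough sketch of how a candidate input set could be checked locally. It does not call the real lang_id transform: `toy_lang_id`, the `contents` column name, and the sample sentences are all hypothetical stand-ins, and the real transform would be plugged in where the toy function is passed. The check treats a sample set as useful only if the transform produces more than one distinct label and every label matches what a reviewer expects.

```python
from collections import Counter

# Hypothetical stand-in for the real lang_id transform (assumption, not the kit's API).
# It scores a text against tiny stopword lists and returns the best-matching code.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "de": {"der", "und", "ist", "das"},
    "fr": {"le", "et", "est", "les"},
}


def toy_lang_id(text: str) -> str:
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Candidate sample rows; "contents" is an assumed column name and the
# sentences are illustrative, not taken from any real test file.
SAMPLES = [
    {"contents": "The cat is on the mat and the dog is outside", "expected": "en"},
    {"contents": "Der Hund ist im Garten und das Wetter ist gut", "expected": "de"},
    {"contents": "Le chat est sur la table et les enfants jouent", "expected": "fr"},
]


def run_samples(transform, samples):
    """Apply the transform to each row and attach its label as a 'lang' field."""
    return [{**row, "lang": transform(row["contents"])} for row in samples]


def is_meaningful(results) -> bool:
    """True if the output has more than one distinct label and all labels match expectations."""
    labels = Counter(row["lang"] for row in results)
    return len(labels) > 1 and all(row["lang"] == row["expected"] for row in results)


if __name__ == "__main__":
    results = run_samples(toy_lang_id, SAMPLES)
    for row in results:
        print(row["lang"], "<-", row["contents"])
    print("meaningful sample set:", is_meaningful(results))
```

The same shape could be reused for doc_quality by swapping the expected label for an expected score range; that part is left open until the sample files are agreed.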

Develop the testing strategy for new NLP modules being added to the kit #167
