IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Extend `doc_quality` to include stop words annotation #811

Closed: Harmedox closed 1 day ago

Harmedox commented 1 week ago

Component

Transforms/Other

Feature

A document is expected to contain a minimum number of stop words to be considered good quality. This heuristic is used when preparing high-quality datasets such as FineWeb.
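As a rough illustration of the heuristic, the check could be sketched as below. The stop-word list, the threshold of 2, and the function names are assumptions chosen for the example; they are not taken from `doc_quality` or from the FineWeb pipeline.

```python
import re

# Hypothetical stop-word list and threshold, for illustration only.
STOP_WORDS = frozenset({"the", "be", "to", "of", "and", "that", "have", "with"})
MIN_STOP_WORDS = 2


def count_stop_words(text, stop_words=STOP_WORDS):
    """Count the tokens in `text` that are stop words (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in stop_words)


def has_enough_stop_words(text, min_count=MIN_STOP_WORDS):
    """Return True if `text` contains at least `min_count` stop words."""
    return count_stop_words(text) >= min_count
```

Natural prose tends to pass this check easily, while keyword lists and similar boilerplate often contain few or no stop words.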

Proposal:

Are you willing to submit a PR?

dangxuanhong commented 1 day ago

Hi @hamid-adebayo, I realized that counting English stop words (common words) is already implemented, and the result is annotated in the column `docq_contain_common_en_words` here.

So please consider using it by filtering out the docs/rows that have a False value in that column. Note that this implementation applies to English (`en`) only, but the code can be extended to any other language as long as its common stop words are provided.
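A minimal sketch of that filtering step, assuming the annotated rows have already been loaded as plain Python dictionaries. Only the column name `docq_contain_common_en_words` comes from this thread; the function name and the sample rows are hypothetical.

```python
# Column name as mentioned in this thread; everything else is illustrative.
COLUMN = "docq_contain_common_en_words"


def keep_rows_with_common_words(rows, column=COLUMN):
    """Keep only rows whose annotation column is truthy.

    Rows where the column is False or missing are dropped.
    """
    return [row for row in rows if bool(row.get(column, False))]


# Hypothetical sample rows standing in for the transform's annotated output.
sample_rows = [
    {"document_id": "a", COLUMN: True},
    {"document_id": "b", COLUMN: False},
    {"document_id": "c"},
]
filtered = keep_rows_with_common_words(sample_rows)
```

Treating a missing value as False is a design choice made for this sketch; a real pipeline might prefer to fail loudly when the column is absent.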

Harmedox commented 1 day ago

Got it! Thanks @dangxuanhong