Closed: Harmedox closed this 1 day ago
Hi @hamid-adebayo, I realized that counting English stop words (common words) has already been implemented, and the result is annotated in the column docq_contain_common_en_words here.
So please consider using it by filtering out docs/rows that have a False value in that column. Note that this implementation currently applies to English (en) only, but the code can be extended to any other language as long as its common stop words are provided.
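For illustration, the filtering step could look roughly like the sketch below. It assumes the rows are available as plain Python dicts; only the column name docq_contain_common_en_words comes from this thread, while the helper name and sample rows are hypothetical.

```python
# Column name taken from the thread; everything else here is illustrative.
COLUMN = "docq_contain_common_en_words"


def keep_docs_with_common_words(rows):
    """Return only the rows whose annotation column is True.

    Rows where the column is False or missing are dropped.
    """
    return [row for row in rows if row.get(COLUMN) is True]


# Hypothetical sample rows, only to show the shape of the data.
rows = [
    {"id": 1, "contents": "The cat sat on the mat.", COLUMN: True},
    {"id": 2, "contents": "zxcv qwer asdf", COLUMN: False},
    {"id": 3, "contents": "no annotation on this row"},
]

kept = keep_docs_with_common_words(rows)
kept_ids = [row["id"] for row in kept]
```

Dropping rows with a missing value is a choice made for this sketch; a real pipeline might prefer to keep them or flag them instead.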
Got it! Thanks @dangxuanhong
Search before asking
Component
Transforms/Other
Feature
A document is expected to contain a minimum number of stop words to be considered good quality. This heuristic is used in preparing high-quality datasets such as FineWeb.
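A minimal sketch of such a check follows, to make the idea concrete. The stop-word list, the threshold of 2, the tokenization, and the function names are all assumptions chosen for illustration; they are not taken from this repository or confirmed as FineWeb's exact settings.

```python
import re

# Illustrative English stop-word list (an assumption, not the project's list).
STOP_WORDS = frozenset({"the", "be", "to", "of", "and", "that", "have", "with"})


def count_stop_words(text, stop_words=STOP_WORDS):
    """Count occurrences of stop words in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in stop_words)


def has_enough_stop_words(text, min_stop_words=2, stop_words=STOP_WORDS):
    """Return True if the text contains at least min_stop_words stop words."""
    return count_stop_words(text, stop_words) >= min_stop_words


natural = "The cat sat on the mat with a hat"
keyword_spam = "Buy cheap pills now"

natural_ok = has_enough_stop_words(natural)
spam_ok = has_enough_stop_words(keyword_spam)
```

This version counts every occurrence; counting distinct stop words instead is an equally plausible variant, and supporting other languages would only require passing a different `stop_words` set.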
Proposal:
Are you willing to submit a PR?