NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

[IMP] Hard code labels for the quality and domain classifier #71

Open VibhuJawa opened 1 month ago

VibhuJawa commented 1 month ago

Is your feature request related to a problem? Please describe.

Users should not have to pass the labels to DomainClassifier and QualityClassifier every time they initialize these classifiers.

@ryantwolf mentioned that the labels are fixed for each classifier (they can't be reordered or altered on demand) and suggested that we refactor this so that the labels are hardcoded into DomainClassifier and QualityClassifier. That way the user doesn't have to enter the labels manually each time. It would also be good to list the possible values of the filter_by parameter in the constructor docstring of each class.

See link: https://github.com/NVIDIA/NeMo-Curator/pull/58#discussion_r1605337411
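
A rough sketch of what the refactor could look like, not the actual NeMo-Curator code. The class names (`SketchClassifier`, `SketchQualityClassifier`), the `LABELS` attribute, and the label strings are all hypothetical placeholders. Only the `filter_by` parameter name comes from the discussion above, and its validation behavior here is an assumption.

```python
from typing import List, Optional, Tuple


class SketchClassifier:
    """Hypothetical base class: each subclass hardcodes its own labels."""

    # Fixed, ordered labels. Subclasses override this; users never pass it in.
    LABELS: Tuple[str, ...] = ()

    def __init__(self, filter_by: Optional[List[str]] = None):
        # Reject filter values that are not among the hardcoded labels.
        unknown = [label for label in (filter_by or []) if label not in self.LABELS]
        if unknown:
            raise ValueError(
                f"Unknown filter_by values {unknown}; expected a subset of {list(self.LABELS)}"
            )
        self.labels = list(self.LABELS)
        self.filter_by = filter_by


class SketchQualityClassifier(SketchClassifier):
    """Hypothetical quality classifier with hardcoded labels.

    Args:
        filter_by: Optional list of labels to keep. Possible values
            (placeholders, not the real label set): "label_a",
            "label_b", "label_c".
    """

    LABELS = ("label_a", "label_b", "label_c")


# The caller no longer supplies labels at construction time.
classifier = SketchQualityClassifier(filter_by=["label_a"])
print(classifier.labels)
```

With this shape the label order lives in exactly one place per class, and the docstring can enumerate the valid `filter_by` values next to the constant they come from.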