NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Hardcode labels for domain and quality classifiers #122

Closed: sarahyurick closed this 4 days ago

sarahyurick commented 1 week ago

Closes #71

This is a copy of https://github.com/NVIDIA/NeMo-Curator/pull/95, which I had to close due to GitHub issues.

cc @VibhuJawa this should be ready for review.

sarahyurick commented 4 days ago

Closing in favor of the work for https://github.com/NVIDIA/NeMo-Curator/issues/72.

We should be able to get the labels directly from Hugging Face (https://huggingface.co/nvidia/domain-classifier), since they are present in config.id2label.
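
For reference, a minimal sketch of reading the labels from the config instead of hardcoding them. It assumes the `transformers` `AutoConfig.from_pretrained` call can load this model's config; the helper names `labels_from_id2label` and `load_domain_labels` are hypothetical and not part of NeMo Curator.

```python
from typing import Any, List, Mapping


def labels_from_id2label(id2label: Mapping[Any, str]) -> List[str]:
    """Return label names ordered by class index.

    Keys may be ints (as on a loaded transformers config) or strings
    (as in a raw config.json), so they are normalized to int first.
    """
    by_index = {int(idx): name for idx, name in id2label.items()}
    expected = list(range(len(by_index)))
    if sorted(by_index) != expected:
        raise ValueError("id2label indices are not contiguous from 0")
    return [by_index[i] for i in expected]


def load_domain_labels(model_id: str = "nvidia/domain-classifier") -> List[str]:
    """Hypothetical helper: fetch the config from the Hub (needs network)."""
    # Imported lazily so the pure helper above works without transformers.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_id)
    return labels_from_id2label(config.id2label)
```

Keeping the ordering logic separate from the download means it can be checked offline with a plain dict, and the index check fails loudly if the published mapping ever has gaps.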