NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
327 stars 32 forks

FastTextQualityFilter model file release #121

Open simplew2011 opened 1 week ago

simplew2011 commented 1 week ago

I can't find the model weight file for FastTextQualityFilter. How can I download it?

ryantwolf commented 4 days ago

Hello! We don't provide a pretrained model for you to use, but we do demonstrate how to train your own. All you need is a low-quality data source (like unfiltered Common Crawl snapshots) and a high-quality data source (like Wikipedia), and you can follow this example script.
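
For anyone who wants a rough idea of what training your own model involves, here is a minimal sketch, not the example script itself. It assumes the `fasttext` Python package is installed for the training step, and the label names `__label__hq` / `__label__lq`, the helper function names, and the hyperparameters are all placeholders chosen for illustration. The data-preparation part uses only the standard library.

```python
import random
from pathlib import Path

# Hypothetical label names; fastText only requires the "__label__" prefix.
HQ_LABEL = "__label__hq"
LQ_LABEL = "__label__lq"


def to_fasttext_line(label: str, text: str) -> str:
    """Format one document as a fastText supervised-training line."""
    # fastText reads one example per line, so collapse newlines and extra spaces.
    return f"{label} {' '.join(text.split())}"


def build_training_file(high_quality, low_quality, out_path, seed=0):
    """Write a shuffled, labeled training file and return the number of lines."""
    lines = [to_fasttext_line(HQ_LABEL, t) for t in high_quality]
    lines += [to_fasttext_line(LQ_LABEL, t) for t in low_quality]
    random.Random(seed).shuffle(lines)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def train_quality_classifier(train_path, model_path):
    """Train and save a fastText classifier (requires `pip install fasttext`)."""
    import fasttext  # imported lazily so the data-prep helpers work without it

    # Illustrative hyperparameters only; tune them for your own data.
    model = fasttext.train_supervised(input=str(train_path), epoch=5, wordNgrams=2)
    model.save_model(str(model_path))
    return model
```

In practice you would call `build_training_file` with documents sampled from your high-quality source (e.g. Wikipedia) and your low-quality source (e.g. unfiltered Common Crawl), then `train_quality_classifier("train.txt", "quality.bin")`. The saved `.bin` file is the model file the filter needs; check the NeMo Curator documentation for the exact argument and label names it expects.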