bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Dedup exact lines training tokenizer dataset #409

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

This script has been used to create training datasets for the tokenizer.
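
For readers unfamiliar with the idea, exact line deduplication can be sketched roughly as below. This is not the script from this PR, only a minimal illustration under stated assumptions: the function name `dedup_exact_lines` is hypothetical, documents are plain strings split on newlines, lines are compared after stripping surrounding whitespace, and blank lines are dropped.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_exact_lines(documents: Iterable[str]) -> Iterator[str]:
    """Yield each document with exact duplicate lines removed.

    Hypothetical helper, not taken from the PR. A line counts as a
    duplicate if the same stripped text already appeared earlier in this
    document or in any previous document.
    """
    seen = set()  # digests of every line kept so far
    for doc in documents:
        kept = []
        for line in doc.split("\n"):
            stripped = line.strip()
            if not stripped:
                # Assumption: blank lines carry no signal, so drop them.
                continue
            # Store a short digest instead of the full line to save memory.
            key = hashlib.blake2b(stripped.encode("utf-8"), digest_size=16).digest()
            if key in seen:
                continue
            seen.add(key)
            kept.append(line)
        if kept:
            # Documents left with no lines after deduplication are skipped.
            yield "\n".join(kept)


if __name__ == "__main__":
    docs = ["a\nb\na", "b\nc", "a\nb"]
    for cleaned in dedup_exact_lines(docs):
        print(repr(cleaned))
```

Keeping only digests in memory trades a negligible collision risk for a much smaller footprint on large corpora; a plain `set` of the lines themselves would work equally well for small datasets.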