delphi-suite / delphi

small language models training made easy
Apache License 2.0

fix dataset download for its tokenization #105

Closed joshuawe closed 5 months ago

joshuawe commented 6 months ago

The script for tokenizing datasets from Hugging Face currently uses a function that downloads the stories dataset from the 'delphi-suite' namespace. It downloads only one split (validation) and uploads it as the 'train' split.
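A rough sketch of the behaviour I would expect instead: every split is tokenized and stays under its own name. This is not the actual script. `tokenize_splits`, `raw_splits` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for the real one.

```python
from typing import Callable, Dict, Iterable, List


def tokenize_splits(
    splits: Dict[str, Iterable[str]],
    tokenize: Callable[[str], List[int]],
) -> Dict[str, List[List[int]]]:
    """Tokenize every split and keep each one under its original split name."""
    return {
        name: [tokenize(text) for text in texts]
        for name, texts in splits.items()
    }


# Toy stand-ins (assumptions for illustration, not the real dataset or tokenizer).
raw_splits = {
    "train": ["once upon a time", "the end"],
    "validation": ["a short story"],
}


def toy_tokenize(text: str) -> List[int]:
    # Placeholder "tokenizer": one integer (the word length) per word.
    return [len(word) for word in text.split()]


tokenized = tokenize_splits(raw_splits, toy_tokenize)
print(sorted(tokenized))  # split names are preserved, nothing is renamed
```

With the Hugging Face `datasets` library, calling `load_dataset` without a `split` argument returns a `DatasetDict` keyed by split name, so the same per-split loop should carry over to the real data.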

@siwei-li I will ask you to review this once I am done.