The script for tokenizing datasets from Hugging Face currently uses a function that downloads the 'stories' dataset from the 'delphi-suite' namespace. It only downloads one split (validation) and uploads it as the 'train' split.
[ ] Use the Hugging Face native dataset download function instead of the 'delphi-suite/stories'-specific data downloader (see the sketch below)
[ ] Download all dataset splits, tokenize each split, upload each tokenized split
[ ] Optional: save tokenized dataset locally
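A minimal sketch of the intended flow, using the `datasets` library's native `load_dataset` / `push_to_hub` calls. The tokenizer name, output repo, and the assumption that the raw dataset has a `"text"` column are placeholders, not final choices:

```python
# Sketch only: dataset/tokenizer names and the "text" column are assumptions.
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer

INPUT_DATASET = "delphi-suite/stories"                # any HF dataset repo
OUTPUT_DATASET = "delphi-suite/stories-tokenized"     # placeholder output repo
TOKENIZER_NAME = "delphi-suite/stories-tokenizer"     # placeholder tokenizer

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

# Calling load_dataset without a `split` argument returns a DatasetDict
# containing every split, instead of just the validation split.
raw_splits = load_dataset(INPUT_DATASET)

tokenized_splits = DatasetDict()
for split_name, split in raw_splits.items():
    # Tokenize each split independently so the original split names are preserved.
    tokenized_splits[split_name] = split.map(
        lambda batch: tokenizer(batch["text"]),
        batched=True,
        remove_columns=split.column_names,
    )

# Optional: save the tokenized dataset locally before uploading.
tokenized_splits.save_to_disk("./tokenized_dataset")

# Push every tokenized split to the Hub under its original split name.
tokenized_splits.push_to_hub(OUTPUT_DATASET)
```

Loading the whole `DatasetDict` and iterating over its splits avoids hard-coding the validation-to-train remapping the current script does.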
@siwei-li I would ask you to review this when I am done.