Pile Dataset

OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Apache License 2.0

Pile Dataset #355

Open mshukor opened 1 year ago

mshukor commented 1 year ago

Hello,

Can you share your filtered Pile dataset (180 GB)? The paper mentions only truncation as preprocessing. Can you provide more details about your filtering step? Also, did you use specific subsets of the Pile (Pile-CC, Wikipedia, ArXiv, ...)?

Thanks in advance

JustinLin610 commented 1 year ago

We are unable to release the processed data. You can try downloading the Pile from its official website.