Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Hello,
Could you share your filtered Pile (180 GB) dataset? The paper mentions only truncation as preprocessing; can you provide more details about your filtering step? Also, did you use specific subsets of the Pile (Pile-CC, Wiki, Arxiv...)?
Thanks in advance!