OFA-CN pre-train dataset

OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Apache License 2.0

OFA-CN pre-train dataset #351

Closed by jingshuangliu22 1 year ago

jingshuangliu22 commented 1 year ago

Dear Author, I am trying to pre-train the Chinese OFA model from scratch. Could you please tell me what dataset you used to pre-train the OFA-CN model? Is the Chinese dataset available? Looking forward to your answer. Best wishes.

JustinLin610 commented 1 year ago

We used a multimodal dataset (mostly publicly available) very similar to the one in our recent work, Chinese CLIP https://arxiv.org/abs/2211.01335. Use it for reference. For the plain text, we used the internal dataset from M6 https://arxiv.org/abs/2103.00823, so I advise you to use other plain-text datasets of similar size as alternatives.