Thanks for this interesting project! I have two questions:
Could you please share a link to the pre-training dataset (i.e., the wiki data)?
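Your paper states that the wiki data contains 1.99 million sentences and is about 12.6GB, but the original wiki data we found contains 90 million sentences and is only about 23GB. That works out to roughly 6KB per sentence for the paper's figures versus roughly 0.25KB per sentence for ours, so the numbers look inconsistent. Am I misunderstanding something?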