dreasysnail / POINTER

MIT License

About pre-training wiki dataset #20

Open AaronHeee opened 3 years ago

AaronHeee commented 3 years ago

Thanks for this interesting project! I am wondering:

Could you please share a link to the pre-training dataset (i.e., the wiki data)?
Your paper mentions that the wiki data contains 1.99 million sentences and is about 12.6GB, but we found that the original wiki data (90 million sentences) is about 23GB. These figures look inconsistent; am I misunderstanding something?
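
To make the mismatch concrete, here is a rough back-of-the-envelope check of the average size per sentence implied by each pair of figures. It assumes 1 GB = 10^9 bytes and that both sizes refer to raw text; the helper name `bytes_per_sentence` is only for illustration.

```python
# Average bytes per sentence implied by the figures quoted above.
# Assumption: 1 GB = 10**9 bytes, and both sizes measure raw text.
GB = 10**9


def bytes_per_sentence(size_gb, n_sentences):
    """Return the average number of bytes per sentence."""
    return size_gb * GB / n_sentences


paper = bytes_per_sentence(12.6, 1.99e6)  # figures reported in the paper
raw = bytes_per_sentence(23, 90e6)        # figures we measured on the original wiki data

print(f"paper: {paper:.0f} bytes/sentence")
print(f"raw:   {raw:.0f} bytes/sentence")
print(f"ratio: {paper / raw:.1f}x")
```

Under these assumptions the paper's figures imply about 6.3 KB per sentence, versus roughly 256 bytes per sentence in the original data, a factor of about 25. That is why I suspect either the sentence count or the size refers to something other than raw sentences.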