EleutherAI / the-pile

MIT License

Suggested corpus: Adult stories #107

Open johnflux opened 1 year ago

johnflux commented 1 year ago

I have a corpus of ~10GB of adult stories, in English, in plain text, taken primarily from asstr.org and literotica. I think it would be interesting to incorporate these into the training set as well.

dboggs95 commented 1 year ago

@johnflux I would look in the Pile paper, page 22, excluded datasets. https://arxiv.org/abs/2101.00027 https://arxiv.org/pdf/2101.00027.pdf

One of your data sources is directly named and excluded there, and the other one probably follows the same rationale. Their reasons for excluding these were quite different from the reasons I would have excluded them for, were it my choice (my rationale is x in, x out -> where x = {copyright infringement, nsfw content}), but they had a more scientific rationale you can read there.