Suggested corpus: Adult stories

@johnflux I would look at the Pile paper, page 22, under excluded datasets: https://arxiv.org/abs/2101.00027 https://arxiv.org/pdf/2101.00027.pdf

One of your data sources is named there explicitly as excluded, and the other probably falls under the same rationale. Their reasons for excluding these were quite different from the ones I would have used had it been my choice (my rationale is x in, x out, where x = {copyright infringement, NSFW content}), but they give a more scientific rationale that you can read there.

EleutherAI / the-pile

Suggested corpus: Adult stories #107