We should support Smashwords as an alternate source of books, since it can provide more modern works than those in Project Gutenberg. Dialogue and narrative written in a modern style are necessary to train a model that works well with the way people speak and write today.
Smashwords (smashwords.com) was the source of the original BookCorpus dataset, built for the 2015 paper "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".
This likely involves:
https://github.com/soskek/bookcorpus may be a good starting point, as it implements a crawler for Smashwords.
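
As a rough illustration of the link-collection step any such crawler would need, here is a minimal Python sketch that pulls book IDs out of an already-downloaded listing page. It is not taken from soskek/bookcorpus. The URL pattern `https://www.smashwords.com/books/view/<id>` and the names `BookLinkParser` and `extract_book_ids` are assumptions made for this example and should be checked against the live site. Fetching, rate limiting, and license filtering are deliberately left out.

```python
import re
from html.parser import HTMLParser

# ASSUMPTION: book pages look like https://www.smashwords.com/books/view/<numeric id>.
# Verify this against the live site before relying on it.
BOOK_LINK = re.compile(r"^https?://www\.smashwords\.com/books/view/(\d+)")


class BookLinkParser(HTMLParser):
    """Collects unique book IDs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.book_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        match = BOOK_LINK.match(href)
        if match and match.group(1) not in self.book_ids:
            self.book_ids.append(match.group(1))


def extract_book_ids(html):
    """Return book IDs in first-seen order from one page of HTML."""
    parser = BookLinkParser()
    parser.feed(html)
    return parser.book_ids
```

Working on HTML strings rather than live requests keeps the parsing logic testable offline, and lets the download layer be swapped out (or replaced by the existing crawler) independently.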