AbrahamSanders / SIMIE

SIMIE - A SImulated MInd's Eye is an experiment in narrative extrapolation from dialog using GPT-2.
MIT License

[Dev] Support downloading and parsing books from smashwords.com #3

Open AbrahamSanders opened 3 years ago

AbrahamSanders commented 3 years ago

smashwords.com was the source of the original BookCorpus dataset, built for the 2015 paper "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".

We should support Smashwords as an alternate source of books, since it can provide more modern works than those in Project Gutenberg. Dialogs and narratives written in a modern style are necessary to train a model that works well with the way people speak and write today.

This likely involves:

  1. Implementing a downloader for smashwords.com
  2. Implementing an adapter, if necessary, to format the downloaded texts in the way that gutenberg-dialog's pipeline expects.

https://github.com/soskek/bookcorpus may be a good starting point, as it implements a crawler for smashwords.
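
As a rough illustration of the adapter in step 2, here is a minimal sketch. It assumes the downloaded books are already plain UTF-8 text files and that the main cleanup needed is normalizing line endings, converting typographic (curly) quotes to straight ones, and collapsing runs of blank lines. What gutenberg-dialog's pipeline actually expects still needs to be checked; the function names `normalize_book_text` and `adapt_book` and the output layout are hypothetical, not part of any existing code.

```python
import re
from pathlib import Path

# Assumption: quote-based dialog extraction is easier if curly quotes are
# mapped to their straight ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
})


def normalize_book_text(raw: str) -> str:
    """Return a cleaned copy of a downloaded book's text (hypothetical helper)."""
    # Unify Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Replace curly quotes with straight quotes.
    text = text.translate(_QUOTE_MAP)
    # Drop trailing whitespace on every line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def adapt_book(src: Path, dest_dir: Path) -> Path:
    """Normalize one downloaded file and write it to dest_dir as <stem>.txt."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    raw = src.read_text(encoding="utf-8", errors="replace")
    out = dest_dir / (src.stem + ".txt")
    out.write_text(normalize_book_text(raw), encoding="utf-8")
    return out
```

Keeping the text cleanup in a pure function separate from the file handling would make it easy to unit test and to reuse if the downloader ends up producing text in memory rather than on disk.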