Harry Potter GPT-2 fanfic generator
Generate your own Harry Potter fanfiction with a pre-trained GPT-2 generative text model using Hugging Face's transformers library.
This project has two parts: a scraper and a text generation model. The scraper fetches stories from fanfiction.net and creates text files ready for training.
The text generation bit lets you generate new fanfiction. We have pre-trained a model using the ~100 most popular HP fanfiction stories, but you can scrape a different set of stories and train your own model.
Faced with a pandemic of new wizarding disease, there wasn't much to do. The Ministry was struggling and the public had been sickened by Harry's expulsion from Hogwarts on his eleventh birthday – all things considered."Well," Dumbledore said finally as he looked at Pansy Parkinson in confusion once more; "I...
Faced with a pandemic of Death Eaters, it was hard not to be surprised that the Department for Regulation and Control (DCR) had been heavily infiltrated. Harry knew from his research on magical Britain's Ministry over its political alignment during World War Two how much influence there really were in wizarding society...
Hufflepuff lost by a hundred points to Slytherin, and the Quidditch team's seeker, Cedric Diggory took down his dragon. The game resumed soon after that as Harry made an appearance in front of everyone."Well I guess it was well deserved," Draco commented. "I don't think...
Hufflepuff won the House Cup by a landslide, and I don't think that's fair!" He said. "I know you can see it in your eyes! But what about him? You're letting him win because he was hurt?"A few of his friends looked alarmed at this statement."We should make sure...
A computer with Git, Python 3 and pip installed.
First, clone the repo:
$ git clone git@github.com:ceostroff/harry-potter-gpt2-fanfiction.git
Now, navigate to the folder and install the dependencies (we recommend setting up a virtualenv):
$ pip3 install -r requirements.txt
You should have everything you need to run the scraper and the model.
To run the text generation script, first make it executable:
$ chmod +x text_generation.sh
Now you can generate text using the default settings and the pre-trained model:
$ ./text_generation.sh
The script will ask you for a prompt: this is the initial text the model will use to generate a story. If you edit the file, you can adjust the temperature (the 'randomness' of the generated text, 0.7 by default), the length, and the number of stories (by default, 10 blobs of 60 words each). See the original file with all the options here.
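To make the temperature setting more concrete, here is a minimal sketch of how temperature typically reshapes a model's next-token probabilities before sampling. The scores are made-up illustrative values, not output from the pre-trained model, and `softmax_with_temperature` is a hypothetical helper name rather than something defined in this repo.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
example_logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(example_logits, temperature=0.7)
flat = softmax_with_temperature(example_logits, temperature=1.5)
```

A lower temperature concentrates probability on the top-scoring token, so the output is more predictable. A higher temperature flattens the distribution, so the text is more varied and more likely to wander.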
The scripts that collect the text of fanfiction stories are in the scrapers folder. The first scraper, fetchData.py, gets the links to the first 1,000 of the highest-rated Harry Potter fanfiction stories and writes them to a CSV file called HPSummary.csv, along with other data about each story.
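As a rough illustration of the kind of file the first scraper produces, the sketch below writes story links and a few metadata fields to CSV with Python's csv module, then reads the links back. The column names (`story_id`, `title`, `link`) and the sample row are assumptions for illustration; the actual columns in HPSummary.csv may differ.

```python
import csv
import io

# Hypothetical column names; the real HPSummary.csv may use different ones.
FIELDNAMES = ["story_id", "title", "link"]


def write_summary(rows, handle):
    """Write one CSV row per story, with a header line first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def read_links(handle):
    """Return the link column from a summary CSV."""
    return [row["link"] for row in csv.DictReader(handle)]


# Placeholder row, not real scraped data.
sample_rows = [
    {
        "story_id": "123",
        "title": "Example Story",
        "link": "https://www.fanfiction.net/s/123/1/",
    },
]

# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO()
write_summary(sample_rows, buffer)
buffer.seek(0)
links = read_links(buffer)
```

Reading the links back by column name, as `read_links` does, keeps a later step independent of the column order in the CSV.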
The script fetchContent.py reads each story link from the CSV and writes the story's contents to a local text file in the data folder. The files are named by story ID. The script sortData.py then combines those files, separated by the required delimiter, into two files in one folder: a larger training.txt and a smaller test.txt.
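The combining step can be sketched roughly as follows. This is not the code in sortData.py: the `<|endoftext|>` delimiter (GPT-2's end-of-text token), the 90/10 split, and the function names are all assumptions chosen for illustration.

```python
DELIMITER = "<|endoftext|>"  # Assumed delimiter: GPT-2's end-of-text token.


def split_stories(stories, train_fraction=0.9):
    """Split story texts into a larger training set and a smaller test set."""
    cutoff = int(len(stories) * train_fraction)
    return stories[:cutoff], stories[cutoff:]


def join_stories(stories):
    """Concatenate stories, placing the delimiter on its own line after each."""
    return "".join(story.strip() + "\n" + DELIMITER + "\n" for story in stories)


# Placeholder texts standing in for the per-story files in the data folder.
stories = [f"Story number {i}." for i in range(10)]

train_stories, test_stories = split_stories(stories)
training_text = join_stories(train_stories)
test_text = join_stories(test_stories)
```

Writing `training_text` and `test_text` to training.txt and test.txt would then give the two files described above. A delimiter between stories lets the model learn where one story ends and the next begins.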
You can train the model on your own computer, but if you don't own a desktop with a dedicated graphics card and loads of memory, it will take a long time. That's why we opted to train the model on an Ubuntu AWS instance with special machine-learning hardware.
Choose an instance type like g4dn.4xlarge (priced at ~$1 per hour). With those specs, you can expect training a model on ~100 MB of data to take about 7 hours. Disk space can fill up quickly, so choose at least 64 GB. When the instance is running, ssh into it and run the following commands to train the model:
$ git clone git@github.com:ceostroff/harry-potter-gpt2-fanfiction.git
$ source env/bin/activate
$ sudo apt-get update && sudo apt-get install python3-pip
Then install the requirements:
$ pip3 install -r requirements.txt
Install CUDA. Follow the instructions, and make sure to add CUDA to the PATH. Reboot the instance.
Now you can train the model. Navigate to this folder again and run train_model.sh with:
$ ./train_model.sh
It is important to set --per_device_train_batch_size=1 and, likewise, --per_device_eval_batch_size=1. If you increase either value, there's a very high chance that training will run out of memory.
To continue training from a saved checkpoint, edit train_model.sh and change --output_dir to the path of the last model checkpoint, as in the excerpt below. After that, run train_model.sh again.
python run_clm.py \
--output_dir model/checkpoint-1000 \
...
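Picking the last checkpoint by eye is error-prone when there are many of them. Here is a small sketch for finding it, assuming checkpoints are saved in folders named checkpoint-<step> as in the excerpt above. `latest_checkpoint` is a hypothetical helper, not part of the repo.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(model_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    candidates = []
    for path in Path(model_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare step numbers as integers so checkpoint-1000 beats checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]


# Demonstrate on a throwaway folder with made-up checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    newest_name = latest_checkpoint(tmp).name
```

Sorting the folder names as plain strings would rank checkpoint-500 above checkpoint-1500, which is why the step number is parsed as an integer first.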
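To copy the trained model files from the AWS instance into the current local directory, run:
$ rsync ubuntu@YOUR_AWS_IP:~/harry-potter-gpt2-fanfiction/model/* .
Then generate text as before:
$ ./text_generation.sh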
To activate the virtual environment:
$ source env/bin/activate
To update the requirements:
$ pip freeze > requirements.txt
To install the dependencies:
$ pip install -r requirements.txt