AbrahamSanders / SIMIE

SIMIE - A SImulated MInd's Eye is an experiment in narrative extrapolation from dialog using GPT-2.
MIT License

[Dev] Corpus Builder (first iteration) #1

Closed AbrahamSanders closed 3 years ago

AbrahamSanders commented 3 years ago

Feature: Corpus Builder (first iteration)

Corpus Builder is a tool that consumes a dialog dataset built with gutenberg-dialog and identifies the narrative segments in the source books that align with each dialog. It outputs a dialog-narrative aligned corpus used for training the model.
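To make the alignment idea concrete, here is a minimal sketch, not the actual implementation. It assumes the source book is already split into paragraphs and that each utterance appears verbatim inside a paragraph; the function name `find_surrounding_narratives` and the exact-substring matching are hypothetical choices made only for illustration.

```python
from typing import List, Optional, Tuple


def find_surrounding_narratives(
    paragraphs: List[str], utterances: List[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (narrative_before, narrative_after) for one dialog.

    Assumption: utterances occur verbatim in the book paragraphs.
    Real book text may need normalization before matching.
    """
    if not utterances:
        return None, None

    # Index of the first paragraph containing the opening utterance.
    first = next(
        (i for i, p in enumerate(paragraphs) if utterances[0] in p), None
    )
    if first is None:
        return None, None

    # Index of the paragraph containing the closing utterance,
    # searched from the opening paragraph onward.
    last = next(
        (i for i in range(first, len(paragraphs)) if utterances[-1] in paragraphs[i]),
        None,
    )
    if last is None:
        return None, None

    # The paragraphs immediately outside the dialog are taken as narratives.
    before = paragraphs[first - 1] if first > 0 else None
    after = paragraphs[last + 1] if last + 1 < len(paragraphs) else None
    return before, after
```

If either end of the dialog cannot be located, the sketch returns `(None, None)` rather than guessing.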

Input: dialogs.txt output from gutenberg-dialog.

Example dialog sequence from dialogs.txt:

93.txt: ...

93.txt: Look!--what's that?
93.txt: Don't! Don't take a person by surprise that way. I'm 'most ready to die, anyway, without you doing that.
93.txt: Look, I tell you. It's something coming out of the sycamores.
93.txt: Don't, Tom!
93.txt: It's terrible tall!
93.txt: Oh, lordy-lordy! let's--
93.txt: Keep still--it's a-coming this way.

93.txt: ...

Note there is a blank line before and after the dialog sequence. Blank lines separate different conversations.
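A rough sketch of reading this input format, assuming every non-blank line has the form `<book file>: <utterance>` and that blank lines separate conversations as described above. The name `parse_dialogs` is hypothetical and is not part of gutenberg-dialog.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (book file name, utterance text)


def parse_dialogs(text: str) -> List[List[Utterance]]:
    """Group the lines of a dialogs.txt-style string into conversations."""
    conversations: List[List[Utterance]] = []
    current: List[Utterance] = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the conversation in progress, if any.
            if current:
                conversations.append(current)
                current = []
            continue

        # Split on the first ": " only, so colons inside the utterance survive.
        book, sep, utterance = line.partition(": ")
        if not sep:
            continue  # Skip lines that lack the "<book>: " prefix.
        current.append((book, utterance))

    # Flush the last conversation when the text does not end with a blank line.
    if current:
        conversations.append(current)
    return conversations
```

Usage would look like `parse_dialogs(open("dialogs.txt", encoding="utf-8").read())`, giving one list of `(book, utterance)` pairs per conversation.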

Output: a new file, dialogs_narratives.txt, containing each dialog together with its surrounding narratives. Each line carries a new indicator: [D] means the line is a dialog and [N] means the line is a narrative.

Example dialog-narrative aligned sequence in dialogs_narratives.txt:

93.txt: ...

93.txt [N]: We laid down, kind of weak and sick, and listened for more sounds, but didn't hear none for a good while but just our hearts. We was thinking of that awful thing laying yonder in the sycamores, and it seemed like being that close to a ghost, and it give me the cold shudders. The moon come a-swelling up out of the ground, now, powerful big and round and bright, behind a comb of trees, like a face looking through prison bars, and the black shadders and white places begun to creep around, and it was miserable quiet and still and night-breezy and graveyardy and scary. All of a sudden Tom whispers:
93.txt [D]: Look!--what's that?
93.txt [D]: Don't! Don't take a person by surprise that way. I'm 'most ready to die, anyway, without you doing that.
93.txt [D]: Look, I tell you. It's something coming out of the sycamores.
93.txt [D]: Don't, Tom!
93.txt [D]: It's terrible tall!
93.txt [D]: Oh, lordy-lordy! let's--
93.txt [D]: Keep still--it's a-coming this way.
93.txt [N]: He was so excited he could hardly get breath enough to whisper. I had to look. I couldn't help it. So now we was both on our knees with our chins on a fence rail and gazing--yes, and gasping too. It was coming down the road--coming in the shadder of the trees, and you couldn't see it good; not till it was pretty close to us; then it stepped into a bright splotch of moonlight and we sunk right down in our tracks--it was Jake Dunlap's ghost! That was what we said to ourselves.

93.txt: ...
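For reference, a small sketch of writing and reading lines in this output format, assuming each line looks like `<book file> [D]: <text>` or `<book file> [N]: <text>` as in the example above. The helper names `format_aligned` and `parse_tagged_line` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Matches "<book> [D]: <text>" or "<book> [N]: <text>".
LINE_RE = re.compile(r"^(?P<book>\S+) \[(?P<tag>[DN])\]: (?P<text>.*)$")


def format_aligned(
    book: str,
    narrative_before: Optional[str],
    utterances: List[str],
    narrative_after: Optional[str],
) -> List[str]:
    """Build the output lines for one dialog and its surrounding narratives."""
    lines: List[str] = []
    if narrative_before:
        lines.append(f"{book} [N]: {narrative_before}")
    lines.extend(f"{book} [D]: {u}" for u in utterances)
    if narrative_after:
        lines.append(f"{book} [N]: {narrative_after}")
    return lines


def parse_tagged_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (book, tag, text), or None if the line carries no [D]/[N] tag."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match["book"], match["tag"], match["text"]
```

A training script could then use `parse_tagged_line` to tell dialog lines from narrative lines when reading dialogs_narratives.txt.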

AbrahamSanders commented 3 years ago

Implemented this in a fork of gutenberg-dialog: https://github.com/AbrahamSanders/gutenberg-dialog

See the commit