Corpus Builder is a tool that consumes a dialog dataset built with gutenberg-dialog and identifies the narrative segments in the source books that align with each dialog. It outputs a dialog-narrative aligned corpus used for training the model.
Input: dialogs.txt, the output of gutenberg-dialog.
Example dialog sequence from dialogs.txt:
93.txt: ...
93.txt: Look!--what's that?
93.txt: Don't! Don't take a person by surprise that way. I'm 'most ready to die, anyway, without you doing that.
93.txt: Look, I tell you. It's something coming out of the sycamores.
93.txt: Don't, Tom!
93.txt: It's terrible tall!
93.txt: Oh, lordy-lordy! let's--
93.txt: Keep still--it's a-coming this way.
93.txt: ...
Note there is a blank line before and after the dialog sequence. Blank lines separate different conversations.
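The input format described above can be read with a small parser. The following is a minimal sketch, not the tool's actual implementation: the function name parse_dialogs and the Utterance type are hypothetical, and it assumes every non-blank line has the form "<book file>: <utterance>" as in the example.

```python
from typing import List, Tuple

# Hypothetical type for illustration: (book file name, utterance text).
Utterance = Tuple[str, str]


def parse_dialogs(text: str) -> List[List[Utterance]]:
    """Split the content of dialogs.txt into conversations.

    A blank line ends the current conversation; consecutive blank
    lines do not produce empty conversations.
    """
    conversations: List[List[Utterance]] = []
    current: List[Utterance] = []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the conversation collected so far.
            if current:
                conversations.append(current)
                current = []
            continue
        # Split on the first ": " only, so colons inside the
        # utterance itself are preserved.
        book, sep, utterance = line.partition(": ")
        if not sep:
            # Assumption: a line without the prefix is kept as text
            # with an empty book name rather than dropped.
            book, utterance = "", line
        current.append((book, utterance.strip()))
    if current:
        conversations.append(current)
    return conversations
```

Called on the whole file content, for example parse_dialogs(open("dialogs.txt", encoding="utf-8").read()), this returns one list of (book, utterance) pairs per conversation.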
Output: a new file, dialogs_narratives.txt, containing the narrative surrounding each dialog. Each line carries an indicator: [D] marks a dialog line and [N] marks a narrative line.
Example dialog-narrative aligned sequence in dialogs_narratives.txt:
93.txt: ...
93.txt [N]: We laid down, kind of weak and sick, and listened for more sounds, but didn't hear none for a good while but just our hearts. We was thinking of that awful thing laying yonder in the sycamores, and it seemed like being that close to a ghost, and it give me the cold shudders. The moon come a-swelling up out of the ground, now, powerful big and round and bright, behind a comb of trees, like a face looking through prison bars, and the black shadders and white places begun to creep around, and it was miserable quiet and still and night-breezy and graveyardy and scary. All of a sudden Tom whispers:
93.txt [D]: Look!--what's that?
93.txt [D]: Don't! Don't take a person by surprise that way. I'm 'most ready to die, anyway, without you doing that.
93.txt [D]: Look, I tell you. It's something coming out of the sycamores.
93.txt [D]: Don't, Tom!
93.txt [D]: It's terrible tall!
93.txt [D]: Oh, lordy-lordy! let's--
93.txt [D]: Keep still--it's a-coming this way.
93.txt [N]: He was so excited he could hardly get breath enough to whisper. I had to look. I couldn't help it. So now we was both on our knees with our chins on a fence rail and gazing--yes, and gasping too. It was coming down the road--coming in the shadder of the trees, and you couldn't see it good; not till it was pretty close to us; then it stepped into a bright splotch of moonlight and we sunk right down in our tracks--it was Jake Dunlap's ghost! That was what we said to ourselves.
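The tagged line format shown in this example can be written and read back with a pair of small helpers. This is a sketch under stated assumptions, not the tool's actual code: the names AlignedLine, format_aligned_line and parse_aligned_line are hypothetical, and the pattern assumes the book file name contains no spaces, as in "93.txt".

```python
import re
from typing import NamedTuple, Optional

# Matches "<book> [D]: <text>" or "<book> [N]: <text>".
# Assumption: the book file name contains no whitespace.
LINE_RE = re.compile(r"^(?P<book>\S+) \[(?P<tag>[DN])\]: (?P<text>.*)$")


class AlignedLine(NamedTuple):
    book: str  # source book file, e.g. "93.txt"
    tag: str   # "D" for dialog, "N" for narrative
    text: str  # the utterance or narrative passage


def format_aligned_line(book: str, tag: str, text: str) -> str:
    """Render one line in the dialogs_narratives.txt format."""
    if tag not in ("D", "N"):
        raise ValueError(f"unknown tag: {tag!r}")
    return f"{book} [{tag}]: {text}"


def parse_aligned_line(line: str) -> Optional[AlignedLine]:
    """Parse one tagged line; return None for untagged or blank lines."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return AlignedLine(match["book"], match["tag"], match["text"])
```

Returning None for lines that carry no [D] or [N] indicator lets a reader skip blank separators and any untagged lines without raising.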
Feature: Corpus Builder (first iteration)