NaNoGenMo / 2017

National Novel Generation Month, 2017 edition.
https://nanogenmo.github.io

Pride, Prejudice by @hugovk #130

Open hugovk opened 6 years ago

hugovk commented 6 years ago

Pride, Prejudice

Generated output

What it does

The problem isn't generating over 50,000 words. The problem is existing books are too long. Pride and Prejudice is 130,000 words, Moby Dick is 215,136 words (or 215,136 meows). And we all know 50,000 is the gold standard for a novel! So how can we reduce the word count?

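A first pass of text-level tactics (the steps named in the example run below, from deboilerplatify to dehonorify) reduces Pride and Prejudice by about 15% to roughly 111,000 words.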
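As an illustration of one such tactic, here is a minimal sketch of an honorific-removal step. It is not the code from reducifier.py: the function names, the regular expression, and the sample sentence are all made up for this example, and the honorific list is taken from the one quoted later in this thread (Mr., Mrs., Miss, Dr.).

```python
import re

# Assumption: only the honorifics quoted later in this thread are handled.
# The abbreviated forms need a trailing dot; "Miss" does not.
HONORIFICS = re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+|\bMiss\s+")


def dehonorify_sketch(text):
    """Drop honorifics so that 'Mr. Bennet' becomes 'Bennet' (hypothetical helper)."""
    return HONORIFICS.sub("", text)


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


# Made-up sentence, not a quotation from the novel.
sample = "Mr. Bennet replied that Mrs. Bennet and Miss Bingley had met Dr. Smith."
shorter = dehonorify_sketch(sample)

print(shorter)
print(word_count(sample), "->", word_count(shorter))
```

Each removed honorific saves exactly one word, which is why this kind of step can shave thousands of words off a novel full of "Mr." and "Mrs.".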

Next we work out the ratio of the words we have to 50k, count how many sentences we have, and work out how many sentences to keep to approach 50k. Then we use a text summariser to chop out the dead wood.
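The arithmetic behind that step can be sketched as follows. This is an assumption-laden reconstruction, not the actual script: the function name is hypothetical, and rounding the ratio up to a whole number is a guess that happens to reproduce the figures in the example run (ratio 3, 4588 sentences, 1529 to keep). The summariser itself is left out.

```python
import math

TARGET_WORDS = 50_000  # the "gold standard" novel length from the text


def sentences_to_keep(words, sentences, target=TARGET_WORDS):
    """Return (ratio, number of sentences to keep).

    Assumption: the ratio is rounded up to a whole number, and the
    sentence budget is the sentence count divided by that ratio.
    """
    ratio = math.ceil(words / target)
    keep = sentences // ratio
    return ratio, keep


# Figures taken from the example run: 111,633 words and 4588 sentences.
ratio, keep = sentences_to_keep(111_633, 4588)
print("Ratio (words/50k):", ratio)
print("Number to keep:", keep)
```

Because the ratio is a whole number here, the summarised text will not land exactly on 50,000 words; the example run ends at 54,273.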

How to do it

Run:

pip install -r requirements.txt

python reducifier.py

Example:

python reducifier.py
open
word count: 130,000
word count: 126,936 diff: 97.643%   deboilerplatify
word count: 125,438 diff: 96.491%   remove_quote_things
word count: 121,549 diff: 93.499%   deveryify
word count: 121,018 diff: 93.091%   decontractify
word count: 111,633 diff: 85.872%   dehonorify
Ratio (words/50k):   3
Number of sentences:     4588
Number to keep:      1529
word count: 54,273  diff: 41.748%   summarise

This produces output.txt before the summariser, and output2.txt after the summariser.
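The "diff" column in the run above appears to be each step's word count as a percentage of the original 130,000 words. A minimal sketch of that bookkeeping, with hypothetical function names and an approximation of the log format (not copied from reducifier.py):

```python
def percent_of_original(words, original_words):
    """Word count after a step, as a percentage of the starting word count."""
    return 100 * words / original_words


def report(label, words, original_words):
    """Format one log line in roughly the style of the example run."""
    percent = percent_of_original(words, original_words)
    return f"word count: {words:,} diff: {percent:.3f}%   {label}"


ORIGINAL = 130_000  # starting word count from the example run

# Word counts taken from the example run above.
for label, words in [("deboilerplatify", 126_936), ("dehonorify", 111_633), ("summarise", 54_273)]:
    print(report(label, words, ORIGINAL))
```

Read this way, the final 41.748% means the summarised text keeps a little over two fifths of the original words.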

Works at least on macOS High Sierra with Python 3.6.3.

Example

Here's a diff of Pride and Prejudice and the first pass output.txt:

'tis a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
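The line above shows "It is" contracted to "'tis", and a later comment in this thread quotes "t'north" for "the north". A minimal sketch of how such word-saving contractions could be applied is below. The two rules and the function name are assumptions made for illustration; the rules in the linked source may well differ.

```python
import re

# Hypothetical contraction rules, inferred only from the two examples in this thread.
RULES = [
    (re.compile(r"\bit is\b", re.IGNORECASE), "'tis"),  # "It is" -> "'tis"
    (re.compile(r"\bthe\s+"), "t'"),                    # "the north" -> "t'north"
]


def contract_sketch(text):
    """Apply each rule in turn; every substitution removes one word."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


before = "It is a truth universally acknowledged"
after = contract_sketch(before)

print(after)
print(len(before.split()), "->", len(after.split()))
print(contract_sketch("from the north of England"))
```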

Source code

https://github.com/hugovk/NaNoGenMo-2017/tree/master/03-reducifier

janelleshane commented 6 years ago

Ha, this is great! 60% reduced Pride and Prejudice is still totally readable.

Too bad the summarizer took out all the damns.

henrikh commented 6 years ago

"Remove honorifics (Mr., Mrs., Miss, Dr.)" 😱 How can I then tell the "Bennet"s apart?!

alexyuriev commented 6 years ago

@janelleshane Cliff-notes are also readable.

sandes commented 6 years ago

Great

danesparza commented 6 years ago

@henrikh Agreed -- lines like this become ... odd.

:grimacing:

hugovk commented 6 years ago

@henrikh @danesparza Yep, I did realise that but unfortunately they just had to go to reduce the word count :) I should have replaced "Mrs. Bennet" with her maiden name, "Gardiner"!

bryanrasmussen commented 6 years ago

Sometimes you will see major characters referred to by a shortened version of their name after they are introduced. I would suggest calling Mrs. Bennet Mrs. B and Mr. Bennet Mr. B. That doesn't remove the honorifics or reduce the word count, but it does reduce the character count.

bryanrasmussen commented 6 years ago

Actually considering the patriarchy Mr. B can just be B.

on edit: Ms can be used in place of Mrs. in modern times of course.

hugovk commented 6 years ago

@bryanrasmussen Word count is all that matters :)

bryanrasmussen commented 6 years ago

Not if your last name is Hugo, and your first Victor!

On Tue, Dec 5, 2017 at 9:22 AM, Hugo wrote:

@bryanrasmussen Word count is all that matters :)


hugovk commented 6 years ago

PS. Using the 't' contraction instead of 'the' makes this really hard to parse.

Only in some cases.

"...by a young man of large fortune from t'north of England;"[1]

This is just about the perfect edit.

[1] https://github.com/hugovk/NaNoGenMo-2017/blob/master/03-reducifier/output.txt#L35

:)

See https://news.ycombinator.com/item?id=15823499 for more discussion.

philsnow commented 6 years ago

@henrikh you'd have to make do with context, I suppose, but that's not all that different from the base text, because only the eldest daughter is addressed by her surname alone ("Miss Bennet"), whereas the younger daughters are addressed with either their first or full names ("Miss Elizabeth" / "Miss Elizabeth Bennet"). I haven't read Pride and Prejudice in a while; are there any examples where the reader must discern identity (among Bennets or any other family) from context?

henrikh commented 6 years ago

@philsnow As far as I recall, Elizabeth is actually referred to as "Miss Bennet" when addressed directly by Mr Darcy and Mr Wickham -- but, of course, in those situations there would be no doubt :wink: