Counting in Context by @hugovk

hugovk commented 9 years ago

Almost forgot about this. The code and output were created before the deadline; the PDF was knocked up and everything uploaded afterwards.

What happens if we want to find each sequential number, in words, in a big corpus?

This is what happens.

PDF | HTML | MD

It uses the Project Gutenberg CD of 600 books, containing some 3,583,389 sentences.

It runs through twice: first with the first sentence found in the corpus (from zero to fifty-five thousand); second with the shortest matching sentence (zero to forty-eight thousand).
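A minimal sketch of the two passes, not the actual gutencounter.py: `number_to_words`, `find_sentence` and `count_in_context` are hypothetical names, the corpus is assumed to be a plain list of sentence strings already in memory, and the spelling style (no "and", hyphenated tens) is a guess. Numbers with no matching sentence are simply skipped here, which may not be what the real script does.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out 0 <= n < 1,000,000, e.g. 48000 -> 'forty-eight thousand'."""
    if not 0 <= n < 1000000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        words = ONES[hundreds] + " hundred"
    else:
        thousands, rest = divmod(n, 1000)
        words = number_to_words(thousands) + " thousand"
    return words + (" " + number_to_words(rest) if rest else "")


def find_sentence(sentences, phrase, shortest=False):
    """Return the first (or shortest) sentence containing phrase, else None."""
    # The lookarounds stop "one" matching inside "someone" or "twenty-one".
    # They do not stop "forty" matching inside "forty thousand".
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(phrase) + r"(?![\w-])", re.IGNORECASE)
    matches = (s for s in sentences if pattern.search(s))
    if shortest:
        return min(matches, key=len, default=None)
    return next(matches, None)


def count_in_context(sentences, limit, shortest=False):
    """Yield (n, sentence) for each n below limit that has a match."""
    for n in range(limit):
        sentence = find_sentence(sentences, number_to_words(n), shortest)
        if sentence is not None:
            yield n, sentence
```

Calling `count_in_context(sentences, 100)` corresponds to the first pass and `count_in_context(sentences, 100, shortest=True)` to the second. Scanning every sentence for every number is slow on 3.5 million sentences, which is presumably why the real tool has a `--cache` option.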

Made something like this:

gutencounter --cache *.txt >> gutencounter-unsorted.md
gutencounter --sort --cache *.txt >> gutencounter-sorted.md
[leave running until have enough words]
cat gutencounter-unsorted.md > gutencounter.md
cat gutencounter-sorted.md >> gutencounter.md
grep "##" gutencounter.md > contents.txt
[hack contents.txt into links]
cat gutencounter.py >> gutencounter.md
wc -w gutencounter.md
[hack front matter and contents into gutencounter.md and <pre></pre> for source]
multimarkdown gutencounter.md > gutencounter.html

Then print to PDF using Chrome. Big thanks to @moonmilk for the CSS.
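The "hack contents.txt into links" step above was done by hand, but it could be scripted roughly like this. This is only a sketch: `slugify` and `contents_links` are made-up names, the headings are assumed to look like `## forty-two`, and the anchor rule (lowercase, letters and digits only) is an assumption that should be checked against the IDs your Markdown processor actually generates.

```python
import re


def slugify(title):
    """Hypothetical anchor rule: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def contents_links(lines):
    """Turn each '## Heading' line into a Markdown list item linking to it."""
    links = []
    for line in lines:
        # Only level-2 headings at the start of a line are picked up,
        # which is stricter than a plain grep for "##".
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            title = match.group(1)
            links.append("* [{}](#{})".format(title, slugify(title)))
    return links
```

Feeding it the lines of gutencounter.md and joining the result with newlines would give a contents block to paste in after the front matter.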

Source: https://github.com/hugovk/gutengrep/blob/gh-pages/gutencounter.py

MichaelPaulukonis commented 9 years ago

I would like to see the numerical sentences closer together, without chapter headings.

Perhaps the numbers could be in bold?

It's too broken up. For me. The layout keeps each sentence in its original isolation. Pushing them together would allow us to see them together. As your algorithm suggests.

hugovk commented 9 years ago

Good points, both.

I'd intended to do the bold, but never got round to it. In fact, there's a not-done TODO for that :)

# s = s.replace(args.word, "**" + args.word + "**")  TODO

...

#     parser.add_argument('-b', '--bold', action='store_true',
#                         help="Embolden found text TODO")

I probably won't redo it with bold, as it'd mean re-running lots of slow code. Or messing around with regexes.
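For the record, the regex route might look something like the sketch below. It is not in gutencounter.py; `embolden` is a hypothetical helper that could post-process the already generated sentences instead of re-running the slow search. Unlike the plain `s.replace` in the TODO, it matches case-insensitively, keeps the original capitalisation, and leaves "someone" and "twenty-one" alone when the word is "one".

```python
import re


def embolden(sentence, word):
    """Wrap whole-word occurrences of word in Markdown bold markers."""
    # Negative lookarounds reject matches glued to letters, digits or hyphens.
    pattern = re.compile(
        r"(?<![\w-])(" + re.escape(word) + r")(?![\w-])", re.IGNORECASE)
    # \1 re-inserts the text as it was found, so "One" stays capitalised.
    return pattern.sub(r"**\1**", sentence)
```

For example, `embolden("I have one.", "one")` gives `I have **one**.`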

About the grouping, I just re-used the same CSS, but it could be tweaked easily and re-run quickly, so I might do that.

hugovk commented 9 years ago

I've done the easy bit and smushed the chapters closer together, which has also decreased the page count from 1,609 to 358.

dariusk / NaNoGenMo-2014

Counting in Context by @hugovk #149