Simple Naive Redactor - Githubissues

jdherg commented 9 years ago

I wrote a quick Python script that tries to "redact" a novel by obscuring names.

I initially used NLTK to identify words to redact, but wanted to see how close I could get to that result without using complex models or anything other than the standard library. As a result, the script now redacts tokens that:

Are longer than two characters
Contain only letters
Never appear with a lowercase first letter anywhere in the text
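
A minimal sketch of those three rules using only the standard library. This is not the script from the repo. The `redact` function name, the `mask` parameter, the regex-based tokenization (maximal runs of ASCII letters), and the choice to keep each word's length are all assumptions made for illustration.

```python
import re

# Assumption: a "token" is a maximal run of ASCII letters. The original
# script may tokenize differently (for example, by splitting on whitespace).
WORD = re.compile(r"[A-Za-z]+")


def redact(text: str, mask: str = "x") -> str:
    """Mask words longer than two letters that never occur lowercase-initial."""
    tokens = WORD.findall(text)
    # Lowercased forms of every word seen with a lowercase first letter.
    seen_lower = {t.lower() for t in tokens if t[0].islower()}

    def replace(match: re.Match) -> str:
        word = match.group(0)
        if len(word) > 2 and word.lower() not in seen_lower:
            # Assumption: keep the word's length so the layout is unchanged.
            return mask * len(word)
        return word

    return WORD.sub(replace, text)


sample = "Elizabeth saw the dog. The dog saw Elizabeth and Mr. Darcy."
print(redact(sample))
# xxxxxxxxx saw the dog. The dog saw xxxxxxxxx and Mr. xxxxx.
```

Under these assumptions "The" survives because "the" also occurs in lowercase, and "Mr" survives because it is only two letters long. A name that is also an ordinary word slips through for the same reason: `redact("Hill climbed the hill.")` returns the input unchanged.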

Like some actual document redaction, the results are a little inconsistent. For example, it does a glaringly bad job with names that are also common nouns: if the word appears in lowercase anywhere else in the text, the name is left unredacted.

The repo is here and a redacted copy of Pride and Prejudice is here.

hugovk commented 9 years ago

Heh, did you know some 19th century writers, Austen included, did actually redact names and dates?

jdherg commented 9 years ago

Good catch! I ran into a similar question on MetaFilter a little while ago and I think it planted the seed for this script. I nearly used '-' instead of 'x' as my replacement character as a reference to that practice. Thanks for the reminder!

hugovk commented 9 years ago

See also https://github.com/dariusk/NaNoGenMo-2014/issues/108, which uses a different method of redacting.

dariusk / NaNoGenMo-2014

Simple Naive Redactor #141