Open jdherg opened 9 years ago
Heh, did you know that some 19th-century writers, Austen included, actually did redact names and dates?
Good catch! I ran into a similar question on MetaFilter a little while ago and I think it planted the seed for this script. I nearly used '-' instead of 'x' as my replacement character as a reference to that practice. Thanks for the reminder!
See also https://github.com/dariusk/NaNoGenMo-2014/issues/108, which uses a different method of redaction.
I wrote a quick Python script that tries to "redact" a novel by obscuring names.
I initially used NLTK to identify words to redact, but wanted to see how close I could get to that result without using complex models or anything other than the standard library. As a result, the script now redacts tokens that:
Like some actual document redaction, the results are a little inconsistent. For example, it does a glaringly bad job with names that double as common nouns appearing elsewhere in the text.
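For anyone curious what a standard-library-only approach might look like, here is a rough sketch. It is not the script from the repo, and the rule it uses is my own assumption rather than the actual criteria: it masks any capitalized word of two or more letters that never shows up in all-lowercase form anywhere in the text. The names `redact`, `WORD_RE`, and the `replacement` parameter are hypothetical.

```python
import re

# Runs of ASCII letters count as words; everything else is left untouched.
WORD_RE = re.compile(r"[A-Za-z]+")


def redact(text, replacement="x"):
    """Mask capitalized words that never appear in lowercase in `text`.

    Assumed heuristic for illustration only, not the original script's rules.
    """
    words = WORD_RE.findall(text)
    # Words seen in all-lowercase form are treated as ordinary vocabulary.
    lowercase_seen = {w for w in words if w.islower()}

    def mask(match):
        word = match.group(0)
        looks_like_name = (
            len(word) > 1                            # skip "I" and "A"
            and word[0].isupper()
            and word.lower() not in lowercase_seen
        )
        # Keep the word length so the page layout stays recognizable.
        return replacement * len(word) if looks_like_name else word

    return WORD_RE.sub(mask, text)


sample = ("Elizabeth saw Mr. Darcy. She hoped that she would see Grace, "
          "who moved with grace.")
print(redact(sample))
# xxxxxxxxx saw xx. xxxxx. She hoped that she would see Grace, who moved with grace.
```

Note that "Grace" survives because "grace" also appears as a common noun, which is the same kind of inconsistency described above. Passing `replacement="-"` gives the dash style instead of the `x` style.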
The repo is here and a redacted copy of Pride and Prejudice is here.