Open lizadaly opened 3 years ago
Nice idea and execution. I had never heard of this character before. Love the symmetrical punctuation at the end of your 'novel'!
You may already know about this, but there is a scanned copy of a late (1884) edition of the original at the Internet Archive: https://archive.org/details/pickleforknowing00dextrich for anyone else who want to see the original. The punctuation page is here: https://archive.org/details/pickleforknowing00dextrich/page/31/mode/1up
Every edition seems to do the punctuation page differently! Mine looks like this, which is why I went through all the trouble of implementing the stupid symmetry:
@lizadaly, your's looks better, the symmetry was worth it :)
A translator and generator to produce text in the style of A Pickle for the Knowing Ones (1802) by noted eccentric Timothy Dexter (1747-1806).
Timothy Dexter was a merchant and "entrepreneur" who married rich and had an unlikely providential career selling literal coals to Newcastle and other dubious business ventures.
"A Pickle for the Knowing Ones" is unique in its use of idiosyncratic and self-aggrandizing language. Dexter wrote phonetically, with no punctuation, in turn praising himself, castigating his enemies, railing at his creditors and debtors, and insulting his wife. The local community of Newburyport continued to reproduce his initial pamphlet for decades after his death.
In the second and subsequent editions of his pamphlet he responded to criticisms that he failed to use punctuation with the following epigraph:
Project goals
Transform a modern text with ideological similarity to Dexter's business initiatives to a Dexter-like rendering.
The full source code repository contains the following utilities:
A dictionary generator, to iterate through the source of A Pickle for the Knowing Ones and generate a map of Dexter's spellings to standard English. The program should suggest words, skip known words, and store the dictionary.
A script, pickler.py, to take an input text and turn it into a Dexter-style rendering including punctuation treatment, appended to the end.
A programmatic downloader for the source data, filings.py to download SEC quarterly earnings reports from Telsa, Inc. and pass the source text into
pickler.py
.Dictionary generator
This loops through the source text, breaking on word boundaries, and generates an ordered data structure like the following:
Initially the first two values are the same, and the last is false.
In the spellcheck pass, a text-based UI assists the transcriber by showing a window of context, with a selection of spell-checked options (via autocomplete), or the transcriber can type in a new word:
Any words added replace the
spellchecked_word
value in the dictionary, and flip theis_spellchecked
bit. Re-running the program will resume at the last-checked word. On control-C (or at completion), the dictionary is saved.(Several manual passes were later made over the dictionary both to correct OCR errors in the Gutenberg-derived scan, but also to add additional Dexter errors that were accidentally corrected by the OCR software, based on comparing with my original edition.)
The Pickler
This takes an input text, as a list of strings, remaps all words according to the dictionary generated above, and removes all punctuation. The punctuation is then appended to the end, as Dexter did.
A test suite generates the process on a few samples:
Source:
Output:
filings.py
This downloads recent quarterly earnings reports (10-Q filings) from the EDGAR database provided by the US Securities and Exchange Commission using sec-edgar-downloader, parses the reports, then passes the output through the pickler.
EDGAR reports are in a bespoke SGML format with wrapped HTML; this extracts the HTML blob, passes the text nodes to
pickler.py
, updates them in replace, then writes out the transformed HTML.Using the HTML-to-text capability of w3m then produces nicely-formatted plain text.
The output
66,369 words derived from three years of Tesla quarterly reports and amendments. Examples:
and in closing, all punctuation extracted from the source text:
Full 66,579 "novel": output.txt
More project detail, especially about the source editions used, in the repository README.