triangle-man commented 3 years ago

Story description

Overview

In The Code Book (by Simon Singh), there are a number of code-breaking challenges of (roughly) increasing difficulty [1]. One of the early challanges is to break a monoalphabetic substitution cipher. This story breaks such a cipher automatically using bigram letter frequencies. Here is the challenge:

btjpxrmlxpcuvamlxicvjpibtwxvrcimlmtrpmtnmtnyvcjxcd
xvmwmbtrjjpxamtngxrjbahuqctjpxqgmrjxvcijpxymggcijp
xhbtwrqmgmaxmtnjpxhbtwrmyjpxqmvjcijpxpmtnjpmjyvcjx
jpxtjpxhbtwracutjxtmtaxymrapmtwxnmtnpbrjpcuwpjrjvc
ufgxnpblrcjpmjjpxscbtjrcipbrgcbtryxvxgccrxnmtnpbrh
txxrrlcjxctxmwmbtrjmtcjpxvjpxhbtwavbxnmgcunjcfvbtw
btjpxmrjvcgcwxvrjpxapmgnxmtrmtnjpxrccjprmexvrmtnjp
xhbtwrqmhxmtnrmbnjcjpxybrxlxtcifmfegctypcrcxdxvrpm
ggvxmnjpbryvbjbtwmtnrpcylxjpxbtjxvqvxjmjbctjpxvxci
rpmggfxagcjpxnybjpram

Note that the original retained inter-word spaces and punctuation, which made it very easy. I have made it harder by removing these

Detail

A monoalphabetic substitution cipher maps the alphabet a, b, ..., z to a permutation of itself. To encipher a plaintext, replace each letter in the plaintext by the corresponding letter from the permutation. There are 26! possible permutations of the alphabet so a brute-force attack is unfeasible.

A standard attack starts by counting the frequenies of letters in the ciphertext. In ordinary English text, e is most common letter, followed by t, then a, o, i, n, s, h, r, d, l, u... However, replacing each letter by the corresponding one in frequency order does not usually produce a readable result, although it can be sufficient to allow a pencil-and-paper continuation of the attack, looking for common two- and three-letter sequences.

In this story we assume the plaintext l_1 l_2 l_3 ... was generated according by a Markov process, where the probability of each letter depends on (and only depends on) the previous letter. Using this model we can compute the probability of the implied plaintext, given a proposed permuation, if we have available the conditional probabilities:

P(l_1 l_2 l_3 ...) = P(l_1) P(l_2 | l_1) P(l_3 | l_2) ...

These conditional probabilities are easily computed from tabulated bigram letter frequencies.

Then we search through through permutations (using a sort of steepest descent) until we find the permutation with greatest likelihood.

[1] https://simonsingh.net/cryptography/cipher-challenge/the-ciphertexts/.

Which datasets will you be using in this Turing Data Story?

I'm not sure which ciphertext to use: the one from The Code Book or a made-up one. I may invent one. Plaintext is likely to be a snippet from a recognisable source, e.g., Shakespeare, or a paper by Turing, or something.

Tabulated single-letter and bigram counts are available from:

@article{jones2004, author = {Jones, Michael N. and Mewhort, D. J. K.}, doi = {10.3758/BF03195586}, journal = {Behavior Research Methods, Instruments, \& Computers}, number = {3}, pages = {388--396}, title = {Case-sensitive letter and bigram frequency counts from large-scale English corpora}, url = {https://doi.org/10.3758/BF03195586}, volume = {36}, year = {2004}}

Ethical guideline

Ideally a Turing Data Story has these properties and follows the 5 safes framework.

[ ] The analysis you produce is openly available and reproducible.
[ ] Any data used are open and have an explicit licence, provenance and attribution.
[ ] Any data used are not personal data (i.e. the data is anonymous or anonymised).
[ ] Any linkage of datasets in your data story does not lead to an increased risk of the personal identification of individuals.
[ ] The Story must be truthful and clear about any limitations of analysis (and potential biases in data).
[ ] The Story will not lead to negative social outcomes, such as (but not limited to) increasing discrimination or injustice.

Current status

[ ] Scoping of the story
[ ] Write story outline
[ ] Add material to the story notebook
[ ] Combine materials into a readable story
[ ] Proofread
[ ] Request reviews
[ ] Address reviews
[ ] Merge to master branch.

Updates

crangelsmith commented 3 years ago

Hi @triangle-man sorry for the late reply!

We’ve discuss this today and we think there might be a couple of ways to include R to the TDS pipeline. One is to have two files, one is an R script that we will use with Binder to satisfy the reproducibility side of the Data Story and html with text + code that we plug into the fast pages.

Another way (and probably easier for us and the TDS framework) is to use jupyter with R (see this example). That should be almost straightforward.

On the 14th of September (10am), we will be running a workshop on SeptembRSE conference where we hope to help people write their stories. Would you be interested in joining? I think there will be enough expertise that day to make sure we get the R integration sorted.

triangle-man commented 3 years ago

Sure, love to join!

Yes; on reflection you are quite right: R-in-Jupyter is the way to go. I'd forgotten that Jupyter is totally fine with R (it's Julia-Python-R right?!). I'm just grumpy about the JSON format of notebooks ...

alan-turing-institute / TuringDataStories

Breaking a (simple) cipher #159