exercism / problem-specifications

Shared metadata for exercism exercises.
MIT License
327 stars 541 forks source link

crypto-square implementations disagree with README in many tracks #190

Closed matthewmorgan closed 8 years ago

matthewmorgan commented 8 years ago

Taken from xjava:

Chunk sizes are wrong: The README gives several examples of how to deal with strings of text that can be made into perfect ( ie n X n ) squares, and those that can be made into imperfect ( r X c ) squares. In all the README examples, the imperfect squares are r X c with the additional constraint that r >= c.

As documented by @jtigger here most of the tracks take a 26-letter string and chunk it in to 5, 5, 4, 4, 4, 4. This is not the behavior specified in the README, and seems like a error from early on that got propagated to many tracks.

There's no json file for this in x-common, but it appears this needs to be changed in many places, not just on the Java track.

If I'm missing something please let me know, but it seems like this is a good case for common test data.

kytrinyx commented 8 years ago

I remember having trouble with inconsistencies on this before, but I've never taken the time to go through everything and tidy it up.

I'd love to have a canonical list of test cases (crypto-square.json would be great).

kytrinyx commented 8 years ago

Pinging @exercism/track-maintainers in case you're not watching the repo :)

ErikSchierboom commented 8 years ago

Just to be clear, in the linked post it shows show variants of how tracks implement the algorithm for the strings "Madness, and then illumination." and "Vampires are people too!". What is the correct output for those strings?

petertseng commented 8 years ago

There are two fundamental choices to be made in this problem.

  1. Should the imperfect squares (which have r rows and c columns) should have r >= c or c >= r? This affects the order of letters in the output, regardless of how the output is formatted/grouped.
  2. Now that we've decided on the order of letters in the output, how should it be formatted?

I believe all tracks have reached consensus on the first question, but not on the second (but it would be good to check that). Let's examine the first question.

r >= c or c >= r?

I look at the .yml file for the exercise, it lists source of http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html, and the example on that page is a c >= r example, as it says:

so it is written into a rectangle with 7 rows and 8 columns.

This is supported by the following sentence in the README:

A message between 5 and 8 characters long should use a rectangle 3 characters wide.

So the message "abcde" fills in this way (c >= r):

abc
de

Rather than this way (r >= c):

ab
cd
e

The following portion also supports c >= r

Which has length: 30 And splits into 5 6-character rows:


On a related note, I found a contradictory sentence in the README:

Broken into 8-character columns, it yields 7 rows.

This sentence is contradictory because if a column has 8 characters, that implies there will be 8 rows, by definition. The sentence should read something similar to "Broken into 8 columns, it yields 7 rows" (this is a c >= r definition)

This would bring it in line with the example right below that sentence (the example from the page linked above, in fact!):

ifmanwas
meanttos
tayonthe
groundgo
dwouldha
vegivenu
sroots

So what are the expected outputs, you ask?

"Madness, and then illumination.", normalized, has 26 characters, so we want to decide between 5x6 and 6x5.

"Vampires are people too!", normalized, has 20 characters, so we want to decide between 4x5 and 5x4.

If we believe that it should be c >= r, the output for "Madness, and then illumination." is "msemoaanindninndlaetltshui" and the output for "Vampires are people too!" is "vrelaepemsetpaooirpo"

If we believe that it should be r >= c, then the output for "Madness, and then illumination" is "mstlnnashladaeutnnnmiediio" and the output for "Vampires are people too!" is "viaeearrotmeepopsplo"

Notice the choice we make significantly changes the order of letters in the output. Also I didn't add spaces to the output, because that's a decision to be made later.

I used the code I had written to solve this problem to generate c >= r outputs (since it was written for a track that assumes c >= r) and http://rumkin.com/tools/cipher/coltrans.php to generate the r >= c outputs - for example you enter in "vampiresarepeopletoo" and a key of "1 2 3 4 5" if you want 5 columns. Because fundamentally this crypto square is a columnar transposition but with no rearrangement of columns and column size strictly based on the size of the message. My code agreed with the output generated by that page for c >= r, so I believed the page was reliable for generating r >= c outputs.

It seems to me that the various outputs pointed out at https://github.com/exercism/exercism.io/issues/2432#issuecomment-118467422 all are consistent with c >= r. They just are spaced differently. Which brings us to the next point.

Spacing of output

As long the order of the letters is correct, I'm much less fussed about how they are spaced. In fact, "no spacing" is a perfectly valid option to me.

However, there are obvious disadvantages of choosing to space out "mstlnnashladaeutnnnmiediio" as "msemo aanin dnin ndla etlt shui" since it trivially allows recovery of the plaintext - read the first letter of every word, read the second letter of every word, read the third...

Unfortunately, such a spacing is what is currently prescribed in the readme ("Output the encoded text grouped by column.")... and is also what is used in the examples in http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html . So there is an argument for doing it this way, despite my protests. This is the chunking of 5, 5, 4, 4, 4, 4 which actually is the behavior specified in the readme (since the number of chunks equals the number of columns)

Before https://github.com/exercism/x-common/commit/89d1274875bb9b8a7076c9af6af5c728786995b2, the old readme just said always output in chunks of five (regardless of message size!) which makes it harder to recover the plaintext (unless the message size is exactly 25 characters in which case you're out of luck!). This is where the "msemo aanin dninn dlaet ltshu i" grouping comes from.

It appears some tracks also have chunking where the number of letters in each chunk (as opposed to the number of chunks) is the same as the number of columns ("msemoa anindn inndla etltsh ui"). A possible solution, sure, but there need be no relation between the number of columns and the number of letters per chunk.

As for "vrelaepemsetpaooirpo" (the output for "vampires are people too"), the grouping consistent with the current README is "vrel aepe mset paoo irpo" (again, plaintext easily recovered, and number of chunks equal to number of columns), and the grouping for "always five" is "vrela epems etpao oirpo" (which also happens to be the "chunk size equals number of columns" since the number of columns was five)


So we should make a decision on these two things and standardize, I suppose.

My personal recommendations:


Footnote: I had originally written this comment assuming the tracks were in disagreement on the first question rather than the second. So that's why I went into so much detail defending c >= r for the first question... and then read https://github.com/exercism/exercism.io/issues/2432#issuecomment-118467422 and realized they all had the same letter order, so in fact the disagreement was on the second question instead. OOPS. So I had to backtrack and also write about output spacing. Then I decided leaving the entire comment intact would be helpful for anyone else who was confused between the two issues, and to underscore their relative importance in my mind.

kytrinyx commented 8 years ago

@petertseng This is a fantastic summary of the problem, thank you so much for doing the analysis and exposé.

The origin of this exercise is the "square code" exercise from here: http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html

matthewmorgan commented 8 years ago

@petertseng thanks so much. @jtigger and I have been working on a PR for x-common to clarify the parameters for this exercise. Can we continue the discussion there?

ErikSchierboom commented 8 years ago

Wow, thanks a lot for the clarification @petertseng !

verdammelt commented 8 years ago

Total tangent here - sorry - but while we are talking about cryptosquares I wanted to share this phrase in Latin which reads the same even when encoded as a cryptosquare:

sator arepo tenet opera rotas

(It was found in Pompeii (amongst a few other places)).

IanWhitney commented 8 years ago

I believe this can be closed. the r vs c question has been covered, and there is now a canonical test file.