Closed: Insti closed this issue 7 years ago.
Aye!
I believe that the same logic can be applied to a broader class of questions:
Q: Should we test for ...? A: No! Unless that is a fundamental part of the problem.
Anytime we test for something that is not fundamental, we reduce diversity in the solutions and reviewing code gets boring. :grin:
Agreed. I think having the non-ASCII test in Isogram does not really add anything. In fact, I think most people will be confused by it. If someone could come up with another exercise in which non-ASCII character handling does make sense, I'm all for it. But for the current exercises, I think it should be removed.
I do like having some non-US-ASCII test cases as an optional part of a very small subset of exercises. They can teach you a lot about Unicode handling, if and only if you are open to it and want to learn it.
Anagram is not one of them, since it requires you to normalize characters, and some of them can't be normalized. Take the German ß vs. SS: is MASSE an anagram of Maße? And how about the other way round?
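Python 3's string methods illustrate the asymmetry: uppercasing expands ß to SS, but lowercasing never produces ß again, so a naive case-based normalization is not reversible. A small sketch:

```python
# The ß vs. SS problem: case mapping is not round-trippable.
word_a = "Maße"
word_b = "MASSE"

print(word_a.upper())   # "MASSE" -- upper() expands ß to SS
print(word_b.lower())   # "masse" -- lower() cannot recover ß

# casefold() is the aggressive fold intended for caseless matching:
print(word_a.casefold() == word_b.casefold())  # True
```

Whether MASSE "is" an anagram of Maße then depends entirely on which normalization the test author picked, which is exactly the ambiguity being objected to.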
Word-Count, on the other hand, does gain a lot from adding the Unicode sugar to separate words from each other. The normalization problem persists there too, but its influence is far weaker.
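A minimal Python 3 sketch of the Word-Count case (the `word_count` helper is hypothetical, not the actual exercise code): `\w` matches Unicode word characters by default, so non-ASCII letters stay inside words, while `casefold()` still quietly folds ß to ss, showing that normalization leaks in even here.

```python
import re

def word_count(text):
    # \w+ keeps Unicode letters such as ü and ß inside a word.
    words = re.findall(r"\w+", text.casefold())
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

print(word_count("Der Fluß, der Fluß!"))  # {'der': 2, 'fluss': 2}
```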
I would agree that non-ASCII test cases don't make sense in every exercise, especially if they add unnecessary complexity, for example when normalization is needed. But we should add Unicode characters wherever it makes sense. It is 2016, emojis have conquered the world, and not everyone can stay in the US bubble. People need to learn how to deal with Unicode at some point in the tracks.
> It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with unicode at some point in the tracks.
I agree, this is why there need to be specific exercises that deal with multi-language text handling, and all the gotchas and edge cases involved in that. But that shouldn't be required in an exercise that is about the algorithm for detecting anagrams.
Looks like we're applying a sort of Unix philosophy to many exercises: do one thing and do it well. This is why we are cutting Unicode from a few text exercises. I welcome the move!
For when that new Unicode exercise (or exercises) solidifies, I have a list of things I've seen over the months that could stand to go in it:
> Are there other problems that have non-ASCII test cases?
We can try `git grep -Pnl "[\x80-\xFF]"` to find them, though I admit this can have false positives. But at least it shouldn't have false negatives... right?
Edit: Actually, it may not have false positives either...
For me, this has found:
exercises/atbash-cipher/canonical-data.json
exercises/bob/canonical-data.json
exercises/forth/canonical-data.json
exercises/isogram/canonical-data.json
exercises/pangram/canonical-data.json
exercises/run-length-encoding/canonical-data.json
exercises/scrabble-score/canonical-data.json
I have been working in power generation for the last few years, and the frequency with which Unicode (non-ASCII) characters break existing code increases every year. Data collection spreads across the globe and into countries whose alphabets are not contained in ASCII. Many common shortcuts for managing punctuation are bad practice outside of case-sensitive alphabets. I understand KISS, but I can't think of a better place than these learning exercises to help people move toward thinking in Unicode and away from ASCII crutches. Another example of these crutches: not all languages have the concept of upper and lower case (bicameral vs. unicameral alphabets; Persian, Arabic, and Hebrew are unicameral). I think the test cases should challenge one to handle case, but also to handle situations where the alphabet has no case at all. Anagrams and isograms exist in these languages as well.
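To make the caseless-alphabet point concrete, here is a hypothetical Python 3 isogram check (not any track's actual implementation) that avoids assuming upper/lower case exists: `casefold()` is effectively a no-op for unicameral scripts such as Hebrew, while still folding case for bicameral ones.

```python
def is_isogram(word):
    # casefold() handles bicameral alphabets; for unicameral ones
    # (Hebrew, Arabic, ...) it simply leaves the letters alone.
    letters = [ch.casefold() for ch in word if ch.isalpha()]
    return len(letters) == len(set(letters))

print(is_isogram("Dermatoglyphics"))  # True
print(is_isogram("שלום"))             # Hebrew input works with no case logic at all
```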
@ldwoolley what you say here is obviously true. That's exactly why we said we need extra exercises that teach Unicode.
BUT! We need to slow that down a bit. In most languages, everything non-US-ASCII is a PITA. Many languages even require you to use external libraries.
So removing non-US-ASCII achieves multiple goals:
But would it hurt to keep the non-ASCII test cases in the test suites, just deactivated/skipped, with a comment like "if you are not new to programming and/or care about Unicode, it might be interesting to think about ..."?
Conclusion: All test cases should use only ASCII characters (unless extended character handling is integral to the problem).
We should add exercises that explicitly deal with multi-language characters: See https://github.com/exercism/x-common/issues/455
@Insti this discussion should be closed now, right?
The canonical-data.json files have also now been updated via #441.
While updating the anagram test cases (issue: https://github.com/exercism/x-common/issues/413), discussion of handling non-ASCII characters came up, and we decided that we would NOT use non-ASCII characters in those tests. @NobbZ made a good point:
Isogram (as of 2016-10-31) also has non-ASCII test cases.
Are there other problems that have non-ASCII test cases?
I've created this issue so we can discuss the general policy of whether non-ASCII characters should be used in test cases, and so we have a thread to point to when this comes up again in the future.
Proposal: All test cases should use only ASCII characters (unless extended character handling is integral to the problem).