Discussion: Non-Ascii test cases.

Insti commented 7 years ago

While updating the anagram test cases (issue: https://github.com/exercism/x-common/issues/413) Discussion of handling non-ascii characters came up, and we decided that we would NOT use non-ascii characters in those tests.

@NobbZ made a good point:

I do not think, that we should test for anything that is not in ASCII.

Most languages do not cope very well with codepoints beyond ASCII

There are letters beyond ASCII, that would need another step of normalization. Eg the german "es-zett" (ß) which is is only allowed as lowercase (but there is a capital version available which is allowed to be used on titlepages and headlines), on capitalisation it is usually turned into "SS". There has been the long-s, and similar characters in ancient greek (gamma was one of them AFAIK).

Isogram (as of 20161031) also has non-ascii test cases.

Are there other problems that have non-ascii test cases?

I've created this issue so we can discuss the general policy of whether non-ascii characters should be used in test cases and have a a thread to point to when it comes up again in the future.

Proposal: All test cases should only use ASCII characters (Unless extended character handling is integral to the problem.)

rbasso commented 7 years ago

Aye!

I believe that the same logic can be applied to a broader class of questions:

Q: Should we test for ...? A: No! Unless that is a fundamental part of the problem.

Anytime we testing for something that is nothing fundamental, we reduce diversity in the solutions and it gets boring to review code. :grin:

ErikSchierboom commented 7 years ago

Agreed. I think having the non-ASCII test in Isogram does not really add anything. In fact, I think most people will be confused by it. If someone could come up with another exercise in which non-ASCII character handling does make sense, I'm all for it. But for the current exercises, I think it should be removed.

NobbZ commented 7 years ago

I do like to have some of non US-ASCII test cases as an optional part for a very small subset of exercises. They can teach you a lot of unicode-handling if and only if you are open to it and want to learn it.

Anagram is not one of them, since it requires you to normalize charakters and some of them can't be normalized. In the example of the german ß vs. SS, is MASSE an anagram of Maße? How is it the other way round?

Word-Count on the other hand does gain a lot by adding the Unicode sugar to separate words from each other. But, well, the normalisation problem persists, but its influence is by far not that strong.

behrtam commented 7 years ago

I would agree that non-ASCII test cases don't make sense in every exercise, especially if they add unnecessary complexity for example if normalization is needed. But we should add unicode character where ever it makes sense. It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with unicode at some point in the tracks.

Insti commented 7 years ago

It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with unicode at some point in the tracks.

I agree, this is why there need to be specific exercises that deal with multi-language text handling, and all the gotchas and edge cases involved in that. But that shouldn't be required in an exercise that is about the algorithm for detecting anagrams.

petertseng commented 7 years ago

Looks like we're going sort of Unix philosophy with many exercises - do one thing and do it well. This is why we are cutting Unicode from a few text exercises. I welcome the move!

For the time when that new Unicode exercise(s) solidifes, I have a list of things I've seen over the months that could stand to go in it:

petertseng commented 7 years ago

Are there other problems that have non-ascii test cases?

We can try git grep -Pnl "[\x80-\xFF]" to find them, ~~though I admit this can have false positives~~. But at least it shouldn't have false negatives... right?

Edit: Actually, it may not have false positives either...

For me, this has found:

exercises/atbash-cipher/canonical-data.json exercises/bob/canonical-data.json exercises/forth/canonical-data.json exercises/isogram/canonical-data.json exercises/pangram/canonical-data.json exercises/run-length-encoding/canonical-data.json exercises/scrabble-score/canonical-data.json

ldwoolley commented 7 years ago

I have been working in power generation for the last few years, and the frequency that unicode characters (non ASCII) breaks existing code increases every year. Data collection spreads across the globe and into countries that don't use ASCII contained alphabets. Many common shortcuts to manage punctuation are bad practice outside of case sensitive alphabets. I understand KISS, but I can't think of a better place then here in these learning exercises to help people move to thinking in Unicode, and away from using ASCII crutches. Another example, of these crutches is that not all languages have the concept of upper and lower case (bicameral vs. unicameral alphabets, think Persian, Arabic, Hebrew). I think the test cases should challenge one to handle case, but also handle a situation where the alphabet does not have a case. Anagrams and isograms exist in these languages as well.

NobbZ commented 7 years ago

@ldwoolley what you say here is obviously true. Thats exactly the point why we said, that we need extra exercises that teach unicode.

BUT! We need to slow that down a bit. In most languages everything not-US-ASCII is a PITA. Most languages require you to use even external libraries.

So removing non-US-ASCII achieves multiple goals:

Reduce complexity of exercises
Make exercises more focused
Remove dependencies from exercises
Make it easier to have uniform test-data across different tracks.

behrtam commented 7 years ago

But would it hurt to have the non-ASCII test cases as part of the test suits only that they are deactivated/skipped with a comment like "if you are not new to programming and/or care about unicode it might be interesting to thing about ..."?

Insti commented 7 years ago

Conclusion: All test cases should only use ASCII characters (Unless extended character handling is integral to the problem.)

We should add exercises that explicitly deal with multi-language characters: See https://github.com/exercism/x-common/issues/455

robkeim commented 7 years ago

@Insti this discussion should now be closed right?

The canonical-data.json files have also now been updated via #441.

exercism / problem-specifications

Discussion: Non-Ascii test cases. #428