Anagram add tests for graphemes

meatball133 commented 1 month ago

This has been discussed here: https://forum.exercism.org/t/unicode-testing-for-anagram-doesnt-actually-test-for-grapheme/10906

meatball133 commented 1 month ago

The tests already contain non-ASCII characters. I am pretty sure this "β" is Greek for beta (a non ascii letter). From my perspective, it could be argued that this exercise shouldn't have Unicode characters. But that discussion should have been held when the first uncicode tests were added.

meatball133 commented 1 month ago

Here is the pr for reference: https://github.com/exercism/problem-specifications/pull/2366

BethanyG commented 1 month ago

It could be argued that this exercise shouldn't have Unicode characters. But that discussion should have been held when the first uncicode tests were added.

Just because we didn't have the discussion then doesn't mean we can't have it now. I don't think those test cases should have been added, given the instructions (I know I approved that, but I shouldn't have without looking at the docs). Even with the scenario flag.

I think if we go this route, we need to change/clarify the instructions, and we also need to make sure that any test cases form valid words in the target language. For ref, the Wiki article on Anagrams.

senekor commented 1 month ago

The reasoning at the time of the original PR was that these tests are added under the unicode scenario. (forum post) Languages are free to exclude these tests and add additional instructions if the choose to include them. Example from the Rust track:

# Instructions append

The Rust track extends the possible letters to be any unicode character, not just ASCII alphabetic ones.

BethanyG commented 1 month ago

Yes - I know. 🙂 But at the time, I didn't realize that the Anagram instructions directed students specifically at ASCII. I think that directive should be removed, and the instructions made more general, or there should be explicit mention of Unicode handling. For example, reverse-string doesn't mention ASCII at all, and micro-blog gets very specific about encodings.

I think if we're going to include Unicode here, the instructions should follow one or the other of those exercise examples. I also think that we should make sure that any test cases follow the rules of Anagram formation.

ErikSchierboom commented 1 month ago

I think if we're going to include Unicode here, the instructions should follow one or the other of those exercise examples.

I think I would prefer to just not mention ASCII in the instructions, as we already have non-ascii characters.

I also think that we should make sure that any test cases follow the rules of Anagram formation.

What do you mean by this? Just that we double-check if the test cases match the instructions?

BethanyG commented 1 month ago

What do you mean by this? Just that we double-check if the test cases match the instructions?

An Anagram needs to be a valid word in a given language (capitalization non withstanding), so as the instructions are written I think all candidates passed in test cases either need to be valid words, or obviously not valid, if that makes sense.

senekor commented 1 month ago

An Anagram needs to be a valid word in a given language

I feel like the only sensible interpretation of the candidates list is that students can assume them to be valid words. Otherwise, every single solution to this exercise would have to include an actual dicitionary of the English language.

Wikipedia may have some definition of the word Anagram, but natural language processing is not the goal of this exerise, right?

BethanyG commented 1 month ago

I feel like the only sensible interpretation of the candidates list is that students can assume them to be valid words. Otherwise, every single solution to this exercise would have to include an actual dicitionary of the English language.

..which is part of the point I am trying to make. In adding in Unicode characters (or any extended ASCII for that matter) we've taken this beyond English. With very few exceptions, English doesn't include any accented characters (nor Greek!)

But that's fine! It just means when we craft test cases, we need to make sure that the candidates are valid words -- whatever the language is that the candidates are written in. Conversely, I don't know any languages that use a Euro sign within words, so "€a" (last test case) is "obviously" wrong (and the start word is a single letter followed by a non-word symbol anyways), with I think is fine.

We might want to add "assume all candidates are valid words" to the instructions as well tho, just to be safe.

And for the test case in Greek, we have ["ΒΓΑ", "ΒΓΔ", "γβα", "αβγ"] with the expected result being ["ΒΓΑ", "γβα"]. But why not "αβγ"? Because its not a valid word - it is "ABC". But I had to go look that up in Google translate to figure that out. So I think that case needs to be changed.

And for this proposed test case, we need to pick a word that's valid, and craft candidates that are also valid words in the same language.

senekor commented 1 month ago

But why not "αβγ"?

Because it's the lowercase version of ΑΒΓ.

I don't see the impact of whether the words are valid in some natural langue on the user experience, but I'm fine with reimplementing the test cases with actual words that test the same thing.

senekor commented 1 month ago

Note that the test case using €a is extremely important for correct unicode handling. Finding valid words in a natural language that contain multi-byte characters in UTF-8 and have the same number of each byte could prove difficult.

MatthijsBlom commented 1 month ago

I don't think considering languages is feasible. The strings abba and baäb might be anagrams in one 'language', but not in another: various writing systems disagree on whether two characters are the same 'letter' or not.

If Unicode tests are to be included, I think it would be wise to have the instructions explicitly define what they mean by 'letter' for the purposes of this exercise, but not consider applicability to the World's writing systems.

BethanyG commented 1 month ago

Because it's the lowercase version of ΑΒΓ.

D'OH 🤦🏽‍♀️ Nevermind, then.

Note that the test case using €a is extremely important for correct unicode handling. Finding valid words in a natural language that contain multi-byte characters in UTF-8 and have the same number of each byte could prove difficult.

Probably. In any case, I don't have the interest to find any. 🙂

If we change the instructions to remove the reference to ASCII and add in a note that the student should assume all candidate words are valid, then I am fine.

exercism / problem-specifications

Anagram add tests for graphemes #2445