acronym: add leading / trailing and multiple separator case

yawpitch commented 5 years ago

Currently the acronym tests do not cover inputs with leading, trailing, or repeated separator characters, and many solutions presented will fail if these are encountered.

A student suggested this as a good test string: " - Annoying string ending - with - multiple separators - " should return "ASEWMS".
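A minimal sketch of why that string is a useful test, assuming two hypothetical solutions (`naive_abbreviate` and `tolerant_abbreviate` are illustrative names, not taken from any track): one splits on single spaces and trips over the empty "words" that leading, trailing, and repeated separators produce, the other lets `str.split()` collapse them.

```python
def naive_abbreviate(phrase: str) -> str:
    # Hypothetical naive solution: splitting on a single space yields
    # empty strings wherever separators repeat or sit at the edges.
    words = phrase.replace("-", " ").split(" ")
    return "".join(word[0] for word in words).upper()


def tolerant_abbreviate(phrase: str) -> str:
    # str.split() with no argument collapses runs of whitespace and
    # ignores leading/trailing whitespace, so no empty words appear.
    words = phrase.replace("-", " ").split()
    return "".join(word[0] for word in words).upper()


tricky = " - Annoying string ending - with - multiple separators - "

try:
    naive_abbreviate(tricky)
except IndexError:
    print("naive version fails: an empty word has no first letter")

print(tolerant_abbreviate(tricky))  # ASEWMS
```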

rpottsoh commented 5 years ago

I think a single PR that closes #1431 and #1432 would suffice.

yawpitch commented 5 years ago

Definitely. I'm just not in a position to provide it right now.

Also I think we should consider what should the acronym be for inputs like "3 Men And An _nderscore"?

Basically we should positively state what we consider an acronym or if we don't want to worry about those sort of inputs affirmatively say they won't be provided.

rpottsoh commented 5 years ago

Also I think we should consider what should the acronym be for inputs like "3 Men And An _nderscore"?

Interesting. I was curious so I tried this with my solution; I get MAAN.

Basically we should positively state what we consider an acronym or if we don't want to worry about those sort of inputs affirmatively say they won't be provided.

The description is a little weak and could probably benefit from a definition of some sort that defines what is considered a reasonable phrase from which an acronym could be derived.

yawpitch commented 5 years ago

Definitely. Personally I'd say something like: "A valid input will be an all-ASCII word or phrase, possibly containing punctuation, and possibly empty. For the purposes of this exercise you can expect that any word given will begin with an ASCII letter, but may be in any case. Hyphenated words are considered distinct words, for instance 'Self-Contained Underwater Breathing Apparatus' becomes 'SCUBA'. All other punctuation should be ignored, and an empty string or string without any words should return an empty string."
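One possible reading of that wording, as a rough Python sketch rather than a reference solution. The `WORD` pattern and the `abbreviate` helper are assumptions made for illustration: a word is taken to be an ASCII letter followed by any ASCII letters or apostrophes, and every other character is treated as a separator.

```python
import re

# Assumption: a "word" starts with an ASCII letter and may continue with
# ASCII letters or apostrophes; anything else (spaces, hyphens, other
# punctuation, digits) separates words.
WORD = re.compile(r"[A-Za-z][A-Za-z']*")


def abbreviate(phrase: str) -> str:
    # Take the first letter of every matched word, upper-cased.
    return "".join(word[0].upper() for word in WORD.findall(phrase))


print(abbreviate("Self-Contained Underwater Breathing Apparatus"))  # SCUBA
print(abbreviate("Halley's Comet"))  # HC
print(repr(abbreviate(" - - ")))  # ''
```

Under this reading an input with no words at all falls out naturally as an empty string, because the pattern simply finds nothing to match.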

Does that map to all the languages that have implemented acronym though?

rpottsoh commented 5 years ago

Does that map to all the languages that have implemented acronym though?

Don't know, but likely.

I think your statement sums things up nicely though. 👍

ErikSchierboom commented 5 years ago

I think a more specific description of what defines an acronym would be much appreciated.

sshine commented 5 years ago

It's the first letter of each word. For hyphenated words, include the letter after the hyphen(s).

Why does it need to be more specific than this?

yawpitch commented 5 years ago

For one, that's not the meaning of "acronym" -- the exercise name -- and certainly not "abbreviate" -- the name of the actual property under test in the exercise -- in all languages and locales. In fact it's really only the meaning in American and British English, though it's used more or less the same in a few other territories like Russia (in Cyrillic) and Vietnam. But fair enough, since we're already assuming ASCII let's assume American initialism "rules" apply... what happens with non-letter characters that start words? The generally accepted "rules" are silent on this, but there are certainly initialisms with numbers (HTML5, CSS3, 3G). For instance many solutions in Python use a regex with the \w special character, which in Python 3 allows not only digits and the underscore, but also any Unicode code points that could be part of a word in any locale. Potentially that means acronyms can include Kanji. Should the student be required to limit it to ASCII?
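To make the \w point concrete, here is a small sketch, assuming a solution built around the pattern `\b\w` (a common shape, though not any particular student's code). With Python 3's default Unicode matching it picks up digits, underscores, and non-ASCII word characters as "first letters":

```python
import re

# \b\w matches the first "word character" after each word boundary.
# In Python 3, \w on str patterns is Unicode-aware by default.
first_chars = re.compile(r"\b\w")

print(first_chars.findall("HTML5 and 3G"))             # ['H', 'a', '3']
print(first_chars.findall("3 Men And An _nderscore"))  # ['3', 'M', 'A', 'A', '_']
print(first_chars.findall("\u6f22\u5b57 test"))        # ['漢', 't']
```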

And what constitutes a valid separator pattern? Is it just spaces and hyphens immediately preceding a letter? Or is it any run of punctuation except a single conjoining apostrophe?

The tests are few and not particularly exhaustive and the problem is loosely defined... it's already led to more wheel spinning than it deserves because it's not more clearly delineated. But if that definition is "the first ASCII letter of each word that's preceded by the start of the sentence, a single space, or a single hyphen" as implied by the tests, that's fine, we just need to state it clearly.

sshine commented 5 years ago

Ok, let's state it clearly then.

rpottsoh commented 5 years ago

Note that #1436, which deals with underscores, has been merged recently. I am gathering that this issue is maybe more for advocating a change to the description.md than to the canonical data.

sshine commented 5 years ago

The assumption of ASCII is not unique to this exercise. So "The first letter of each word" should be sufficient here.

sshine commented 5 years ago

this issue is maybe more for advocating a change to the description.md

At least a part of the discussion has focused on that.

yawpitch commented 5 years ago

The assumption of ASCII is not unique to this exercise. So "The first letter of each word" should be sufficient here.

I'd tend to argue that that assumption is a bug, not a feature of Exercism, and that where it's relevant to the solution the bias should be explicitly called out.

As acronym is an exercise that will very commonly be approached with regular expressions -- in Python it's a core exercise and tagged as the first to involve regex -- the ASCII limitation can be very important to the solution.

For instance in Python 3 without compiling the regex with the re.ASCII flag the \w special character will match all of E and È and É and Ę... should those all be included? Should they be excluded? I don't know or have a particular opinion, but us not expressing a preference for ASCII-only solutions leaves it as UB, and UB is pretty confusing for a learner, especially one for whom English isn't a first language and who isn't necessarily typing in ASCII.
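A quick illustration of the difference the flag makes, using only the standard `re` module (the `sample` string is just the four letters mentioned above, written with escapes so the code points are unambiguous):

```python
import re

sample = "E\u00c8\u00c9\u0118"  # E, È, É, Ę as precomposed code points

# Default (Unicode) matching: \w accepts all four letters.
print(len(re.findall(r"\w", sample)))       # 4

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
print(re.findall(r"\w", sample, re.ASCII))  # ['E']
```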

If we explicitly limit the character set we make the students' lives easier, and we also get the opportunity to present a bonus exercise in which they extend to handle something like "L'École Française du Bristol", which according to that school should abbreviate to EFB.
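As a very rough idea of what such a bonus exercise might involve, here is one sketch that happens to produce EFB for that name. The rules it encodes are assumptions made up for illustration, not anything the school or the exercise defines: an elided article such as "L'" is dropped, only words that start with an upper-case letter contribute, and accents are stripped from the chosen letter. `abbreviate_french` is a hypothetical name.

```python
import re
import unicodedata


def abbreviate_french(phrase: str) -> str:
    letters = []
    for word in re.split(r"[\s-]+", phrase):
        # Assumption: a one- or two-letter prefix followed by an
        # apostrophe is an elided article (L', d', qu') and is dropped.
        word = re.sub(r"^\w{1,2}'", "", word)
        # Assumption: lower-case words such as "du" are skipped.
        if word and word[0].isupper():
            # NFD splits an accented letter into base letter + combining
            # marks; keeping index 0 keeps only the base letter.
            letters.append(unicodedata.normalize("NFD", word[0])[0])
    return "".join(letters)


print(abbreviate_french("L'École Française du Bristol"))  # EFB
```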

emcoding commented 5 years ago

exercism / problem-specifications

acronym: add leading / trailing and multiple separator case #1432