exercism / problem-specifications

Shared metadata for exercism exercises.

Add a rule to the style guide to explain abbreviations, acronyms and initialisms #1716

Closed iHiD closed 3 years ago

iHiD commented 3 years ago

_Originally posted by @wolf99 in https://github.com/exercism/problem-specifications/pull/1713#discussion_r505751243_

Please read the discussion there.

cmcaine commented 3 years ago

@wolf99 said:

Should there be a similar rule to explain abbreviations, acronyms and initialisms?

For example, a rule that is often used in academic writing is that an initialism should have its expansion in brackets immediately following its first use and thereafter can be used freely.

I said:

I think we should have a no jargon rule. And don't use initialisms etc at all unless they are better known by the initialism (e.g. HTTP, HTML, radar) or if the text would be ridiculously verbose otherwise.

@wolf99 said:

In favour of no jargon.

However, outside of the specific language related initialism, should we assume that all users will have knowledge of what I or you might consider better known ones, even for non-native US English speakers?

Yet we would still need to use initialisms in some cases. If we, hypothetically, agree that RADAR is jargon to non-US English speakers, then we would not want to be forced to use the long form each and every time it is mentioned, especially if it might form the name of test cases or some other oft-repeated part of an exercise.

Thus we need some kind of middle solution (hence my suggestion 😉).

I can think of a host of these that might be used depending on the exercise and track: FIFO, NASA, ABS, GIF, DIY, SONAR, SUV, UFO, CRUD, ANSI, API, ASCII, ICANN, CMYK, HTTP, FTP, DPI, DSL, JPEG, ... these are not especially unusual in programming or jargon-y, yet I can easily say that not everyone around the world would be familiar with all of them.

I think most of those terms are better known by their initialisms / acronyms (first part of my rule), so there is no benefit from spelling them out. For example, if you're talking about NASA I won't really benefit from you saying "The National Aeronautics and Space Administration (NASA)"; instead you should say "The American space agency, NASA".

The only exceptions here are: ABS (could be one of several things); CRUD (jargon: avoid if possible); DPI (could be dots or deep packet inspection (jargon), just say the whole thing once or use a better term (pixel density or packet inspection)); DSL (jargon, spell it out); FIFO (jargon: prefer queue and stack to FIFO, FILO).

If there is some genuine reason to say CRUD or DSL loads (what?), then you can use the initialism with explanation under the second part of the rule I gave.

iHiD commented 3 years ago

(cc @kotp who has opinions on acronyms)

kotp commented 3 years ago

If we are aiming for beginner programmers, regardless of how experienced they are with English, then I do think that the first use of any acronym, abbreviated word, or initialism should be clarified. For example, I only know what RADAR means, technically, because of the maintenance I did on those systems. Otherwise, I would probably only ever have had a vague idea of what it stands for. I would not expect anyone to know what it stands for, or even that it is in fact an abbreviation, acronym, or initialism, though I would suspect that they might have a vague idea of what it means, if not technically what it is. I don't think that is necessarily good enough. I expect that the term has been around long enough, and is widespread enough globally, that it could be looked up rather easily.

I don't know that I would know what API is, given how loosely that is used all the time. It should, honestly, be defined specifically, and every time, and otherwise avoided.

My reading of things like DIY is also influenced by my experience, and I had to come to accept that it is now being used as Design It Yourself (like a DIY website).

It's not too hard for someone to read an abbreviation and make the intuitive "I know exactly what that means, no need to look it up." only to be confused later when it is used in a way that makes no sense to them, because while they know exactly what it means to them, we did not know that. It is better to define those things on first use. (And it is easy to skip over an explanation if it starts to confirm what we already know.)

We already deal with a lot of languages, some of them even human languages. We should avoid confusion where possible, and I think the practice of defining (or avoiding) is easy. After all, yes, DNA (deoxyribonucleic acid) is very long to write out, but don't we have tooling that makes DNA expand out to what it should? Or, if we decide to continue to use the short form, then it is just as easy to define it there.

Remember, we also have the additional confusion of different programming languages using different terms for the same thing, or same terms for different things. We have enough work in front of us, we can avoid introducing more than we need to.

kotp commented 3 years ago

(cc @kotp who has opinions on acronyms)

I have no idea what you are talking about… ;)

cmcaine commented 3 years ago

I agree with your point about defining ambiguous terms. Things like DIY or API should be avoided or defined in context.

The point of the first part of my rule is that what "radar" or "NASA" stands for may not be relevant. We can just treat them as nouns, and we don't define all nouns. Most people will know it as some kind of sensing equipment and that's probably good enough. Somewhat similarly, we don't define what a "minefield" is, we just assume that people will know or look that up.

As for DNA, I think we will be introducing more confusion by referring to it as deoxyribonucleic acid because people do not know it best by that name. The first line of the Nucleotide Count exercise in Julia reads:

Given a single stranded DNA string, compute how many times each nucleotide occurs in the string.

I think simpler language would say something like:

Count how many times each nucleotide occurs in a DNA strand.

or

Given a string representing the nucleotides in a strand of DNA, count how many times each nucleotide occurs.

I think we would only be referring to DNA long-form if we are explaining what DNA is, and I'm not sure we should be doing that. Here's an attempt at doing it quickly, it just comes off clumsy (imo):

DNA is a chemical chain with four kinds of links, called nucleotides. We can encode a chain of DNA as a string with one character per link like this: "GATTACA". Given a string like that, count how many times each nucleotide occurs in the DNA chain.

I don't think it is a great idea to explain the flavour of the problem like this.

In v3 we're going to have some reference material explaining concepts, and if there are some common acronyms that should be explained there then, sure, they could be referenced and spelled out, but I can't think of anything that would be covered there that is an acronym (except maybe HTTP or HTML which might be important enough to some languages to warrant an explainer, and, again, those are better known by their acronyms, and I think we might confuse matters by saying hypertext transfer protocol every time).

iHiD commented 3 years ago

Please take a look at this PR for my thoughts on language around DNA 🙂 (along with this discussion). Just to give an indication of where I think these things should lie.

cmcaine commented 3 years ago

I think that's a nice introduction. I don't think it would be improved much by calling DNA by its long name.

My only minor beef with the Hamming distance intro is that it kinda implies that the Hamming distance is a biology concept that is also used elsewhere rather than an information theory concept that is used in all sorts of places, including biology.

Here's an attempt to rewrite the nucleotide count intro to be more interesting:

Most people inherit not only their socioeconomic class from their parents, but also DNA, a set of chemical instructions that influence how their bodies are constructed! DNA is a long chain of other chemicals and the most important are the four nucleotides, Adenine, Cytosine, Guanine and Thymine. A single DNA chemical can contain billions of these four nucleotides and the order in which they occur is important! We call the order of these nucleotides in a bit of DNA a "DNA sequence".

We represent a DNA sequence as an ordered collection of these four nucleotides and a common way to do that is with a string of characters such as "ATTACA" for a DNA sequence of 6 nucleotides.

Given a string representing a DNA sequence, count how many of each nucleotide is present. If the string contains characters that aren't A, C, G, or T then it is invalid and you should throw an error.

Given a society that determines life outcomes by inheritance, agitate.
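
As a rough illustration of the task that draft describes, here is a minimal Python sketch. The function name `count_nucleotides` and the use of `ValueError` for invalid input are assumptions made for this example; they are not taken from any track's actual implementation or test suite.

```python
from collections import Counter

# The four nucleotides named in the draft introduction.
NUCLEOTIDES = "ACGT"


def count_nucleotides(strand):
    """Return a dict mapping each of A, C, G, T to its count in `strand`.

    Raises ValueError if the string contains any other character,
    mirroring the "throw an error" wording of the draft.
    """
    invalid = set(strand) - set(NUCLEOTIDES)
    if invalid:
        raise ValueError(f"invalid nucleotide(s): {sorted(invalid)}")
    counts = Counter(strand)
    # Report all four nucleotides, including those that never occur.
    return {nucleotide: counts[nucleotide] for nucleotide in NUCLEOTIDES}


print(count_nucleotides("ATTACA"))  # {'A': 3, 'C': 1, 'G': 0, 'T': 2}
```

Note that nothing in the sketch depends on knowing what DNA is, which is the point made elsewhere in this thread about the theme not needing to be understood to solve the exercise.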

But, OTOH, I don't care a lot about what colour this bike shed is, and I think I've articulated my thoughts on this fairly well, so I think I might be done?

kotp commented 3 years ago

I agree about Hamming, it is a bit unfortunate. But it makes for good conversation when the solution also uses abbreviations such as "str" where I try to ensure that I take that as an abbreviation of "strand" rather than "string" (because of the language of the problem domain, vs, the language of data type or "programming"). Anyway, yeah, Hamming is a more general algorithm, and I find that very often this exercise is the first people hear about it.

iHiD commented 3 years ago

@cmcaine I really like that nucleotide count introduction a lot. PR it and let's get it merged. Possibly without the last line though 😉

Clear language is one of the few hills that I would choose to make my stand on, and a bike-shed that I care to paint! 🙂

I'd appreciate y'all's thoughts on https://github.com/exercism/problem-specifications/pull/1718

SleeplessByte commented 3 years ago

I think, as with my other comments, that most of this can be solved by not choosing.

It is problematic that people don't know what HTML actually means; yet writing it out only hints and doesn't explain. That said, as a non-native, nativelike speaker, I absolutely hate abbreviated words such as YMMV and ICIMIY and IANAL, because it actively stops my train of thought -- I have to spell it out, say it out loud and sometimes even translate it back to the correct Dutch paradigm, because some of these expand to a colloquialism or proverb or otherwise abstract thing.

  1. Don't use abbreviations ever, unless you're naming something (acronym), or when its expanded form is not English (such as "e.g.", "i.e." and "etc.", simply because many people don't know what these stand for).
  2. When using acronyms,
    • if the acronym itself can be considered widely known, write the expanded form out behind it (e.g. HTML (HyperText Markup Language) doesn't lend itself well to represent DNA (Deoxyribonucleic Acid)).
    • otherwise start with the expanded form, and then use the acronym (e.g. The Department of Motor Vehicles (DMV) is filled with sloths. That's why everything takes forever at the DMV).
  3. Be generous with links and definitions. I would even consider explaining what a queue or stack is, if I mention it at all.
  4. If the thing the acronym describes doesn't matter to complete an exercise (for example DNA doesn't need to be understood to finish Hamming or Nucleotide Count), you SHOULD write a disclaimer that it doesn't need to be understood.

You shouldn't have to choose when you explain it or not. Always explain it, but choose between having the acronym first (NASA, JPEG, GIF, PHP) or the unabbreviated form first.

cmcaine commented 3 years ago

I think those are good rules.

I would tend not to expand well-known abbreviations like NASA, JPEG, etc, and instead just explain what they are. e.g. "NASA, the American Space Agency" or "JPEG, a common image file format" unless it was important to be precise: "The Joint Photographic Experts Group (JPEG) define the popular JPEG File Interchange Format. Given a file in this format...". But whatever. Either way is fine.

wneumann commented 3 years ago

I would even take it a step further and say that in some cases, using the expanded form is harmful.

One of those cases is the aforementioned JPEG. I can't imagine a case where someone is familiar with the Joint Photographic Experts Group, but isn't familiar with JPEG, but I know many, many, many people who are in the opposite situation. But beyond that, knowing what JPEG stands for doesn't offer any information about what JPEG is. And throwing out the term Joint Photographic Experts Group can only distract from an exercise involving JPEG compression. Even worse would be mp3 as I can picture people wondering what the hell moving pictures have to do with audio…

I'd put DNA (another term kicked around in these threads) in the same bucket. Knowing the expanded term offers no useful information for the student with respect to solving the exercise, whereas knowing what the abbreviation/initialism refers to does.

kotp commented 3 years ago

MPEG-1 Audio Layer 3 does very well to explain that it is the audio layer, and, in my mind, helps. The fact that the MPEG part is there, if known, shows that it is/was used as a layer with digital (rather than analog) video. But then I was around when these things were coming about, and have worked in studios. So I don't know if that background biases way too much of what that tells me by having it "uncompressed" from MP3.

Not that I disagree with the idea that there are times when the abbreviation can be used as a "pronoun", and can be defined well enough by the context that surrounds it. Some of these become easily found/understood with that context. But there have been enough times that I have had to look up abbreviations and acronyms and have been confused, and have chosen the wrong definition, because the context was not helpful. (Perhaps just a lack of familiarity with the English language or its modern uses.)

kotp commented 3 years ago

"NASA, the American Space Agency"

I would avoid capitalizing the last three words there in case it too strongly suggests that they are part of the abbreviation. It might lead someone to convince themselves, falsely, that it is "North American Space Agency", which would be unfortunate. This is a good example, in a way, of how our words influence what someone might believe is authoritative. Instead, "NASA, the United States space agency" would be informative, without suggesting the tie to the abbreviation. ("American" is associated with two continents around these parts, yet most of the countries in America are not associated with NASA.)

SleeplessByte commented 3 years ago

I think those are good rules.

I would tend not to expand well-known abbreviations like NASA, JPEG, etc, and instead just explain what they are. e.g. "NASA, the American Space Agency" or "JPEG, a common image file format" unless it was important to be precise: "The Joint Photographic Experts Group (JPEG) define the popular JPEG File Interchange Format. Given a file in this format...". But whatever. Either way is fine.

Yah.

I think those are good exceptions to the rule, and shouldn't be the rule themselves.

My suggestions are all based on explanation. So if the expanded form doesn't offer explanation, don't expand, but explain 🔥🔥🔥

cmcaine commented 3 years ago

Cool. Your rules before imply that the expanded form should always be given, so that could do with some rewording in the final guidance, maybe, if the consensus is that all terms should be explained, but not necessarily expanded.

SleeplessByte commented 3 years ago

Yah I have been convinced by all of you that occasionally we shouldn't expand. Then I was thinking: well, why do we expand in the first place?

It's because we try to explain. So the core reasoning is that we want to be clear. Therefore the rule should be: always explain.

cmcaine commented 3 years ago

@iHiD, there seems to be some consensus that a variant of SleeplessByte's suggestions is good. Here's my attempt at a revision to match:

BEGIN

Abbreviations are often more difficult to understand than other phrasing options.

  1. Many abbreviations are jargon. Avoid jargon where possible by using alternative language.
  2. Don't use abbreviations unless either:
    1. the abbreviated term is better known than the unabbreviated term
    2. the text will be excessively verbose without abbreviation
  3. When using abbreviations always explain what the term means on its first use
    1. This will often, but not always, include expanding the abbreviation. Note that it will rarely be sufficient to only expand the abbreviation.

Guidance examples:

  1. Prefer "if I recall correctly" to "IIRC", "as far as I know" to "AFAIK" and so on.
  2. Prefer queue and stack to FIFO and FILO.
  3. Instead of describing an interface as "RESTful", say what specific properties it has.
  4. Don't talk about CRUD if you don't need to.
  5. Don't use a jargon-heavy theme unless it really adds to the exercise.
  6. "HyperText Markup Language (HTML) is the language used to describe document structure and content on the web" (expanded and explained)
  7. "DNA, a set of chemical instructions that influence how our bodies are constructed" (not expanded because "deoxyribonucleic acid" is unlikely to help explain what DNA is to our audience)
  8. "NASA, the United States' space agency, launched the Mariner 2 space probe in..." (not expanded because the "National Aeronautics and Space Administration" is much better known by its acronym than by its expanded name)
  9. "The Department of Motor Vehicles (DMV) is filled with sloths. That's why everything takes forever at the DMV"

END

Separate section:

When the theme of an exercise includes a potentially complex or confusing topic that doesn't matter to complete an exercise (e.g. DNA and nucleotides in the nucleotide count and Hamming distance exercises) include a disclaimer that it doesn't need to be understood to complete the exercise.

This revision doesn't account for Latin-derived abbreviations like e.g., i.e., etc. I think they should be explicitly excepted from this rule or barred entirely (I prefer that they are permitted and excepted from the requirement to explain).

I've deliberately excluded this example because I don't think it is very clear: "HTML (HyperText Markup Language) doesn't lend itself well to represent DNA (Deoxyribonucleic Acid)".