[DX] Support some basic markdown-flavoured formatting in t()

klonos commented 4 years ago

Well, more like translator experience rather than DX, but I would like to propose that we introduce some basic markdown support in t(), to reduce the amount of HTML tags in our translatable strings, and make them shorter at the same time. I thought that it would be nice to support something that is popular in documentation writing, such as markdown style.

t() allows for @, %, and ! for variable "placeholders" within the translatable string. Although @ and ! are special in that they respectively pass the text trough check_plain() or not, % is basically the same as @, with the only difference being that it also wraps the text in  when rendered.

% seems to have been arbitrarily selected at a time when no other popular format was around, and is kinda "develop-y" if you ask me. With markdown being such popular in the documentation world, I would like to propose supporting the following in t():

*: works exactly the same as %, wrapping the text around . We still keep % for backwards compatibility.
**: works the same as %, but wraps the text around  instead of .
***: works the same as %, but wraps the text around both  and .
` (back-tick): works the same as % , but wraps the text around <code> instead of .

So instead of this:

backdrop_set_message(t('Here is something in %italics.', array('%italics' => $some_text)), 'info');
backdrop_set_message(t('Here is something in <strong>@bold</strong>.', array('@bold' => $some_text)), 'info');
backdrop_set_message(t('Here is something in <strong><em>@bold_and_italics</em></strong>.', array('@bold_and_italics' => $some_text)), 'info');
backdrop_set_message(t('Here is some <code>@code</code>.', array('@code' => $some_code)), 'info');

...we could be doing this, which I believe is both shorter and easier for translators (especially the ones already familiar with markdown):

backdrop_set_message(t('Here is something in *@italics*.', array('@italics' => $some_text)), 'info');
backdrop_set_message(t('Here is something in **@bold**.', array('@bold' => $some_text)), 'info');
backdrop_set_message(t('Here is something in ***@bold_and_italics***.', array('@bold_and_italics' => $some_text)), 'info');
backdrop_set_message(t('Here is some `@code`.', array('@code' => $some_code)), 'info');

Advocate for this feature: @klonos

ghost commented 4 years ago

How would this work re. backwards compatability? I.e. strings that already include those characters (not for formatting purposes)...

Also, if possible, what about including support for all/more markdown syntax?

indigoxela commented 4 years ago

How would this work re. backwards compatability?

Good point, I don't think this would work. Switching to markdown retroactively when there are already so many strings in core and contrib is almost a guarantee for trouble.

indigoxela commented 4 years ago

Here's an example how one could wrap some translatable text in <code> without using tags inside translatable strings.

$text_in_markup = '<code>' . t('inner text') . '</code>';
$message = t('Here is some other text containing the !markup', array(
  '!markup' => $text_in_markup,
));
backdrop_set_message($message, 'info');

And just code, without translating the inner string:

$code = '<code>if (!isset($something)) {}</code>';
$message = t('Here is some other text containing the code: !code', array(
  '!code' => $code,
));
backdrop_set_message($message, 'info');

This works without any changes in core.

klonos commented 4 years ago

How would this work re. backwards compatability? I.e. strings that already include those characters (not for formatting purposes)...

Good question! ...I don't have everything figured out 100% yet, but I did a quick search across the codebase, and back-ticks seem to only be used in a few .md files, some inline code comments, and some other non-php code. So we're OK there.

Now, specifically for use within t(), I've run a regex search for (?<![A-Za-z])t\('.*.*'` and nothing came up.

A similar search for (?<![A-Za-z])t\('.*\*.*' returned only 10 results, most of which were "false-positives", such as in https://github.com/backdrop/backdrop/blob/1.x/core/modules/views/plugins/views_plugin_display_block.inc#L286 :

'wildcard' => t('Wildcard (*)'),

Such cases can be accounted for, by looking for a minimum of 2 asterisks in the text (pairs of * within text - one before, and one after the bolded/italicized portion).

There was one legitimate case though, which would break the formatting of the text, as it had two asterisks in different places in the same text: https://github.com/backdrop/backdrop/blob/1.x/core/modules/layout/plugins/access/path_layout_access.inc#L74

t('Enter one path per line. The "*" character is a wildcard. Example paths are "node/1" for an individual piece of content or "node/*" for every piece of content. "@front" is the front page.', array('@front' => '<front>'))

Possible solutions I can think of in such cases are:

split the text into 2 concatenated t()s (so that there's only a single occurrence of an asterisk is each t()) - not a big fan of that 👎
allow escaping asterisks and back-ticks

Also, if possible, what about including support for all/more markdown syntax?

I'm afraid that we'd be better off using a 3rd party library in that case (such as https://github.com/erusev/parsedown). This would be a much bigger change than what I had in mind, but sure, it can be a follow-up.

My goal here was to reduce pain points for translators, by removing the need for HTML tags in order to do some common formatting (bold, italicized, and the combination of those). Using markdown instead, with the assumption that most of translators are either already familiar with, or it'd be easier for them to learn than HTML (asterisks are easier to type, and less lengthy than the  and  tags).

Switching to markdown retroactively when there are already so many strings in core and contrib is almost a guarantee for trouble.

Won't argue there, but as mentioned earlier, a preliminary search resulted in just a couple of cases in core 🤷

Here's an example how one could wrap some translatable text in <code> without using tags inside translatable strings.

Ummm no! HTML tags in translatable strings cannot (should not?) be avoided, and we have in fact been using <code> and  in t() throughout core for ages (even before Backdrop).

I (sorta*) agree with the second code example you've provided @indigoxela (where the placeholder may contain only code/variables and other non-translatable text), but not with the first example, where a portion of a translatable text is split to another variable. Please take the time to read through https://www.drupal.org/docs/7/api/localization-api/dynamic-or-static-links-and-html-in-translatable-strings

The general rule of thumb for translatable strings is that we keep inline markup for translation but avoid block tags ...When it comes to inline markup however, things are different. It would not be wise to split up a sentence just because you try to emphasize a part ...This achieves HTML and text separation, but makes life hard for you and translators alike. In these simple cases, just include the inline markup, and let the translators live with the fact. It provides them more context instead of the two unique strings you'd have with the above example.

The emphasis above is mine, and it's the problem I'm trying to solve (not eliminate entirely - but improve). If translators have to "live with" the fact that inline markup is included in translatable text, then IMO it would be better to use something simpler/easier than HTML; and arguably markdown fits that.

* if we were to split formatting of non-translatable things outside t() for the purpose of formatting it (that is wrapping it in HTML tags), then I'd rather we introduced some additional placeholders. Rationale for this: t() already supports @, %, and ! as placeholders. The @ and ! are "special" in that they decide if things will be passed through check_plain() or not. % on the other hand is basically the same as @ (you can say that it just "extends" it), with the only difference that it wraps the variable in . Perhaps we could introduce some additional placeholders (such as # and +), which do the same thing but render  and <code> instead. I'd support something like that, which feels that should happen in backdrop_placeholder(), but separate issue please.

klonos commented 4 years ago

Anyway, here's a PoC PR: https://github.com/backdrop/backdrop/pull/3240

Please see the PR for instructions on how to test things.

indigoxela commented 4 years ago

@klonos You don't seem to be aware that this also affects all translations. :-1:

klonos commented 4 years ago

I'm aware, we already have an issue label for that 😉

Besides, the only text that I have found to be affected is https://github.com/backdrop/backdrop/blob/1.x/core/modules/layout/plugins/access/path_layout_access.inc?rgh-link-date=2020-08-13T13%3A41%3A07Z#L74

indigoxela commented 4 years ago

I'm aware, we already have an issue label for that

That's not what I meant.

If translations are using markdown, we require translators to be aware of that and they also need to know, what they must escape and how to do that. Is it a bit clearer now, what my concern is?

Think of a non-technical user who wants to translate following text: text with a quintuple asterisk: ***** (rendered as a single bolded asterisk) or this ***bold and italicized*** vs \*\*\*regular text\*\*\*

klonos commented 4 years ago

Whether we like it or not, the official guidelines say that the option we currently have for example is this:

backdrop_set_message(t('This is a very important message that needs to be translated.'));

Again, this ^^ (using inline HTML tags) is the recommended way to do things as per the Drupal Localization API documentation - I'm not making it up.

This is NOT advised:

$important = '<strong>' . t('a very important message') . '</strong>';
$translated = t('needs to be translated');
backdrop_set_message(t('This is @important that %translated.', array('@important' => $important, '%translated' => $translated)));

So what I am suggesting with this feature is to allow this:

backdrop_set_message(t('This is **a very important message** that *needs to be translated*.'));

So, from a translator PoV, compare this:

This is a very important message that needs to be translated.'

...to this:

This is **a very important message** that *needs to be translated*.

The Drupal official documentation literally states that translators "should just live with it" (HTML). I am suggesting that we support and endorse markdown, while still keeping the HTML tags support for backwards compatibility.

Markdown is a markup language that is easily readable in the source format, and can be rendered into HTML. It has limited formatting options (still more than enough for what we need in t() functions), but is widely used in README files and note-taking apps. Non-developers understand it.

HTML is the format used on web sites to render content. You need web dev skills to understand it, and non-developers often get confused by HTML tags within text (much more than Markdown).

klonos commented 4 years ago

@indigoxela this was meant as a demonstration to show that the code allows escaping asterisks if need be; it was not meant to be a real life example 😄

***bold and italicized*** vs \*\*\*regular text\*\*\*

I cannot think of a real-life use case where one would need to wrap text in multiple escaped asterisks, but if someone does, the code supports it.

My point was that this:

***bold and italicized***

...is much easier to understand and translate, than this:

bold and italicized

klonos commented 4 years ago

...for the fun of it, I asked my wife (non-developer) to translate this to Greek:

bold and italicized

...and she did:

<δυνατός>έντονα και πλάγια</δυνατός> (she was not sure how to translate "em" 🤷 )

See my point now?

ghost commented 4 years ago

Perhaps we could introduce some additional placeholders (such as # and +), which do the same thing but render  and <code> instead. I'd support something like that, which feels that should happen in backdrop_placeholder(), but separate issue please.

Would that not be a possibly better alternative to this issue...?

klonos commented 4 years ago

Would that not be a possibly better alternative to this issue...?

Not really. It would solve situations where we need to format variables within text, like this:

backdrop_set_message(t('There are @total users with the %role role.', array('@total' => ..., '%role' => ...)));

...but it would not solve the formatting issue where we need to format literal words like the example I've provided in earlier comments:

backdrop_set_message(t('This is a very important message that needs to be translated.'));

ghost commented 4 years ago

Ah true...

klonos commented 1 year ago

...not directly related to this feature request here, but another testament to the popularity of at least some of the conventions of markdown, which apparently have now started making their way to WYSIWYG editors (this is the new "autoformatting" feature in CKEditor 5):

avpaderno commented 1 year ago

I like the idea of using less HTML markup in translatable strings, but I see these "issues" with that.

There are 7441 strings that are passed to t(), in Backdrop core, if I correctly understood what 7441 source strings in total on https://localize.backdropcms.org/ means. Changing all those strings to replace HTML with Markdown would make Backdrop available only in English, until the translators don't translate the new strings.
If the strings passed to t() are not changed to use Markdown in a single step, translators would see HTML in some strings and Markdown in other strings. That could be more confusing for them.
The translation server running on https://localize.backdropcms.org/ is using Drupal 7. If there are changes that need to be done in the server, for example to show a help text that explains to translators that the strings to translate use Markdown, that would require changing a Drupal 7 module.

docwilmot commented 1 year ago

I think the idea is kind of cool though. Would it do to add an options key in t() to allow md? t('Here is something in %italics.', array('%italics' => $some_text), array('md' =>TRUE)) Maybe even add a shortcut to core (fake code follows):

function tm($string, array $args = array(), array $options = array()) {
  return  t($string, array $args = array(), array $options = array('md' =>TRUE))
}

docwilmot commented 1 year ago

P.s: I havent tried to understand the concerns above, just spitballing here.

klonos commented 1 year ago

@kiamlaluno not all t() strings have % or <code> in them, so it would be less than those 7441 strings to update translations for. I'm not saying that it won't still be some serious work for translators - I understand that, and it should definitely be factored in.

I like @docwilmot's idea of introducing a new tm() function (for "t markdown"), and only start using it in new strings, perhaps also slowly converting old instances of t() strings as we go.

I would go a step further to suggest that we make the default t() allow markdown, and then convert all current instances of t() to something like tl() (for "t legacy"). I think that this would make it easier to transition to having the t() we all know simply support markdown.

Perhaps we don't need all that if we make things backwards compatible (still accept the good old @, %, and !) and simply add some markdown alternatives for them.

Perhaps we could script-convert all current translation files too? 🤔

avpaderno commented 1 year ago

I have now an account on localize.backdrop.org to be able to translate to Italian (my first language).
I am checking all the translatable strings and I found only a string that uses HTML markup (on 330 strings I checked): Set this to Yes if you would like this category to be selected by default.

avpaderno commented 1 year ago

On localize.backdropcms.org, I counted 658 strings using HTML markup, including those using <a>. <li>, , and other "more complex" tags on a total of 9812 strings. (Let's round it up to 700 over 9800 strings.)

Some translatable strings could use HTML markup, but they don't use it, for example those strings with a reference to a function, which should show the function name, as in <code>backdrop_alter()</code>.
I also noticed some strings with reference to Drupal or Drupal.org. I am not sure those are strings used by Backdrop code; they seem more strings imported by the Drupal localization server.

I am not sure it's worth implementing this feature, considering there are strings that use <a>.

indigoxela commented 1 year ago

Just a quick warning: t() isn't only used in core, but also in all contrib and probably custom modules, and also in themes. As far as I understand the alternate approach described here in some comments, this would also break i18n (the module extending core translation features).

IMO the behavior of the t() function, which is so crucial in D and B, is very hard to change. At least without breaking something else.

backdrop / backdrop-issues

[DX] Support some basic markdown-flavoured formatting in t() #4535