Flag translatable strings with multiple different translator comments

swissspidy commented 6 years ago

Hi all!

I've been encouraged to submit this possible enhancement to WPCS, so here I am :-)

Most recently I've been working on wp-cli/i18n-command, a WP-CLI command used to create POT files from WordPress projects (core, plugins, themes).

Since I've always liked the I18N sniffs in WPCS, I decided to incorporate something similar to flag strings with missing translator comments and the like.

The relevant part of the code can be found here: https://github.com/wp-cli/i18n-command/blob/963a56727d9215282c373452cb0fda0bea654879/src/MakePotCommand.php#L561-L656.

If you want to test it yourself, you can download the command individually or install WP-CLI 2.0 (currently via wp cli update --nightly) and then run wp i18n make-pot /path/to/source.

One piece that is included but isn't part of WPCS is a check for strings with multiple different translator comments. This can be potentially bad for translators because they

a) see multiple different translator comments in their translation tool (Poedit, translate.wordpress.org) b) strings can have different translations depending on the / variables

When encountering such a string it's usually a good sign that the translator comments should either be consolidated/unified so they're all the same, or that the strings should be separated completely by adding a different gettext context.

For example, the string More information about %s exists three times in WordPress core. Once, the translator comment is %s: plugin name and version, the second time it's %s: Importer name and in the third instance there's no comment at all.

The WP-CLI command then prints a warning like this:

Warning: The string "More information about %s" has 2 different translator comments. (wp-admin/import.php:175)

Note that the WP-CLI command does not print that warning when two strings have a different gettext context already. In that case, they're treated as two separate strings with a single translator comment each.

If this is something WPCS could handle as well, I'm sure it would benefit many developers by making their strings even better translatable.

JDGrimes commented 6 years ago

Thanks for creating this issue, @swissspidy. I'm not sure how easily WPCS could handle a check like this, since it scans one file at a time, not necessarily analyzing it in context of the complete codebase. But it is definitely something to keep in mind.

jrfnl commented 6 years ago

I like this idea and IMHO this is quite doable with PHPCS, though we may need to make allowances for the parallel option.

This could be implemented in a similar way as the upstream DuplicateClass sniff, where, in this case, we'd record/cache the translatable string as the key of an array with as values translators comment, context, file and line. Using that information we could check if the same string - with where applicable the same context - has been encountered before, if it has a translators comment and if so, if the translators comments match. If not, a warning could be displayed with the translators comment, name of the file + line nr from where the first string was found.

Considering the work done on the WP-CLI extension, the only question I have is whether it is desirable to maintain two - potentially slightly different - implementations of the same.

Should/Could the WP-CLI extension possibly run PHPCS with just the I18n sniff under the hood and present the results in the WP-CLI format ?

On another note and inspired by this issue - if we would be recording the translatable strings encountered anyway, what do you all think about adding a "Found a nearly identical translatable string" warning ? This could catch for instance things like Full name vs Full Name or The number of things vs The numbre of things (misspelled on purpose for the example).

GaryJones commented 6 years ago

what do you all think about adding a "Found a nearly identical translatable string" warning

+1, though I think that should be off by default, but could be enabled (like we do with textdomain checks), for those that would like it. I could otherwise imagine plenty of cases where "nearly identical" strings would be reported for legitimate instances, and while it would only be a warning, plenty of folks treat them the same as errors and aim to have them all addressed.

jrfnl commented 6 years ago

@GaryJones

though I think that should be off by default, but could be enabled (like we do with textdomain checks),

Well, this is a slightly different case though. For the text-domain, the sniff literally can't check anything without user input as it's unclear what to check against.

A "similar text" check could however always be done, i.e. it doesn't need user-input for it to work - and "hiding" it behind an on-off toggle may just prevent people from using it altogether.

Then again, as you say, this check may not need to be always-on and could be - similar to the text-domain fixer proposal - a sniff in the Utils category which would only be run once in a while on-demand.

As the sniff would need to remember and compare every text on each new string found, I expect this check to be pretty resource intensive, so that would another good argument for it to be a Util sniff/"off" by default.

Implementation wise I was thinking along the lines of adding a public (customizable) property which determines how similar the text should be before it should be reported. The default for this could be a pretty high number, like 98% which should cause very few false positives. 100% would be ignored as that would mean the texts are the same, which is good from an i18n point of view, so setting this to 100% would be a way to turn the check off altogether. People could (temporarily) pass a lower percentage to see less precise results.

I'd say, let me have a try to see if I can get it working with any degree of accuracy first and run some tests with it to see what would be a reasonable number for balance. We can then always decide to have 100% (= off) as the default value.

swissspidy commented 6 years ago

This could be implemented in a similar way as the upstream DuplicateClass sniff, where, in this case, we'd record/cache the translatable string as the key of an array with as values translators comment, context, file and line.

I'd probably use context\004original as the cache key, like gettext does. Makes the comparison easier. When just using translatable string as the key you'll quickly end up with multiple entries you then have to loop through to check the context.

On another note and inspired by this issue - if we would be recording the translatable strings encountered anyway, what do you all think about adding a "Found a nearly identical translatable string" warning ?

Very interesting. I'll keep an eye on that :-)

Not sure it would be a good fit for the WP-CLI command, but I'd definitely use it in PHPCS.

Considering the work done on the WP-CLI extension, the only question I have is whether it is desirable to maintain two - potentially slightly different - implementations of the same.

Should/Could the WP-CLI extension possibly run PHPCS with just the I18n sniff under the hood and present the results in the WP-CLI format ?

I've been asking myself this as well when I started with the initial implementation.

Including PHPCS just for this would be a bit overkill. It's rather easy for us to do it ourselves because we already extract all the strings anyway and we just have to loop through an array at the end.

Plus, we do things PHPCS can't check:

Translatable strings in JavaScript
Including/excluding translatable strings from certain files
Merging the extracted strings with an existing POT file

In all these scenarios we can still determine whether there are issues or not with the resulting set of strings.

WordPress / WordPress-Coding-Standards

Flag translatable strings with multiple different translator comments #1444