Tracking used Fluent strings

dae commented 3 years ago

The codegen we used for .ftl files (eg tr.browsing_all_fields()) is great for catching references to missing strings, but does not help us in knowing which strings are no longer used by the codebase. Currently we tend to remove strings from the .ftl files at the same time they are removed from the codebase, but this means that minor bugfix releases can not update to the latest translations.

Another thing to consider is that AnkiMobile, and likely AnkiDroid in the future, also reference the translations, and they need to be considered too.

To address this, I think we need a few scripts to implement a garbage collection approach:

a script that extracts tr references from our Python, Rust and Typescript files into a set, sorts them, and writes them out to a json file. The script should accept an argument to the source folder root, so it can be extended to catch references in Swift, Java and Kotlin source files in the future
another script that takes two arguments: the ftl/core folder, and another folder that contains one or more .json files that store the extracted references. It might contain files like:

anki-stable.json anki-HEAD.json ankimobile-stable.json ankidroid-stable.json

Where 'stable' corresponds to the last major update (eg 2.1.45), not the absolute latest point release. We'd probably want to check the -stable files into git; the -HEAD one we could generate on the fly.

The script would read each of those files, and build a set()/HashSet of all used strings. It would then iterate over the .ftl files, build them into an AST (eg transform-string.py or rslib/i18n/extract.rs), remove the unused keys, and then write the updated .ftl files back out.

We could then periodically update a -stable file when a milestone release is met, and old strings could be cleaned up over time.

RumovZ commented 3 years ago

Would such a script live in the lib crate rslib/i18n/src? I assume you have a release script somewhere that would call this script? This would be done with regex, right? Maybe some stricter naming conventions would need to be enforced then. Two problems come to mind:

The ftl strings aren't always imported as tr, mainly in svelte, where there's a collision with the HTML tag of the same name.
Mapping of camel case to kebab case is ambiguous for characters without case, i.e. numbers. someString2 could be some-string-2 or some-string2, and indeed, both styles are currently in use.

Actually, the second point isn't as much of a problem for the new script as for the already existing code, because two strings could be mapped to the same camel case name. So maybe the second variant should be banned regardless.

dae commented 3 years ago

It might make more sense as a separate crate like rslib/i18n_helpers - while we could add [[bin]] entries into rslib/i18n, that library is intended for embeddding and includes the ftl data with include_str!(), which we might not want/need for the helpers. But no strong preference either way.

Good points on the naming issues. Regex is likely going to be a fair bit simpler than trying to navigate ASTs, and adjusting our tr references into a consistent format is probably the easiest approach (if avoiding 'tr' as a separate variable is not practical, maybe we could standardize on one alternate name for cases when tr can't be used).

We could initially test with just the desktop strings, and attempt to remove any unused ones - if the build breaks, we'll know a reference has not been caught.

The numbers issue is a bit of a pain. I think we'll need to:

a) write a new bazel test that reports any keys that are named inappropriately, so we don't introduce new ones - I'd probably be lazy and use Python for this, but up to you b) write a script to fix the current ones.

For b), there's an ftl/transform-string.py script that is close, but it was intended for changing the text of a message, not its title. It can presumably be copied and adjusted fairly easily. It expects an 'anki-i18n' folder to be one folder above the desktop checkout, and inside it should be 'core' and 'qtftl', corresponding to the core and qt translation repos (eg https://github.com/ankitects/anki-core-i18n). You could send PRs through for the changes to those repos, but it's probably better for me to run the script here, so I can do it immediately after a translation update to minimize the chance of conflicts.

RumovZ commented 3 years ago

if avoiding 'tr' as a separate variable is not practical, maybe we could standardize on one alternate name for cases when tr can't be used

That seems to be doable. I could only find the aliases "tr2" and "translate" so far. As for false positives, there are a number of other attributes of "tr"/"I18n", but I guess those are unlikely to obscure the name of any translation string, so it doesn't matter if they get added to the hashset.

Do you have a preference how numbers should be separated? Apart from the letter-number case, there would need to be a consistent rule for adjacent numbers (a12 vs. a1-2 vs. a-12 vs. a-1-2).

Would a) go into ftl/format.py/check_file()? Should I put b) in the same folder?

RumovZ commented 3 years ago

I've just discovered tr.statisticsRange_1YearHistory(), so it looks like I've been wrong and the camel case conversion is really done in an unambiguous way. 😃 Sorry for the false alarm. This makes things much easier.

RumovZ commented 3 years ago

Here's another blow: The fluent crate doesn't implement serialization. There's https://github.com/projectfluent/fluent-rs/pull/184, but it is outdated and not compatible anymore. Should I try to fix it for the current version? The AST types seem to have changed quite a bit, but I don't think it would be much more work than redoing my previous work in Python, because apart from writing out the ftl files, the prototype is mostly finished. I might be wrong though.

dae commented 3 years ago

Ah, that's a shame. We do have a third option as a backup plan - just printing the orphaned strings to the console, and relying on a user to clean them up manually.

Having the ability to serialize FTL in the future could be useful, so perhaps it's worth looking into how much work would be required to bring #184 up to date. If it looks like it'll take much more than an hour or two, then perhaps it might be best to just kick the problem down the road, and print unused keys to the console for now?

RumovZ commented 3 years ago

I've looked into it and it doesn't seem like it will take more than a few hours. For one thing, I can implement it for string slices instead of a generic type.

ankitects / anki

Tracking used Fluent strings #1472