Closed dae closed 2 years ago
Would such a script live in the lib crate rslib/i18n/src
? I assume you have a release script somewhere that would call this script?
This would be done with regex, right? Maybe some stricter naming conventions would need to be enforced then. Two problems come to mind:
tr
, mainly in svelte, where there's a collision with the HTML tag of the same name.someString2
could be some-string-2
or some-string2
, and indeed, both styles are currently in use.Actually, the second point isn't as much of a problem for the new script as for the already existing code, because two strings could be mapped to the same camel case name. So maybe the second variant should be banned regardless.
It might make more sense as a separate crate like rslib/i18n_helpers - while we could add [[bin]] entries into rslib/i18n, that library is intended for embeddding and includes the ftl data with include_str!(), which we might not want/need for the helpers. But no strong preference either way.
Good points on the naming issues. Regex is likely going to be a fair bit simpler than trying to navigate ASTs, and adjusting our tr references into a consistent format is probably the easiest approach (if avoiding 'tr' as a separate variable is not practical, maybe we could standardize on one alternate name for cases when tr can't be used).
We could initially test with just the desktop strings, and attempt to remove any unused ones - if the build breaks, we'll know a reference has not been caught.
The numbers issue is a bit of a pain. I think we'll need to:
a) write a new bazel test that reports any keys that are named inappropriately, so we don't introduce new ones - I'd probably be lazy and use Python for this, but up to you b) write a script to fix the current ones.
For b), there's an ftl/transform-string.py script that is close, but it was intended for changing the text of a message, not its title. It can presumably be copied and adjusted fairly easily. It expects an 'anki-i18n' folder to be one folder above the desktop checkout, and inside it should be 'core' and 'qtftl', corresponding to the core and qt translation repos (eg https://github.com/ankitects/anki-core-i18n). You could send PRs through for the changes to those repos, but it's probably better for me to run the script here, so I can do it immediately after a translation update to minimize the chance of conflicts.
if avoiding 'tr' as a separate variable is not practical, maybe we could standardize on one alternate name for cases when tr can't be used
That seems to be doable. I could only find the aliases "tr2" and "translate" so far. As for false positives, there are a number of other attributes of "tr"/"I18n", but I guess those are unlikely to obscure the name of any translation string, so it doesn't matter if they get added to the hashset.
Do you have a preference how numbers should be separated? Apart from the letter-number case, there would need to be a consistent rule for adjacent numbers (a12
vs. a1-2
vs. a-12
vs. a-1-2
).
Would a) go into ftl/format.py/check_file()
? Should I put b) in the same folder?
I've just discovered tr.statisticsRange_1YearHistory()
, so it looks like I've been wrong and the camel case conversion is really done in an unambiguous way. 😃
Sorry for the false alarm. This makes things much easier.
Here's another blow: The fluent crate doesn't implement serialization. There's https://github.com/projectfluent/fluent-rs/pull/184, but it is outdated and not compatible anymore. Should I try to fix it for the current version? The AST types seem to have changed quite a bit, but I don't think it would be much more work than redoing my previous work in Python, because apart from writing out the ftl files, the prototype is mostly finished. I might be wrong though.
Ah, that's a shame. We do have a third option as a backup plan - just printing the orphaned strings to the console, and relying on a user to clean them up manually.
Having the ability to serialize FTL in the future could be useful, so perhaps it's worth looking into how much work would be required to bring #184 up to date. If it looks like it'll take much more than an hour or two, then perhaps it might be best to just kick the problem down the road, and print unused keys to the console for now?
I've looked into it and it doesn't seem like it will take more than a few hours. For one thing, I can implement it for string slices instead of a generic type.
The codegen we used for .ftl files (eg tr.browsing_all_fields()) is great for catching references to missing strings, but does not help us in knowing which strings are no longer used by the codebase. Currently we tend to remove strings from the .ftl files at the same time they are removed from the codebase, but this means that minor bugfix releases can not update to the latest translations.
Another thing to consider is that AnkiMobile, and likely AnkiDroid in the future, also reference the translations, and they need to be considered too.
To address this, I think we need a few scripts to implement a garbage collection approach:
anki-stable.json anki-HEAD.json ankimobile-stable.json ankidroid-stable.json
Where 'stable' corresponds to the last major update (eg 2.1.45), not the absolute latest point release. We'd probably want to check the -stable files into git; the -HEAD one we could generate on the fly.
The script would read each of those files, and build a set()/HashSet of all used strings. It would then iterate over the .ftl files, build them into an AST (eg transform-string.py or rslib/i18n/extract.rs), remove the unused keys, and then write the updated .ftl files back out.
We could then periodically update a -stable file when a milestone release is met, and old strings could be cleaned up over time.