eemeli / message-resource-wg

Developing a standard for Unicode MessageFormat 2 resources

Enumerate supported metadata/properties for messages, sections & resources #19

Open eemeli opened 9 months ago

eemeli commented 9 months ago

In addition to supporting @tags in general at the resource syntax level, we should figure out common meanings for some property tags. Doing so would also inform further discussion on how to determine which (if any) properties might have any formatting runtime impact.

The following prior art may be relevant, especially as it'll define developer expectations. Are there other similar definitions that are relevant?

To get this started, at least these tags should be considered (in no particular order):

Note: List updated 2024-12-18 based on comments from @bcolsson and @flodolo.

@version

Allows for explicitly versioning a source string, so that it can be changed. This allows for differentiating typo fixes from actual changes in message contents. This doesn't have a runtime impact, but the (id, version) tuple can be used by tooling instead of just the message id to uniquely identify a message and its translations. The @version value probably should not be fixed to mandate semver or any other spec, but also allow date strings or anything else -- as long as the value is new for this message, it can be treated as a new version.
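As a rough illustration of the (id, version) idea, and not part of the proposal itself, tooling could key stored translations on the pair rather than on the id alone. The store layout and the `find_translation` name below are assumptions made for this sketch; versions are treated as opaque strings and only compared for equality.

```python
# Hypothetical translation store keyed by (message id, source version).
# Versions are opaque strings: semver, dates, or anything else would work.
translations = {
    ("greeting", "1"): "Hei, maailma!",
}


def find_translation(store, message_id, version):
    """Return the translation for this exact source version, or None."""
    return store.get((message_id, version))


# A typo fix keeps the version, so the existing translation still applies.
still_valid = find_translation(translations, "greeting", "1")

# A real content change bumps the version, so the lookup misses and the
# message shows up as needing translation again.
needs_work = find_translation(translations, "greeting", "2")
```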

@param

For documenting variables. No runtime impact, but very significant for translators. Having a well-defined structure for this tag is pretty important, at least to identify the variable that the description pertains to. In addition to describing the variable in words, it could include:

@obsolete

(was @deprecated)

Explicitly mark a message (also a section/entire resource?) as obsolete. This could be used in workflows where messages are not immediately removed when they are no longer referenced by code, but kept in to support patch releases for previous versions. During translation, this can be used to de-prioritize such messages. This tag could include a way to note some version or timestamp when the removal happened, or be paired with a second @removed-in or similar tag.

@locale

Establish the locale code for messages, probably only at the resource level? Many localization systems depend on the locale code being effectively encoded in the path, but being able to represent it within the resource could prove very useful (much like XLIFF). This could well have a runtime impact as well, esp. when accounting for fallback and the formatting of messages in resources coming from a different locale.

@format

While this resource format is being designed primarily with MF2 in mind, it's at least possible to consider supporting other message formats within it as well. Experiences with .properties have shown that being able to explicitly define the format for a resource (or even a single message) would make its processing significantly easier, both during translation and formatting.

@schema

MF2 messages will use functions. The core/default set will include a few like :number and :string, but implementations and users are free to extend and override these with their own. We should be able to define a reference in a resource to the schema or registry that's defining such messages. This tag may have a formatting runtime impact, if an implementation can use it to load the required functions dynamically.

@do-not-translate

Mark a message or section as fixed for all locales. It should be available in all locales, but always hold the same value.

@max-length

Limit the length of a formatted message. Requires at least a numerical qualifier, possibly with a units indicator. Default should be "characters" or "code points", but alternatives like "bytes" and "lines" could also be supported. Probably no formatting runtime impact?
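To make the unit question concrete, here is a minimal sketch of how a linter might measure a formatted message against a limit. The function names are hypothetical, "chars" is taken to mean Unicode code points, and "bytes" is assumed to mean UTF-8; none of that is settled in this thread.

```python
def measured_length(text, unit="chars"):
    """Measure a formatted message in the given unit (assumed semantics)."""
    if unit == "chars":
        # Python's len() on str counts Unicode code points.
        return len(text)
    if unit == "bytes":
        # Assumption: bytes are counted in the UTF-8 encoding.
        return len(text.encode("utf-8"))
    if unit == "lines":
        return len(text.splitlines())
    raise ValueError(f"unsupported unit: {unit}")


def fits(text, limit, unit="chars"):
    """True if the message is within the @max-length limit."""
    return measured_length(text, unit) <= limit
```

The same three-character Japanese string measures 3 in code points but 9 in UTF-8 bytes, which is why the default unit matters.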

@allow-empty

Explicitly mark a message with an empty pattern as valid. Most empty messages are mistakes, so being able to mark ones that may be empty would be useful. Should probably be accompanied by an explanatory note.


Do you agree with all of the above? Are there aspects that I've not accounted for? What other tags should we be considering?

bcolsson commented 9 months ago

I agree with all the proposed tags.

In addition to the above, I could see a metadata tag for marking a particular string to not be translated being useful, especially for things like copyright or other legal lines. This would also be useful for excluding certain messages in a resource file from being exposed to localization.

I believe in one of the issues tagged here someone else mentioned this, but metadata for indicating character limitations would also be useful.

eemeli commented 9 months ago

I agree with all the proposed tags.

🎉

In addition to the above, I could see a metadata tag for marking a particular string to not be translated being useful, especially for things like copyright or other legal lines. This would also be useful for excluding certain messages in a resource file from being exposed to localization.

Maybe something like @readonly? Another option would be @translate no, but that seems a bit clumsy, as it implies that something like @translate yes might be useful sometimes, and that's a bit confusing.

I believe in one of the issues tagged here someone else mentioned this, but metadata for indicating character limitations would also be useful.

Would just characters be a sufficient unit for this? If so, then we could have @max-length 42. Otherwise we'll need to somehow allow for e.g. words and lines as units. Also, is there a need for a @min-length as well?

It would probably also be useful to comment on tags that are mostly handled programmatically. So something like:

@readonly - Must be kept exactly the same in all locales
@max-length 42 chars, due to layout limitations

In syntax terms, this could be handled by only considering the first N space-separated tokens as significant, and allowing a - prefix on subsequent comments.
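One possible reading of that rule, sketched in Python. The per-tag token counts in `TAG_ARITY` are assumptions for illustration only (here @max-length takes one significant token, so anything after the number is treated as comment), and the optional leading - is dropped from the comment.

```python
# Hypothetical: number of significant tokens after the tag name.
TAG_ARITY = {"readonly": 0, "max-length": 1}


def parse_tag_line(line):
    """Split a tag line into (name, significant tokens, free-form comment)."""
    if not line.startswith("@"):
        raise ValueError("not a tag line")
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("missing tag name")
    name = tokens[0]
    count = TAG_ARITY.get(name, 0)
    args = tokens[1:1 + count]
    rest = tokens[1 + count:]
    # An optional "-" separates the significant tokens from the comment.
    if rest and rest[0] == "-":
        rest = rest[1:]
    # Note: runs of whitespace in the comment are collapsed to single spaces.
    return name, args, " ".join(rest)
```

With these assumptions, both example lines above parse without any special casing per tag beyond the token count.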

flodolo commented 9 months ago

Obsolete strings

@obsolete might be more accurate than @deprecated?

Do not translate

How do tags apply to the equivalent of a message with attributes? For example, this string.

about-logins-login-filter2 =
  .placeholder = Search Passwords
  .key = F

The "do not translate" only applies to the key attribute (BTW, I would use @do-not-translate for that tag, @readonly is quite misleading in this context).

Length limits

I agree with Bryan on the @max-length, and I've never seen a case where we need a @min-length. On the other hand, we have cases where an empty string is allowed, while in most cases it's not: do we want @allow-empty, or is it too much of an edge case?

Fun fact: we have strings where the length limitation is not measured in characters but bytes… Do we need to cover that, by providing an optional unit of measurement (characters by default)?

Content validation

We have cases where the string is not really a string but a boolean value (true, false, empty equals false). Do we want something like @allowed-values followed by an array?

eemeli commented 9 months ago

@obsolete might be more accurate than @deprecated?

Agreed, that's a better term.

How do tags apply to the equivalent of a message with attributes? For example, this string.

about-logins-login-filter2 =
  .placeholder = Search Passwords
  .key = F

TBD, probably depends on the tag. Something like @max-length probably only applies to the exact message it's attached to, while @obsolete could apply to e.g. a section, and thereby all messages within it.

At least for now, this syntax does not support attributes like FTL does, so the example above would probably end up in MF2 as

[about-logins]
...
login-filter.placeholder = Search Passwords
login-filter.key = F

and then be referred to in code as about-logins.login-filter, which would at runtime have a value very similar to the Fluent about-logins-login-filter2 message.
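A small sketch of how that lookup could work, assuming a deliberately simplified line-based reading of the syntax (section headers in brackets, `key = value` entries, # comments). The real resource syntax is not final, and the `flatten` and `lookup` helpers are hypothetical.

```python
def flatten(resource_text):
    """Map each entry to a dotted id of the form section.key."""
    section = ""
    entries = {}
    for raw in resource_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line; ignored in this sketch
        key = key.strip()
        full_id = f"{section}.{key}" if section else key
        entries[full_id] = value.strip()
    return entries


def lookup(entries, message_id):
    """Collect the sub-entries of message_id, keyed by their last part."""
    prefix = message_id + "."
    return {k[len(prefix):]: v for k, v in entries.items() if k.startswith(prefix)}


sample = """\
[about-logins]
login-filter.placeholder = Search Passwords
login-filter.key = F
"""
parts = lookup(flatten(sample), "about-logins.login-filter")
```

Here `parts` holds the placeholder and key values together, roughly matching what the Fluent message with attributes provides.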

I agree with Bryan on the @max-length, and I've never seen a case where we need a @min-length. On the other hand, we have cases where an empty string is allowed, while in most cases it's not: do we want @allow-empty, or is it too much of an edge case?

@allow-empty seems like a rather good idea.

Fun fact: we have strings where the length limitation is not measured in characters but bytes… Do we need to cover that, by providing an optional unit of measurement (characters by default)?

That's pretty much the question, isn't it? Is the case for alternative length units (bytes, lines, anything else?) significant enough that we should support an optional unit?

We have cases where the string is not really a string but a boolean value (true, false, empty equals false). Do we want something like @allowed-values followed by an array?

There are also use cases for numerical values, and for messages consisting of CSS rules. And probably many other limitations as well. I mentioned @format above; it kinda sounds like we'll need to consider how deep to go here. Instead of @allowed-values, I think @enum might be better, as it may need to be complemented with something like @type.

eemeli commented 9 months ago

I've updated the list in the first comment here with suggestions from above. Left out @allowed-values/@enum for now, because I think that needs a bit more thought.

flodolo commented 9 months ago

That's pretty much the question, isn't it? Is the case for alternative length units (bytes, lines, anything else?) significant enough that we should support an optional unit?

The one in Firefox is pretty weird; IIRC it's a limitation caused by legacy hardware, which causes strings to be stored as 32 and 64 bytes. Based on that, not worth covering.

At the same time, it would be worth investigating how app stores calculate max lengths (example from Apple), because that's the most common use case nowadays. Is it obvious what a character means in this context, e.g. when looking at Latin-based writing vs CJK?

eemeli commented 9 months ago

I was not able to find a character definition from Apple.

For Google Play I found this:

Character limits apply to both full-width and half-width characters — the numbers listed above are the maximum limits regardless of what type of characters you are using.

Twitter provides their own definition for which characters count double; I think that's specific to them? At least the consideration of every URL as having a length of 23 chars is Twitter-only.

bcolsson commented 9 months ago

Not sure how accurate this is but comparing the ja vs en version of this page for iOS app store, it seems that both languages have the same character limits (30 chars for title, 170 for descriptions).

Coming from a ja->en translation background, I could see the case where having the ability to specify x number of characters for Japanese and a larger number of characters for different scripts would be useful, though I have no idea how complicated such an implementation would be.

SimonClark commented 8 months ago

I've been pondering something related to this, but it may be unique enough to our use case that it doesn't justify general functionality.

We are looking for a mechanism to purge unused strings from the system, but the nature of the product is that there will be no static analysis that can ensure 100% that a string key is unused. If static analysis and test automation finds no usages of a string key, then I intend to mark it as "suspected unused". That way, if it is used in production, the string repository can be notified, and the flag can be removed.

Similar to the @obsolete tag, but maybe semantically different enough to warrant its own tag.

eemeli commented 8 months ago

If static analysis and test automation finds no usages of a string key, then I intend to mark it as "suspected unused". That way, if it is used in production, the string repository can be notified, and the flag can be removed.

This does seem different from @obsolete, and while reasonable, also rather specific to your workflow -- which is fine! To me, this poses an interesting meta-question regarding the completeness/extensibility of the tags: If we leave this out, how can you add it for your local use? As in, do we allow for extensions to the core set of tags, and if so, are there requirements like namespacing on such custom extensions?

I think we have three choices here:

  1. Do not allow custom tags; define and declare the core set to be complete and exhaustive.
  2. Allow custom tags, but require them to be namespaced, probably with : as a separator (as in the MF2 identifier).
  3. Allow all custom tags, potentially causing conflicts if the core set is ever extended.

My sense would be that custom tags should be allowed, but require or strongly suggest that they be namespaced.
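Option 2 could be enforced with a check along these lines. This is only a sketch: `CORE_TAGS` simply mirrors the list in the first comment, and the `acme:` namespace in the usage lines is made up.

```python
# Assumed core set, copied from the list in the first comment.
CORE_TAGS = {
    "version", "param", "obsolete", "locale", "format",
    "schema", "do-not-translate", "max-length", "allow-empty",
}


def classify_tag(name):
    """Return 'core', 'custom', or 'invalid' for a tag name without the @."""
    if ":" in name:
        namespace, _, local = name.partition(":")
        # Custom tags need a non-empty namespace and a non-empty local name.
        return "custom" if namespace and local else "invalid"
    return "core" if name in CORE_TAGS else "invalid"


kinds = [
    classify_tag("max-length"),
    classify_tag("acme:suspected-unused"),
    classify_tag("suspected-unused"),
]
```

An unnamespaced unknown tag is rejected here, which leaves the core set free to grow later without colliding with custom tags.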

SimonClark commented 8 months ago

Allow custom tags, but require them to be namespaced, probably with : as a separator (as in the MF2 identifier).

This seems like the safest option. We are unlikely to get everything right the first time.

Pragmatically, if someone is introducing custom tags in the resource, then they almost certainly have a custom implementation of the resource parser (even if it is customized via extension). Portability of resource bundles with the custom tags is less of a concern, in my mind, because of that. Non-conflict with future evolution is the primary driver, I think.

Either way, namespacing gives the most safety and flexibility.

tomasr8 commented 7 months ago

Hi everyone, I found out about this at FOSDEM and I'm looking forward to this becoming a standard (and using it in Python 😄)

I was wondering if you've considered supporting a property/metadata for screenshots? We don't use this ourselves, but there are platforms (e.g. Transifex & Weblate) that let you attach a screenshot to a specific message to give translators more information about where and how the message is used in the UI. There would be no runtime impact, it would simply be something that GUI tools built on top of MF2 could display if they choose to do so.

To give a concrete example:

# Not sure about the property name yet
@url https://example.com/login-page.png
msg = Login

Admittedly, this could be achieved with a custom property, but perhaps having a standardized name is useful?

lucacasonato commented 3 months ago

I think that @locale should be a required parameter for the message bundle. The reason is that tooling could then reliably ascertain the language of a bundle from just the file contents.

This would enable some nice use cases:

  1. A MessageFormat 2.0 language server can reliably provide language-relevant editor completions in a message (like providing diagnostics if selectors are used that in a given language are not valid for a certain type).

  2. A host integration could allow importing of a message bundle, and directly return MessageFormat objects. For example in JS import { msg1, msg2 } from "./translations.mf2" (functions can be made to work too, via import attributes).

  3. Translation tools can always just "import" a message bundle file, without you having to specify what locale it is out of band.

eemeli commented 3 months ago

I had not considered requiring any of the properties, but with an explicit locale combined with decent defaults for other values, a resource can indeed be considered complete just by itself.

Often, the locale is also stored and read separately from the resource as a part of its filename or path, or otherwise. However, there is no universal specification for exactly how this is done. Including the locale in the resource is also done by other resource formats, such as gettext and XLIFF; the latter in fact has at least source-language as a required attribute of its <file>.

Adding a required resource property would make the frontmatter always required, so a minimal resource would be something like

@locale en-US
---
key = value

This is... okay? While it adds a little bit of bulk to the format, the locale does still need to be stored somewhere, and this has the side benefit of making the file format distinguishable from just its contents. Based on the research I did under #14, I'm relatively confident that @properties are not used by other formats with frontmatter. Without the frontmatter, it's possible to end up with a message resource that's indistinguishable from a .properties file.

In other words, I'm struggling a bit to find any significant downsides to requiring a locale for the resource.
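For illustration, a tool could recover the locale from the file contents alone roughly like this. The sketch assumes the frontmatter ends at a line containing only --- and that @locale takes a single value; `read_locale` is a hypothetical helper, not an existing API.

```python
def read_locale(resource_text):
    """Return the @locale value from the frontmatter, or None if absent."""
    lines = [line.strip() for line in resource_text.splitlines()]
    if "---" not in lines:
        # No frontmatter separator: could just as well be a .properties file.
        return None
    for line in lines:
        if line == "---":
            break  # end of frontmatter; messages follow
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "@locale":
            return parts[1]
    return None


minimal = "@locale en-US\n---\nkey = value\n"
locale = read_locale(minimal)
```

A file without the separator returns None here, which is the case where the resource cannot be told apart from a .properties file.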