eemeli / message-resource-wg

Developing a standard for Unicode MessageFormat 2 resources

Add data model as TypeScript definitions #16

Closed eemeli closed 1 month ago

eemeli commented 1 year ago

This builds on #15, so includes the metadata as described there.

Much like the MF2 message data model, the proposed canonical resource data model stays close to the source syntax, but leaves out insignificant details such as empty lines and other ignored whitespace. Source positions are not included, and the data model makes no attempt to be a CST representation of a parsed source.

It's expressed as TypeScript, as that allows for sufficient flexibility and expressiveness. Equivalent definitions could be included later in other formats, though that might require narrowing the parametric types to some specifics.

A Junk definition is included, to allow for the representation of resources with invalid contents. The top-level Resource definition includes a type parameter that acts as a toggle for this, as there are likely to be cases where any parse error would invalidate the whole resource, and hence a representation of the resource might never include any junk.

The data model is also parametric on the metadata and message definitions. While both default to string, it's possible to use this same data model with further specifications of each, e.g. using the MF2 Message data model.
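To make the junk toggle and the two type parameters concrete, here is a minimal sketch of how such a parametric model could be written. All names and fields (`AllowJunk`, `Entry`, `meta`, `countJunk`) are assumptions made for illustration; they are not the definitions in this PR.

```typescript
// Illustrative sketch only: hypothetical names, not this PR's definitions.

// Metadata and message values are type parameters, both defaulting to string.
interface Metadata<V> {
  key: string;
  value: V;
}

interface Message<M, V> {
  type: 'message';
  id: string[];
  meta: Metadata<V>[];
  value: M;
}

// Junk holds source text that could not be parsed.
interface Junk {
  type: 'junk';
  source: string;
}

// The AllowJunk parameter toggles whether Junk may appear at all.
type Entry<M, V, AllowJunk extends boolean> =
  | Message<M, V>
  | (AllowJunk extends true ? Junk : never);

interface Resource<AllowJunk extends boolean = true, M = string, V = string> {
  entries: Entry<M, V, AllowJunk>[];
}

// Works for both strict and lenient resources.
function countJunk(res: Resource<boolean>): number {
  let count = 0;
  for (const entry of res.entries) {
    if (entry.type === 'junk') count += 1;
  }
  return count;
}
```

With this shape, a parser that rejects the whole resource on any error could return `Resource<false>`, and the compiler would then reject any attempt to place a `Junk` entry in it.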

Some specific relaxations of syntax requirements are included to allow for Resource to represent resources using different syntaxes, such as Fluent FTL. As with the MF2 data model, we may want to ensure that this data model can losslessly represent many if not all localization resource formats.

eemeli commented 9 months ago

Added a JSON Schema as resource.json (rather than message.json, as I'd mistakenly written in the commit message), following the same approach as taken for the MF2 data model's JSON Schema.

eemeli commented 9 months ago

The choice for sections to not contain their children seems to be inspired by Fluent. Back then, the assumption was that keeping a flat list of entries (messages, section heads, standalone comments) makes processing messages simpler, and that the entry-is-in-section relationship can be reconstructed in logic on demand.

Somewhat in parallel with your review, I ended up refactoring this PR so that a Resource contains sections, and each Section contains entries. The previous choice was in fact made to correspond (perhaps too closely) to the syntax, which includes a section-head rule.

My hope is that the updated proposal gives good answers to all the potential resource/message actions you mention.
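As a rough illustration of the nested shape, the sketch below has a resource holding sections and each section holding its entries, with a helper that recovers the flat listing when a consumer wants one. The field names (`sections`, `entries`, `id`) and the `flattenEntries` helper are hypothetical and simplified; they are not copied from the PR.

```typescript
// Illustrative sketch only: hypothetical, simplified shapes.

interface Entry {
  id: string[];
  value: string;
}

interface Section {
  id: string[];
  entries: Entry[];
}

interface Resource {
  sections: Section[];
}

interface FlatEntry {
  path: string[]; // section id followed by entry id
  value: string;
}

// Recover a flat list of entries from the nested structure.
function flattenEntries(res: Resource): FlatEntry[] {
  const flat: FlatEntry[] = [];
  for (const section of res.sections) {
    for (const entry of section.entries) {
      flat.push({ path: section.id.concat(entry.id), value: entry.value });
    }
  }
  return flat;
}
```

Going from nested to flat is a simple loop like this one, whereas the earlier flat model required tracking the most recent section head to rebuild the entry-is-in-section relationship.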

eemeli commented 4 months ago

Dropped the parametric metadata. Since the last update here, I've been working on moz.l10n, a Python library that implements a version of this PR's data model.

During that work, it's become clear that the parametrization of metadata values is more of a hindrance than a help, as the uncertainty about the metadata value type needs to be accounted for at all levels, for very little benefit.

It's still entirely plausible to encode structure in metadata string values, but that should be done in a consumer of this data model.
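For example, a consumer could keep metadata values as plain strings in the data model and decode structure on its own side. The sketch below assumes a key/value metadata shape and a JSON encoding for structured values; both are assumptions chosen for illustration, and neither is prescribed by this PR or taken from moz.l10n.

```typescript
// Illustrative sketch only: the Metadata shape and the JSON encoding
// are assumptions, not part of this PR's data model.

interface Metadata {
  key: string;
  value: string; // always a string in the data model
}

// Consumer-side helper: try to decode a metadata value as JSON,
// falling back to the raw string when it is not valid JSON.
function readStructuredMeta(meta: Metadata[], key: string): unknown {
  for (const item of meta) {
    if (item.key !== key) continue;
    try {
      return JSON.parse(item.value);
    } catch (_err) {
      return item.value;
    }
  }
  return undefined; // no metadata entry with this key
}
```

The data model itself only ever sees `string`, so code that does not care about the encoding never needs to account for other value types.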