Open luspi opened 4 years ago
The data model for Traduora doesn’t appear to be complex enough to handle Gettext translations. A few points:
import { Label } from './label';
export interface Term {
id: string;
value: string;
labels: Label[];
}
In Gettext, a term can’t be identified by a single value, unless that value is encoded (which will complicate display on the web). Strictly speaking, a model for Gettext terms should probably look something like:
export interface Term {
id: string;
msgid: string; // required. The message ID for the translation.
msgid_plural: string; // The plural form for the translation;
msgctxt: string; // The context for the translation.
domain: string; // The domain of the translations.
}
msgid
is the main key for the translation. Because of the way that Gettext works, this can be and is typically a sentence ("Hello, world."
), not just a simple label ("hello_world"
). (This means that the advice given in #86 is incorrect with respect to Gettext. If the limit of 255 is required, then I would suggest considering the use of a hashed value for the key and a display value for what is imported and exported. I will expand on this suggestion below.)
msgid_plural
is an extra label for pluralizable messages.
The msgctxt
is optional, but if provided, allows for duplicate msgid
values.
The domain
is typically the .po
/.mo
file being referenced, but it gets stored as its own name at runtime. This can probably be ignored for a translation service because the domains are typically separate files (e.g., an errors.po
domain for errors and a default.po
domain for most everything else), and can be modelled in Traduora as multiple projects.
This means that the unique identifier for a Gettext term is domain, msgid, msgctxt
. Also note that the comments in Gettext are also meaningful:
#: src/reference.ts:234
: the term was found on line 234 of src/reference.ts
.# Instruction to translators
#. Comment associated with the translation extracted from the source.
#, flag comment
: flags.
fuzzy
(a similar term was found so its translation has been copied over; it indicates that the translation might not be correct in context)*-format
, no-*-format
: Indicates that the translations are generated by xgettext
#| msgid PREVIOUS-UNTRANSLATED-STRING
Examples (as a .pot
file; which is pure term definitions):
msgid "This is a simple translation."
msgstr ""
#. The province of British Columbia. It is spelled differently in French.
msgid "British Columbia"
msgstr ""
#. The province of British Columbia. Used as a short postal code form.
msgid "British Columbia"
msgctxt "postal"
msgstr ""
#. How long can an untranslated string be?
msgid ""
"An untranslated string "
"can be specified on multiple lines. "
"This allows paragraphs. The same "
"applies to msgid_plural forms."
msgstr ""
#. How many lights did Picard see?
msgid "There is one light."
msgid_plural "There are %{count} lights."
msgstr ""
The last Is interesting; I do not know if it is legal to have the same msgid
with different msgid_plural
values; I do not believe so, but it may be possible. When asking for a plural translation, you have to provide both the msgid
and msgid_plural
values.
gettext("This is a simple translation.");
gettext("British Columbia");
pgettext("postal", "British Columbia");
gettext("An untranslated string can be specified on multiple lines. This allows paragraphs. The same applies to msgid_plural forms.");
ngettext("There is one light", "There are %{count} lights.", 4, 4);
gettext
, pgettext
, and ngettext
end up translating to dnpgettext
like this (there are varargs for the values at the end; the first value — the d
is the domain
; I am using null
here, but in the gettext library that I have used / ported, it gets set to default
or some other default value during initialization).
dnpgettext(null, null, "This is a simple translation.");
dnpgettext(null, null, "British Columbia");
dnpgettext(null, "postal", "British Columbia");
dnpgettext(null, null, "An untranslated string can be specified on multiple lines. This allows paragraphs. The same applies to msgid_plural forms.");
dnpgettext(null, null, "There is one light", "There are %{count} lights.", 4, 4);
The point of this is to point out that the definition of term
needs to change, and the importers need to change to support this change in order to support Gettext properly. The minimal change that also allows for you to keep the 255 character limit would be something like:
import { Label } from './label';
export class DisplayTerm {
context: string;
term: string;
plural: string;
}
export interface Term {
id: string;
value: string;
display: DisplayTerm;
labels: Label[];
}
The DisplayTerm class
would only be used if needed, so little change would be needed unless you wish, except on the front-end and on the edges.
gettext
, it would always fill display
values (msgctxt
to context
; msgid
to term
; msgid_plural
to plural
). The values here would be concatenated—probably with a binary value like ASCII 28 (FS)—and then hashed (md5
is probably sufficient, but sha256
or sha512
are both fast enough and give more key space) and used as the value
in the modified Term
. gettext
, it would be necessary to use the DisplayTerm
structure as noted.context
, term
, and plural
.N
plural translations for the single plural
term. Some languages like Japanese only have a single case regardless of number. Others have two cases (like English) or three cases (Russian). Irish Gaelic has 5 and Arabic has 6.And, as #147 requests, it should be possible to import a .pot
file, which just imports terms into the database.
I hope this helps; unfortunately, I have no time to look at making the changes on this project myself nor do I wish to sign a CLA to do so. I’m not a gettext
expert, but I have had to become excessively familiar with it over the last 18 months, including updating and fixing a JavaScript implementation of gettext
(unpublished as yet).
First of all, thank you for the incredibly comprehensive reply. It's very helpful!
I think your proposed implementation would work well. However, I've been evaluating the way terms/translations are structured on the database, and in particular how to best store plurals and contextual information in a way that would play nicely with the other formats.
Zooming out a bit, ideally there would be an intermediate "high-level" format that can be cleanly transformed into any other representation such as Gettext, XLIFF, YAML, and so on. This high level representation is what would be stored in the DB, in a way that is easy to search, update from any format, and version control too (a highly requested feature).
What I'm getting at: unfortunately, the current database structure is not flexible enough to easily accommodate these new features, and I've been out of capacity to maintain this repo for some time now. Which means it might take a while until this can be implemented.
I will add this to the Traduora 1.0 milestone, I do think the DB schema has to be reworked before the other features can be worked on.
Your feedback on this post is really valuable for the new schema design. Once again, thanks a lot for your time and help!
I had the same error while importing these two lines. I think the problem was a trailing space in terms
engineer / nlpv2 / training / form / add ,Training hinzufügen
engineer / nlpv2 / training / form / edit ,Training bearbeiten
After removing the spaces from terms, it imported correctly
engineer / nlpv2 / training / form / add,Training hinzufügen
engineer / nlpv2 / training / form / edit,Training bearbeiten
Describe the bug When I try to import a po file into a fresh traduora instance the import fails if there are duplicate msgids in the po file, even though the msgctext is different. The error message I get is "A resource with this value already exists. Please use a different one.".
To Reproduce Steps to reproduce the behavior:
Expected behavior The file is properly imported.
Environment (please complete the following information):
Additional context Here is a very simple po file exhibiting that behavior: