Import error caused by duplicate msgid in gettext (po) files

luspi commented 4 years ago

Describe the bug When I try to import a po file into a fresh traduora instance the import fails if there are duplicate msgids in the po file, even though the msgctext is different. The error message I get is "A resource with this value already exists. Please use a different one.".

To Reproduce Steps to reproduce the behavior:

Go to 'Import'
Select the po file and a locale
Click on 'Import'
See error

Expected behavior The file is properly imported.

Environment (please complete the following information):

Device: Desktop
OS: Arch Linux
Browser: Firefox
Version: 78

Additional context Here is a very simple po file exhibiting that behavior:

msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"X-Language: de_DE\n"
"X-Qt-Contexts: true\n"

msgctxt "somewhere|"
msgid "Close"
msgstr ""

msgctxt "else|"
msgid "Close"
msgstr ""

halostatue commented 3 years ago

The data model for Traduora doesn’t appear to be complex enough to handle Gettext translations. A few points:

import { Label } from './label';

export interface Term {
  id: string;
  value: string;
  labels: Label[];
}

In Gettext, a term can’t be identified by a single value, unless that value is encoded (which will complicate display on the web). Strictly speaking, a model for Gettext terms should probably look something like:

export interface Term {
  id: string;
  msgid: string; // required. The message ID for the translation.
  msgid_plural: string; // The plural form for the translation;
  msgctxt: string; // The context for the translation.
  domain: string; // The domain of the translations.
}

msgid is the main key for the translation. Because of the way that Gettext works, this can be and is typically a sentence ("Hello, world."), not just a simple label ("hello_world"). (This means that the advice given in #86 is incorrect with respect to Gettext. If the limit of 255 is required, then I would suggest considering the use of a hashed value for the key and a display value for what is imported and exported. I will expand on this suggestion below.)

msgid_plural is an extra label for pluralizable messages.

The msgctxt is optional, but if provided, allows for duplicate msgid values.

The domain is typically the .po/.mo file being referenced, but it gets stored as its own name at runtime. This can probably be ignored for a translation service because the domains are typically separate files (e.g., an errors.po domain for errors and a default.po domain for most everything else), and can be modelled in Traduora as multiple projects.

This means that the unique identifier for a Gettext term is domain, msgid, msgctxt. Also note that the comments in Gettext are also meaningful:

#: src/reference.ts:234: the term was found on line 234 of src/reference.ts.
# Instruction to translators
#. Comment associated with the translation extracted from the source.
#, flag comment: flags.
- fuzzy (a similar term was found so its translation has been copied over; it indicates that the translation might not be correct in context)
- *-format, no-*-format: Indicates that the translations are generated by xgettext
#| msgid PREVIOUS-UNTRANSLATED-STRING

Examples (as a .pot file; which is pure term definitions):

msgid "This is a simple translation."
msgstr ""

#. The province of British Columbia. It is spelled differently in French.
msgid "British Columbia"
msgstr ""

#. The province of British Columbia. Used as a short postal code form.
msgid "British Columbia"
msgctxt "postal"
msgstr ""

#. How long can an untranslated string be?
msgid ""
"An untranslated string "
"can be specified on multiple lines. "
"This allows paragraphs. The same "
"applies to msgid_plural forms."
msgstr ""

#. How many lights did Picard see?
msgid "There is one light."
msgid_plural "There are %{count} lights."
msgstr ""

The last Is interesting; I do not know if it is legal to have the same msgid with different msgid_plural values; I do not believe so, but it may be possible. When asking for a plural translation, you have to provide both the msgid and msgid_plural values.

gettext("This is a simple translation.");
gettext("British Columbia");
pgettext("postal", "British Columbia");
gettext("An untranslated string can be specified on multiple lines. This allows paragraphs. The same applies to msgid_plural forms.");
ngettext("There is one light", "There are %{count} lights.", 4, 4);

gettext, pgettext, and ngettext end up translating to dnpgettext like this (there are varargs for the values at the end; the first value — the d is the domain; I am using null here, but in the gettext library that I have used / ported, it gets set to default or some other default value during initialization).

dnpgettext(null, null, "This is a simple translation.");
dnpgettext(null, null, "British Columbia");
dnpgettext(null, "postal", "British Columbia");
dnpgettext(null, null, "An untranslated string can be specified on multiple lines. This allows paragraphs. The same applies to msgid_plural forms.");
dnpgettext(null, null, "There is one light", "There are %{count} lights.", 4, 4);

The point of this is to point out that the definition of term needs to change, and the importers need to change to support this change in order to support Gettext properly. The minimal change that also allows for you to keep the 255 character limit would be something like:

import { Label } from './label';

export class DisplayTerm {
  context: string;
  term: string;
  plural: string;
}

export interface Term {
  id: string;
  value: string;
  display: DisplayTerm;
  labels: Label[];
}

The DisplayTerm class would only be used if needed, so little change would be needed unless you wish, except on the front-end and on the edges.

When importing from gettext, it would always fill display values (msgctxt to context; msgid to term; msgid_plural to plural). The values here would be concatenated—probably with a binary value like ASCII 28 (FS)—and then hashed (md5 is probably sufficient, but sha256 or sha512 are both fast enough and give more key space) and used as the value in the modified Term.
When exporting to gettext, it would be necessary to use the DisplayTerm structure as noted.
When editing terms, it would be necessary to allow the creation and editing of context, term, and plural.
When editing translations, note that, depending on the language, there can be N plural translations for the single plural term. Some languages like Japanese only have a single case regardless of number. Others have two cases (like English) or three cases (Russian). Irish Gaelic has 5 and Arabic has 6.

And, as #147 requests, it should be possible to import a .pot file, which just imports terms into the database.

I hope this helps; unfortunately, I have no time to look at making the changes on this project myself nor do I wish to sign a CLA to do so. I’m not a gettext expert, but I have had to become excessively familiar with it over the last 18 months, including updating and fixing a JavaScript implementation of gettext (unpublished as yet).

anthonynsimon commented 3 years ago

First of all, thank you for the incredibly comprehensive reply. It's very helpful!

I think your proposed implementation would work well. However, I've been evaluating the way terms/translations are structured on the database, and in particular how to best store plurals and contextual information in a way that would play nicely with the other formats.

Zooming out a bit, ideally there would be an intermediate "high-level" format that can be cleanly transformed into any other representation such as Gettext, XLIFF, YAML, and so on. This high level representation is what would be stored in the DB, in a way that is easy to search, update from any format, and version control too (a highly requested feature).

What I'm getting at: unfortunately, the current database structure is not flexible enough to easily accommodate these new features, and I've been out of capacity to maintain this repo for some time now. Which means it might take a while until this can be implemented.

I will add this to the Traduora 1.0 milestone, I do think the DB schema has to be reworked before the other features can be worked on.

Your feedback on this post is really valuable for the new schema design. Once again, thanks a lot for your time and help!

jakub-zawislak commented 2 years ago

I had the same error while importing these two lines. I think the problem was a trailing space in terms

engineer / nlpv2 / training / form / add ,Training hinzufügen
engineer / nlpv2 / training / form / edit ,Training bearbeiten

After removing the spaces from terms, it imported correctly

engineer / nlpv2 / training / form / add,Training hinzufügen
engineer / nlpv2 / training / form / edit,Training bearbeiten

ever-co / ever-traduora

Import error caused by duplicate msgid in gettext (po) files #142