MichaelChirico / potools

Tools for working with translations in R
https://michaelchirico.github.io/potools/

"invalid multibyte sequence" error from msgfmt on "¡" #299

Closed maelle closed 1 year ago

maelle commented 1 year ago

:wave:, thanks for maintaining potools!

I'm writing an example package and noticed that I can't use "¡" in either msgid or msgstr. Is that expected?

MichaelChirico commented 1 year ago

that sounds wrong to me! can you share more info (the platform you're using, the stack trace)?

maelle commented 1 year ago

If I add "¡" at https://github.com/maelle/pockage/blob/a36978a1c06dcdc3dbd6200f4110c2bbaa1ba21b/po/R-es.po#L20, I get:

> potools::po_compile()
Recompiling 'ca' R translation
Running system command msgfmt -c --statistics -o './inst/po/ca/LC_MESSAGES/R-pockage.mo' './po/R-ca.po'...
./po/R-ca.po:15:19: invalid multibyte sequence
./po/R-ca.po:15:20: invalid multibyte sequence
msgfmt: found 2 fatal errors
Warning: running msgfmt on R-ca.po failed.
Here is the po file:
msgid ""
msgstr ""
"Project-Id-Version: pockage 0.0.0.9000\n"
"POT-Creation-Date: 2023-10-06 10:45+0200\n"
"PO-Revision-Date: 2023-10-06 10:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: ca\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"

#: mensaje.R:9
msgid "user"
msgstr "usuari/usuària"

#: mensaje.R:10
msgid "Hello {name}!"
msgstr "Hola {name}!"
Recompiling 'es' R translation
Running system command msgfmt -c --statistics -o './inst/po/es/LC_MESSAGES/R-pockage.mo' './po/R-es.po'...
./po/R-es.po:20:9: invalid multibyte sequence
./po/R-es.po:20:10: invalid multibyte sequence
msgfmt: found 2 fatal errors
Warning: running msgfmt on R-es.po failed.
Here is the po file:
msgid ""
msgstr ""
"Project-Id-Version: pockage 0.0.0.9000\n"
"POT-Creation-Date: 2023-10-06 10:45+0200\n"
"PO-Revision-Date: 2023-10-06 10:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: es\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

#: mensaje.R:9
msgid "user"
msgstr "usuari@"

#: mensaje.R:10
msgid "Hello {name}!"
msgstr "¡Hola {name}!"
Recompiling 'fr' R translation
Running system command msgfmt -c --statistics -o './inst/po/fr/LC_MESSAGES/R-pockage.mo' './po/R-fr.po'...
./po/R-fr.po:16:20: invalid multibyte sequence
./po/R-fr.po:16:21: invalid multibyte sequence
msgfmt: found 2 fatal errors
Warning: running msgfmt on R-fr.po failed.
Here is the po file:
msgid ""
msgstr ""
"Project-Id-Version: pockage 0.0.0.9000\n"
"POT-Creation-Date: 2023-10-06 10:45+0200\n"
"PO-Revision-Date: 2023-10-06 10:33+0200\n"
"Last-Translator: Malle Salmon\n"
"Language-Team: none\n"
"Language: fr\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

#: mensaje.R:9
msgid "user"
msgstr "utilisateur·rice"

#: mensaje.R:10
msgid "Hello {name}!"
msgstr "Salut {name} !"

This is on:

─ Session info ─────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.0 (2022-04-22)
 os       Ubuntu 20.04.6 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US.utf8
 collate  en_US.utf8
 ctype    en_US.utf8
 tz       Europe/Paris
 date     2023-10-06
 rstudio  2023.06.2+561 Mountain Hydrangea (desktop)
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

I installed potools from GitHub with pak, and didn't have to worry about the system dependency (or maybe I should have!).

maelle commented 1 year ago

Apparently I also get the error for the slash in the other file https://github.com/maelle/pockage/blob/a36978a1c06dcdc3dbd6200f4110c2bbaa1ba21b/po/R-ca.po#L15 but that wasn't breaking on its own.
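
A plausible reading of the paired column numbers in the log above (9/10, 19/20, 20/21) is that each flagged character takes two bytes in UTF-8, and msgfmt reports each byte separately when the header declares charset=ASCII. By that reading, column 19 of the Catalan line would be the "à" in "usuària" rather than the slash. A small sketch to check the byte counts, assuming a POSIX shell and a script saved as UTF-8 (`byte_count` is a hypothetical helper, not part of gettext or potools):

```shell
# Count the bytes a string occupies when this script is saved as UTF-8.
# byte_count is a hypothetical helper introduced only for this illustration.
byte_count() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

inverted_bang_bytes="$(byte_count '¡')"   # from R-es.po, line 20
a_grave_bytes="$(byte_count 'à')"         # from R-ca.po, line 15
middle_dot_bytes="$(byte_count '·')"      # from R-fr.po, line 16
slash_bytes="$(byte_count '/')"           # plain ASCII, for comparison

echo "¡: $inverted_bang_bytes, à: $a_grave_bytes, ·: $middle_dot_bytes, /: $slash_bytes"
```

Any byte above 0x7F is outside ASCII, so a two-byte character would produce two errors at consecutive columns, while the one-byte slash would produce none.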

MichaelChirico commented 1 year ago

My main concern about the platform was whether this is happening on Windows. I'm definitely surprised that this is happening on Ubuntu and hadn't been caught yet! I'll take a look at this soon.

hadley commented 1 year ago

I know literally nothing about this, but this line caught my eye:

"Content-Type: text/plain; charset=ASCII\n"

It would be worth trying to change ASCII to UTF-8.

maelle commented 1 year ago

@hadley yes, this worked! :tada:
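
A minimal sketch of that header change, assuming a POSIX shell with sed. The file below is a trimmed stand-in modeled on po/R-es.po, not the real file:

```shell
# Create a small stand-in .po file (hypothetical, trimmed from the header in the log).
po_file="$(mktemp)"
cat > "$po_file" <<'EOF'
msgid ""
msgstr ""
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"

msgid "Hello {name}!"
msgstr "¡Hola {name}!"
EOF

# Rewrite only the charset declared on the Content-Type header line.
sed 's/^\("Content-Type: text\/plain; charset=\)ASCII/\1UTF-8/' "$po_file" > "$po_file.tmp" &&
  mv "$po_file.tmp" "$po_file"

# Show the resulting header line.
fixed_header="$(grep charset "$po_file")"
printf '%s\n' "$fixed_header"
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.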

MichaelChirico commented 1 year ago

Thanks @hadley!

Maëlle, how was that .po file generated in the first place? I want to make sure {potools} is not emitting any troublesome headers like that.

MichaelChirico commented 1 year ago

Looks like {potools} can do so, here's how run_msginit() would work:

msginit -i R-pockage.pot -o R-ja.po -l ja -w 120 --no-translator
grep charset R-ja.po
# "Content-Type: text/plain; charset=ASCII\n"

MichaelChirico commented 1 year ago

I don't see an option for msginit to force it to use charset=UTF-8. It looks like the charset is entirely derived from the header metadata in the .pot file:

‘MIME-Version, Content-Type, Content-Transfer-Encoding’

These values are set according to the content of the POT file and the current locale. If the POT file contains charset=UTF-8, it means that the POT file contains non-ASCII characters, and we keep the UTF-8 encoding. Otherwise, when the POT file is plain ASCII, we use the locale’s encoding.

I had hoped using msginit -l ja.UTF-8 ... would do the trick but no such luck.

If I replace charset=CHARSET with charset=UTF-8 in the .pot file, msginit indeed carries that over to the output .po file.
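
As a rough sketch of that template edit, assuming a POSIX shell with sed and a trimmed stand-in for the .pot file (a real R-pockage.pot would carry more header fields):

```shell
# Stand-in template (hypothetical, trimmed); only the Content-Type line matters here.
pot_file="$(mktemp)"
cat > "$pot_file" <<'EOF'
msgid ""
msgstr ""
"Content-Type: text/plain; charset=CHARSET\n"

msgid "user"
msgstr ""
EOF

# Replace the CHARSET placeholder so the template declares UTF-8.
sed 's/charset=CHARSET/charset=UTF-8/' "$pot_file" > "$pot_file.tmp" &&
  mv "$pot_file.tmp" "$pot_file"

pot_charset_line="$(grep charset "$pot_file")"
printf '%s\n' "$pot_charset_line"
```

Running the msginit command from the earlier comment against a template patched this way is what carried charset=UTF-8 over to the .po file in the test described above.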

Looking now at how safe it may be to default to charset=UTF-8 in .pot files...

MichaelChirico commented 1 year ago

Another note: there seems to be a conflict between po_create(), which wraps msginit, and write_po_file(), which always sets charset=UTF-8:

https://github.com/MichaelChirico/potools/blob/05e873dade3e3d148af5dc4a5b0c5206e6e511b9/R/write_po_file.R#L78-L79

maelle commented 1 year ago

I had created the files using potools. Thank you!