UTF-8 compatible export for BeerXML

Brewtarget / brewtarget

Main brewtarget source code repository.

GNU General Public License v3.0

312 stars 135 forks source link

UTF-8 compatible export for BeerXML #624

Closed bebenlebricolo closed 2 years ago

bebenlebricolo commented 2 years ago

This might be contained within another Issue already, but I discovered that exporting in beerXML format breaks some character sets. For instance, I'm using the french alphabet with accents and special characters ; after writing a beer recipe and exporting it to beerXML, I found that in the XML itself I had issues with the encoding ISO-8859-1 as it encodes only the first 256 characters available in UTF-8 (so no accents). Another difference is that UTF-8 is encoded on two octets whereas ISO-8859-1 only uses one.

Misencoding for western languages (?)

The Xml header indicates :

<?xml version="1.0" encoding="ISO-8859-1"?>

and further down the file I have (opening the file in UTF-8 mode)

<MASH_STEP>
     <NAME>Emp�tage</NAME>
</MASH_STEP>

"Empâtage" is the french translation of "Mashing", but accents are not preserved it seems. Another example :

 Mandarina Bavaria � l'�bullition pendant 10,000 min

Expected : "Mandarina Bavaria à l'ébullition pendant 10.000 min"

Reopening the file using the ISO-8859-1 encoding gives the following results :

    <DIRECTIONS>Mettez 20,000 g Mandarina Bavaria ï¿½ l'ï¿½bullition pendant 10,000 min.</DIRECTIONS>

This encoding splits each special character into 2 bytes and tries to decode it with the ISO-8859-1 character set, so it does not account for the extra characters.

Local configuration

BrewTarget version : 2.3.1 (arch linux x86_64 bits, using arch linux AUR via yay)

I've updated the brewtarget version to the latest available version in arch AUR (2.3.1-3) and the issue still appears to be there.

matty0ung commented 2 years ago

Thanks for the detailed description. This is indeed not the behaviour we would want.

I think there are a couple of improvements we could make in the code.

Firstly, we're probably not doing the right thing when we write strings out to the XML file, even in the newest version of the code. It should be straightforward to get Qt to transcode our strings to ISO-8859-1.

Secondly, BeerXML is a bit odd in that it requires us to use ISO-8859-1 (see "XML Header" section of http://www.beerxml.com/beerxml.htm). This does support some accents etc (including most of the ones you'd use in French) but it's not great for all languages. We could at least consider allowing export in UTF-8, ie give you the option to create something that is valid XML, and would probably be read OK by programs wanting to parse BeerXML, even though it's not strictly in compliance with BeerXML 1.0.

I'll try to have a look at this in the next few weeks. (Bit distracted by working on things that will enable us to support BeerJSON, but hope to reach a natural break-point on that soon.)

bebenlebricolo commented 2 years ago

Thanks for your quick reply ! I've browsed through the code and found the BeerXML.h/cpp couple of files with some detailed explanations around the xml formatting. That's quite a downgrade over the regular xml construction, the one being imposed by BeerXML. I followed the link to beerXML.com, and read about the header.

It effectively states :

Per the XML standard, all files should begin with the following header line as the first line. [...] Required XML Header Example with Recipes tag: <?xml version="1.0" encoding="ISO-8859-1"?>

But I'm wondering how to interprete those statements. Maybe it's not the place to talk about this (in such a case, I apologize for the noise :smile: ), but could one interprete the statement as "we need the xml's header line as per xml standard, which may looks like this: <?xml version="1.0" encoding="ISO-8859-1"?> ? If such an hypothesis is correct, then we might use the more traditional <?xml version="1.0" encoding="UTF-8"?> , to the cost of slightly heavier files (personnaly, I don't mind the extra bytes :smile: )

It seems that the wikipedia page of BeerXML supports this hypothesis. I suppose it's not as simple as changing the header encoding of the file itself, and some more development is required (with QString encoding as you mentioned) but at least we might be able to provide a wider version of beerXML encoding after all.

Regarding the BeerJSON, that's really nice to read ! I was wondering about this myself some hours ago, that could be a nice feature to have in order to share some recipes away.

matty0ung commented 2 years ago

As you'll have seen in the comments in BeerXml.cpp, there are several respects in which the BeerXML 1.0 standard is flawed, including that it is not actually valid XML. Since the standard is no longer actively maintained, we can't get definitive answers on how to interpret the bits that seem badly worded or plain wrong.

The sample files I found all had the ISO-8859-1 version of the XML Declaration, so I was reticent to just change it. I didn't want us to be writing BeerXML that other software would not be able to import. That said, if we were able to determine that most/all other software is, in fact, happy to read BeerXML with UTF-8 encoding (ie everyone else takes the same broad reading of the standard as the authors of the Wikipedia page), then perhaps we'd be OK to switch to using UTF-8 ourselves. This is something we should first get input from @mikfire on though. He knows more of the history here.

In the long run, I think a lot of folks want BeerJSON to supplant BeerXML. (AIUI, the data model of BeerJSON is based on a draft BeerXML 2.0 standard that never got off the ground.) One good thing with BeerJSON is that it is defined by its published schema, so it's less ambiguous than BeerXML and much easier to validate.

mikfire commented 2 years ago

You just had to summon me.

I want beerXML to die. In flames. I will laugh and dance with joy around its funeral pyre.

The ISO-8859-1 encoding was here before I was, and I never thought about. Historically, beerxml is a mess and nobody really generates valid code. Each project sort of adjusts to consume the output of the others. I've some bad history here, so I am not going into details.

I have no real guidance on this. It seems a reasonable change, but I simply do not understand the effects enough to say more.

matty0ung commented 2 years ago

Thanks @mikfire!

@bebenlebricolo, I have had a quick look at the code and have some good news for you. I see now that I actually already coded a fix for the ISO-8859-1 encoding bug(!). It's part of the database mega-change that is waiting approval from @mikfire (see https://github.com/Brewtarget/brewtarget/pull/617/). Essentially, as part of that fix, I added out.setCodec(QTextCodec::codecForMib(CharacterSets::ISO_8859_1_1987)); to xml/BeerXml.cpp. I believe this should remedy your original problem with accents.

I think @mikfire is soon going to do the merge for https://github.com/Brewtarget/brewtarget/pull/617/, so once that happens, grab the latest Brewtarget code and give it a try. (If you're really impatient, you could install https://github.com/Brewken/brewken, which already has the fix merged - see line 736 of https://github.com/Brewken/brewken/blob/develop/src/xml/BeerXml.cpp. Brewken will ultimately be "Brewtarget plus features for microbreweries" but at the moment it's where I've been working on Database/BeerJSON stuff before backporting to Brewtarget.)

bebenlebricolo commented 2 years ago

Neat ! This might work, I'll test when I have time. You are right about the accents, I reopened the same file with the ISO-8859-1 encoding and was able to write all the accents and special characters I wondered about via vscode, so it seems to be supported indeed. Note to myself : I suppose ISO-8859-1 encodes accents using 2 characters ligatures, ie to form an "è", we need to serialize an "`" and an "e" consecutively, and the text renderer should be able to process it and output the "è" accordingly.

Thanks @matty0ung and @mikfire for your replies, I'll test the new versions when available or even test the feature on your branch directly, I suppose brewtarget is quite straightforward to build.

matty0ung commented 2 years ago

No problem. :+1:

BTW, not that it matters, but, since you mention it, ISO-8859-1 does not use 2 character ligatures. It's a single-byte character set -- see https://en.wikipedia.org/wiki/ISO/IEC_8859-1 (ou https://fr.wikipedia.org/wiki/ISO/CEI_8859-1 si tu préfères en français!) It's all hangovers from the days of pain before Unicode (and its encodings such as UTF-8) unified the world. There is a whole world of detail and pain that you could get into in investigating these things, but largely of historical interest. Eg modern standards such as JSON are pretty much all UTF-8 so we (probably) won't have to think about character encodings when we do BeerJSON.

bebenlebricolo commented 2 years ago

Hi (again) Tested with a local build from the latest commit on develop 33a5fcc3ba4dbb9c27da93703783d7d07f7b922a (merge commit for PR #617 ) and it seems to be working. Reopening the generated xml file with the right encoding (ISO-8859-1) in vscode proved to solve the issue as text is encoded as expected (with the few accents I've tested, at least).

@matty0ung thanks for the insights, I should have had a look to the specification before guessing in the wild :sweat_smile: ! So it raises another question, which is off-topic and already answered by all the above comments : only a restricted character set is then supported by beerXML, we seem to be the lucky ones (in France) with the right character sets for our accents, but other languages won't be supported, like spanish for instance with their special "ñ" which won't be supported.

This leads to the point that a beerJSON implementation will fill all those encoding gaps (and all the pain will hopefully disappear, at least the ones caused by languages/character set compatibilities).

I'm closing this issue as the original subject was addressed by the #617 PR. Thanks both of you for your time !