JvanKatwijk / dab-cmdline

DAB decoding library with example of its use
GNU General Public License v2.0

Display program name characters correctly #27

Closed athoik closed 6 years ago

athoik commented 6 years ago

Hi,

The following I/Q sample has some special characters in program names.

La 1ère BXL La 1ère Wallonie VivaCité

Test Musiq3 +    (6358) is part of the ensemble
BRF 2            (6367) is part of the ensemble
BRF 1            (6366) is part of the ensemble
La 1�re Wallonie (6351) is part of the ensemble
TPEG_PACKET      (data) (E0606361) is part of the ensemble
Musiq3           (6353) is part of the ensemble
La 1�re BXL      (6951) is part of the ensemble
VivaCit�         (6052) is part of the ensemble
Test Classic 21+ (6356) is part of the ensemble
Classic 21       (6354) is part of the ensemble
TARMAC           (6357) is part of the ensemble
Pure             (6355) is part of the ensemble

The é and è are using extended ascii code 130 and 138.

Is there a way to detect which encoding is used for the program name through the library, or should the program handle it somehow?

Here is a RAW I/Q sample: 20171226_092958_12B.iq (39.1 MB) https://mega.nz/#!eY8ykbjY!olCaQY_2x27Bva_8QLUAZODiMP0tNW2YzBtYnLwLMd8

JvanKatwijk commented 6 years ago

Hi

I am aware of the character issue. The DAB data does indicate which character set is used; however, I am not aware of a decent character handling library in C++. If you have suggestions, I would really appreciate them.

best jan

JvanKatwijk commented 6 years ago

I looked into it. The issue is that the DAB specification prescribes an EBU Latin encoding. As you might have guessed, the characters that are not displayed correctly have the 8th bit set (i.e. it looks like a variant of ISO 8859). Setting the locale to ..8859 does not help. I can map all characters onto their UTF-8 equivalents, but the Linux environment does not seem to process these UTF-8 characters correctly.

So: looking into it: yes, solution found: not yet

athoik commented 6 years ago

Hi,

Maybe we can use this code:

https://github.com/Opendigitalradio/ODR-PadEnc/blob/master/src/charset.h https://github.com/Opendigitalradio/ODR-PadEnc/blob/master/src/charset.cpp

Thanks!

JvanKatwijk commented 6 years ago

Thanks

I'll look into it tomorrow,

jan

JvanKatwijk commented 6 years ago

These files convert in the wrong direction. The data encoded in DAB uses an 8859-1 encoding. It is fairly easy to translate the characters with bit 8 set to UTF-8. However, although the box I am using should be able to handle UTF-8, the problem remains.
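The bit-8 translation described here can be sketched as below. This is only a minimal sketch and not the dab-cmdline code: the name `latin1ToUtf8` is made up for illustration, and it assumes the input really is ISO 8859-1, where every byte value equals its Unicode code point. In the EBU Latin table posted later in this thread, é sits at position 0x82 rather than 0xE9, so the same arithmetic applied to an EBU Latin label would yield the control character U+0082 instead of é.

```cpp
#include <string>

// Hypothetical helper: convert ISO 8859-1 bytes to UTF-8.
// Bytes below 0x80 are ASCII and are copied unchanged; bytes from 0x80
// upwards become a two-byte UTF-8 sequence for the same code point.
std::string latin1ToUtf8 (const std::string &in) {
    std::string out;
    out.reserve (in.size () * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back (static_cast<char> (c));
        } else {
            out.push_back (static_cast<char> (0xC0 | (c >> 6)));
            out.push_back (static_cast<char> (0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Calling `latin1ToUtf8 ("\xE9")` gives the two bytes 0xC3 0xA9, which a UTF-8 terminal renders as é.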

I'll try this afternoon on an Ubuntu box.

best jan

JvanKatwijk commented 6 years ago

Hi

I changed the code so that all output strings are encoded in UTF-8. However, I have a problem setting the locale to anything other than en-US.UTF-8, so I cannot verify the results.

best j

athoik commented 6 years ago

Hi,

I didn't get valid UTF-8 back. With the following change, however, everything looks OK here:

diff --git a/library/src/backend/charsets.cpp b/library/src/backend/charsets.cpp
index b9bbbb8..3135c46 100644
--- a/library/src/backend/charsets.cpp
+++ b/library/src/backend/charsets.cpp
@@ -69,6 +69,24 @@ static const unsigned short ebuLatinToUcs2[] = {
 /* 0xf8 - 0xff */ 0xfe,   0x014b, 0x0155, 0x0107, 0x015b, 0x017a, 0x0167, 0xff
 };

+static const char* utf8_encoded_EBU_Latin[] = {
+"\0", "Ę", "Į", "Ų", "Ă", "Ė", "Ď", "Ș", "Ț", "Ċ", "\n","\v","Ġ", "Ĺ", "Ż", "Ń",
+"ą", "ę", "į", "ų", "ă", "ė", "ď", "ș", "ț", "ċ", "Ň", "Ě", "ġ", "ĺ", "ż", "\u0082",
+" ", "!", "\"","#", "ł", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/",
+"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?",
+"@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O",
+"P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "[", "Ů", "]", "Ł", "_",
+"Ą", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
+"p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "«", "ů", "»", "Ľ", "Ħ",
+"á", "à", "é", "è", "í", "ì", "ó", "ò", "ú", "ù", "Ñ", "Ç", "Ş", "ß", "¡", "Ÿ",
+"â", "ä", "ê", "ë", "î", "ï", "ô", "ö", "û", "ü", "ñ", "ç", "ş", "ğ", "ı", "ÿ",
+"Ķ", "Ņ", "©", "Ģ", "Ğ", "ě", "ň", "ő", "Ő", "€", "£", "$", "Ā", "Ē", "Ī", "Ū",
+"ķ", "ņ", "Ļ", "ģ", "ļ", "İ", "ń", "ű", "Ű", "¿", "ľ", "°", "ā", "ē", "ī", "ū",
+"Á", "À", "É", "È", "Í", "Ì", "Ó", "Ò", "Ú", "Ù", "Ř", "Č", "Š", "Ž", "Ð", "Ŀ",
+"Â", "Ä", "Ê", "Ë", "Î", "Ï", "Ô", "Ö", "Û", "Ü", "ř", "č", "š", "ž", "đ", "ŀ",
+"Ã", "Å", "Æ", "Œ", "ŷ", "Ý", "Õ", "Ø", "Þ", "Ŋ", "Ŕ", "Ć", "Ś", "Ź", "Ť", "ð",
+"ã", "å", "æ", "œ", "ŵ", "ý", "õ", "ø", "þ", "ŋ", "ŕ", "ć", "ś", "ź", "ť", "ħ"};
+
 std::string toStringUsingCharset (const char* buffer,
                                  CharacterSet charset, int size) {
 std::string  s;
@@ -91,11 +109,8 @@ uint16_t i;
           case EbuLatin:
           default:
              for (i = 0; i < length; i++)
-                if (buffer [i] & 0x80) {
-                   uint8_t c0 =  (0xc0 | (((uint8_t)buffer [i]) >> 6));
-                   uint8_t c1 =  ((buffer [i] & 0x3f) | 0x80);
-                   s. push_back (c0);
-                   s. push_back (c1);
+                if (buffer [i] & 0xff) {
+                    s. append (utf8_encoded_EBU_Latin[buffer[i] & 0xff]);
                 }
                 else
                    s. push_back (buffer [i]);

$ dab-raw-3 -F 20171226_092958_12B.iq
dab_cmdline V 1.0alfa,
                      Copyright 2017 J van Katwijk, Lazy Chair Computing
opt = F
ofdm word gestart
Period = 8000
End of file, restarting
there might be a DAB signal here

no ensemble data found, fatal
BRF 1            (6366) is part of the ensemble
La 1ère Wallonie (6351) is part of the ensemble
TPEG_PACKET      (data) (E0606361) is part of the ensemble
End of file, restarting
Classic 21       (6354) is part of the ensemble
ensemble RTBF DAB         is (6005) recognized
Test Musiq3 +    (6358) is part of the ensemble
BRF 2            (6367) is part of the ensemble
TARMAC           (6357) is part of the ensemble
Pure             (6355) is part of the ensemble
Musiq3           (6353) is part of the ensemble
VivaCité         (6052) is part of the ensemble
Test Classic 21+ (6356) is part of the ensemble
La 1ère BXL      (6951) is part of the ensemble
End of file, restarting
^C

JvanKatwijk commented 6 years ago

Great. Thanks,

athoik commented 6 years ago

Just a note: characters 1 to 127 are still appended directly to the string, although the utf8_encoded_EBU_Latin table does not match ASCII/ISO for all of them.

E.g. "\x01" maps to "Ę" in EBU Latin, but in ASCII "\x01" is ^A (SOH).

So the following is still required, if I am not mistaken.

diff --git a/library/src/backend/charsets.cpp b/library/src/backend/charsets.cpp
index dcf5221..f030357 100644
--- a/library/src/backend/charsets.cpp
+++ b/library/src/backend/charsets.cpp
@@ -110,11 +110,8 @@ uint16_t i;
           case EbuLatin:
           default:
              for (i = 0; i < length; i++)
-                if (buffer [i] & 0x80) {
-                   if (buffer [i] & 0xff) {
+                if (buffer [i] & 0xff)
                       s. append (utf8_encoded_EBU_Latin [buffer[i] & 0xff]);
-                   }
-                }
                 else
                    s. push_back (buffer [i]);
        }

JvanKatwijk commented 6 years ago

The test was done twice: the 8-bit test was done first, otherwise the char is added to the buffer directly. I'll change it to if (buffer [i] & 0x80) .... else s. push_back ...

athoik commented 6 years ago

I am sorry once again, but this will work only for positions >= 128.

What about "Ę", (0x01), "Į", (0x02) etc?

It seems that the EBU Latin character set defines those positions differently from the normal ASCII control characters.
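To make the point concrete, here is a small sketch of a lookup that is applied to every byte, not only to bytes with the top bit set. The helper names `ebuLatinSample` and `ebuLatinToUtf8Sketch` are hypothetical, and the sample covers only a handful of positions copied from the utf8_encoded_EBU_Latin table posted earlier in this thread; a real converter would use the full 256-entry table.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: UTF-8 string for a few EBU Latin positions
// (taken from the utf8_encoded_EBU_Latin table in this thread),
// or nullptr for positions this sketch does not cover.
static const char *ebuLatinSample (uint8_t c) {
    switch (c) {
        case 0x01: return "\xC4\x98";   // Ę (U+0118)
        case 0x02: return "\xC4\xAE";   // Į (U+012E)
        case 0x24: return "\xC5\x82";   // ł (U+0142)
        case 0x82: return "\xC3\xA9";   // é (U+00E9)
        case 0x83: return "\xC3\xA8";   // è (U+00E8)
        default:   return nullptr;
    }
}

// Look up every byte regardless of its range. Bytes missing from the
// sample are copied if they are below 0x80 and replaced by '?' otherwise,
// so the result is always valid UTF-8.
std::string ebuLatinToUtf8Sketch (const char *buffer, size_t size) {
    std::string s;
    for (size_t i = 0; i < size; i++) {
        uint8_t c = static_cast<uint8_t> (buffer [i]);
        if (const char *mapped = ebuLatinSample (c))
            s.append (mapped);
        else
            s.push_back (c < 0x80 ? static_cast<char> (c) : '?');
    }
    return s;
}
```

With a test on bit 8 only, the byte 0x01 would be pushed as the control character SOH; with the unconditional lookup it becomes "Ę".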

JvanKatwijk commented 6 years ago

Hi

As far as I know, EBU Latin is ASCII in its first 127 positions, followed by the special characters?

jan

athoik commented 6 years ago

Then this table is wrong? https://github.com/Opendigitalradio/ODR-PadEnc/blob/master/src/charset.cpp#L38

athoik commented 6 years ago

It seems ok according to: ETSI TS 101 756 v1.8.1. (page 41)

https://worlddabeureka.org/2015/08/03/issue-26-new-latin-based-character-set-for-dab/

http://www.etsi.org/deliver/etsi_ts/101700_101799/101756/01.08.01_60/ts_101756v010801p.pdf

JvanKatwijk commented 6 years ago

Well, according to ETSI TS 101 756 the table is correct, apart from the first two rows, which are not specified in 101 756. The characters from 040 to 177 (octal) are the ASCII set.

athoik commented 6 years ago

I think we should handle EBU Latin separately from ISO Latin and add utf16to8 (eg from utfcpp).

0000    Complete EBU Latin based repertoire - see annex C
0100    ISO Latin Alphabet No. 1 (see ISO/IEC 8859-1 [8]) 
0110    ISO/IEC 10646 [26] using UCS-2 transformation format, big endian byte order 
1111    ISO/IEC 10646 [26] using UTF-8 transformation format

Most probably today most people still use Latin1 :)

JvanKatwijk commented 6 years ago

Well, 8859-1 and EBU Latin share the ASCII subset. The charset number should indicate the charset used; I have never seen anything other than 0. But if you have a suggestion?

best jan

athoik commented 6 years ago

Hi,

I think the following will be fine, until somebody uses UCS2 encoding.

diff --git a/library/includes/backend/charsets.h b/library/includes/backend/charsets.h
index 4851443..399b481 100644
--- a/library/includes/backend/charsets.h
+++ b/library/includes/backend/charsets.h
@@ -33,8 +33,9 @@
  */
 typedef enum {
     EbuLatin   = 0x00, // Complete EBU Latin based repertoire - see annex C
-    UnicodeUcs2 = 0x06,
-    UnicodeUtf8 = 0x0F
+    IsoLatin    = 0x04, // ISO Latin Alphabet No. 1 (see ISO/IEC 8859-1 [8])
+    UnicodeUcs2 = 0x06, // ISO/IEC 10646 [26] using UCS-2 transformation format, big endian byte order
+    UnicodeUtf8 = 0x0F  // ISO/IEC 10646 [26] using UTF-8 transformation format
 } CharacterSet;

 /**
diff --git a/library/src/backend/charsets.cpp b/library/src/backend/charsets.cpp
index cd8d6db..202421e 100644
--- a/library/src/backend/charsets.cpp
+++ b/library/src/backend/charsets.cpp
@@ -100,21 +100,20 @@ uint16_t i;
           length = size;

        switch (charset) {
-//        case UnicodeUcs2:
-//           s = std::string::fromUtf16 ((const ushort*) buffer, length);
-//           break;
+          case EbuLatin:
+              for (i = 0; i < length; i++)
+                 s. append (utf8_encoded_EBU_Latin [buffer[i] & 0xff]);
+              break;

-          case UnicodeUtf8:
-             break;
+           case UnicodeUcs2:
+              throw std::logic_error("UnicodeUcs2 to Utf8 not yet implemented");
+              break;

-          case EbuLatin:
+          case IsoLatin:
+          case UnicodeUtf8:
           default:
-             for (i = 0; i < length; i++)
-                if (buffer [i] & 0x80) {       // extended char
-                   s. append (utf8_encoded_EBU_Latin [buffer[i] & 0xff]);
-                }
-                else
-                   s. push_back (buffer [i]);
+              for (i = 0; i < length; i++)
+                s. push_back (buffer [i]);
        }

        return s;

JvanKatwijk commented 6 years ago

Sounds like a pragmatic approach.

athoik commented 6 years ago

Great!

I created a PR to fix a few typos after the latest merge.

JvanKatwijk commented 6 years ago

Thanks!

athoik commented 6 years ago

I guess we are done here. If a broadcast using UCS-2 ever appears, we will need a UCS-2 to UTF-8 function ;)
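Such a function could look roughly like the sketch below. It is not part of dab-cmdline: the name `ucs2BeToUtf8` and its signature are assumptions chosen to resemble toStringUsingCharset, and it assumes big-endian byte order as listed for charset 0110 above. A library routine such as the utf16to8 from utfcpp mentioned earlier would be an alternative.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: convert big-endian UCS-2 (two bytes per character)
// to UTF-8. A trailing odd byte is ignored.
std::string ucs2BeToUtf8 (const char *buffer, size_t size) {
    std::string s;
    for (size_t i = 0; i + 1 < size; i += 2) {
        uint16_t cp = static_cast<uint16_t> (
            (static_cast<uint8_t> (buffer [i]) << 8) |
             static_cast<uint8_t> (buffer [i + 1]));
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;        // surrogates are not valid UCS-2 characters
        if (cp < 0x80) {        // one byte: plain ASCII
            s.push_back (static_cast<char> (cp));
        } else if (cp < 0x800) { // two bytes
            s.push_back (static_cast<char> (0xC0 | (cp >> 6)));
            s.push_back (static_cast<char> (0x80 | (cp & 0x3F)));
        } else {                // three bytes
            s.push_back (static_cast<char> (0xE0 | (cp >> 12)));
            s.push_back (static_cast<char> (0x80 | ((cp >> 6) & 0x3F)));
            s.push_back (static_cast<char> (0x80 | (cp & 0x3F)));
        }
    }
    return s;
}
```

UCS-2 only covers the Basic Multilingual Plane, so three UTF-8 bytes per character are always enough; surrogate values are replaced by U+FFFD here rather than being combined into pairs.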