bnoordhuis / node-iconv

node.js iconv bindings - text recoding for fun and profit!

Bad behavior of //TRANSLIT? #141

Closed NicolasJacob closed 8 years ago

NicolasJacob commented 8 years ago

Hello. On Unix we have this behavior:

$ echo "élloï “amœboïde € Φ" | iconv -t  ISO885916//TRANSLIT   | iconv  -f ISO885916 -t UTF8
élloï "amœboïde € ?

Notice that 'Φ' is transliterated to '?'.

With node-iconv, we get an exception instead:

> I=require('iconv').Iconv
> conv1 = new I('UTF-8', 'iso-8859-16//TRANSLIT');
> conv1.convert("élloï “amœboïde € Φ")
Error: Illegal character sequence.

The following piece of code doesn't produce a suitable result either, since I want to see '?' in place of 'Φ':

...
> conv1 = new I('UTF-8', 'iso-8859-16//TRANSLIT//IGNORE');
> conv2 = new I('iso-8859-16', 'UTF-8');
> conv2.convert(conv1.convert("élloï “amœboïde € Φ")).toString()
'élloï "amœboïde € '

I haven't found out how to do this, and in my opinion the current implementation of //TRANSLIT is not correct: it should do the same thing as on Unix.

Cheers

Nicolas

bnoordhuis commented 8 years ago

I can confirm the issue, but it lies with GNU libiconv: it doesn't have a transliteration for that character.

The reason it works for you with the iconv command line tool is that the default iconv tool on Linux is from a different library, glibc, that uses different rules for transliteration.

I feel libiconv's choice is defensible here because the transliteration is not reversible, not even approximately. Compare the result of glibc's iconv: it loses the character completely after a round trip.

$ echo -ne '\xCE\xA6' | iconv -f utf-8 -t iso-8859-16//translit | iconv -f iso-8859-16 -t utf-8 | xxd 
00000000: 3f                                       ?

(It's the //translit that turns it into a '?' but I'm trying to make a point about the lossiness of the round-trip.)
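To make the bytes in that command concrete, here is a small sketch using only Node's built-in `Buffer` (no iconv involved). It assumes nothing beyond the fact that 'Φ' is U+03A6 and '?' is ASCII 0x3F; the variable names are illustrative.

```javascript
// 'Φ' (U+03A6) encoded as UTF-8 is the two bytes CE A6,
// which is what `echo -ne '\xCE\xA6'` feeds into iconv.
const phi = Buffer.from('\u03A6', 'utf8');
console.log(phi.toString('hex')); // cea6

// After the '?' substitution only a single byte, 0x3F, is left.
const substituted = Buffer.from('?', 'latin1');
console.log(substituted.toString('hex')); // 3f

// Decoding that byte again yields '?', never 'Φ': the original
// character cannot be recovered from the converted output.
const roundTripped = substituted.toString('utf8');
console.log(roundTripped === '\u03A6'); // false
```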

If you feel strongly that substituting '?' is the correct thing to do, please file an issue with the upstream libiconv project. I'll close the issue, since this is outside node-iconv's power to change.
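For anyone who still wants '?' in place of unconvertible characters, one possible application-level workaround is to convert one code point at a time and substitute a replacement when the converter throws. The sketch below is not part of node-iconv: `convertWithFallback` and `latin1Strict` are hypothetical names, and `latin1Strict` is only a stand-in converter so the example runs without the native module. It also assumes a stateless target encoding and treats every thrown error as "character not convertible", which is a simplification.

```javascript
// Hypothetical helper, not part of node-iconv.
// `convert` is any function that takes a string and returns a Buffer,
// and throws when a character has no representation in the target charset.
function convertWithFallback(text, convert, replacement = '?') {
  const chunks = [];
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    try {
      chunks.push(convert(ch));
    } catch (err) {
      // Simplification: any error is treated as an unconvertible character.
      chunks.push(convert(replacement));
    }
  }
  return Buffer.concat(chunks);
}

// Stand-in converter for demonstration only: accepts code points up to
// U+00FF (Latin-1) and throws on anything else, mimicking the exception
// shown earlier in the thread.
function latin1Strict(s) {
  for (const ch of s) {
    if (ch.codePointAt(0) > 0xff) {
      throw new Error('Illegal character sequence.');
    }
  }
  return Buffer.from(s, 'latin1');
}

const out = convertWithFallback('h\u00e9llo \u03A6', latin1Strict);
console.log(out.toString('latin1')); // héllo ?
```

With node-iconv, the same helper could presumably be called as `convertWithFallback(text, (s) => conv1.convert(s))`, where `conv1` is the `Iconv` instance created above; this thread does not confirm that usage. Converting character by character is slow and gives the converter no context across characters, so it is best suited to short strings.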