ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

Note on adding a table without modifying the code #226

Closed btsimonh closed 4 years ago

btsimonh commented 4 years ago

Hi,

I needed to convert Big5-encoded DBCS text while including private use characters in the conversion (0xfa40+ -> U+E000+). Since the 'big5' table here is actually HKSCS, which conflicts with my codes, I needed to create a new DBCS table, and thought I would document the process here.

To add a DBCS table based on another, you need to do a few things:
1/ Call iconv.getCodec(); so that iconv.encodings exists.
2/ Create a table (or just the extra parts you want to add to an existing table).
3/ Create a new encoding definition (like those in dbcs-data.js). Note how I based it on an existing table (cp950) without having to require the relevant table file directly; requiring it was difficult because of paths.
4/ Add the new definition directly to iconv.encodings.
5/ Use your sparkly new table :).

Example code snippet:

var iconv = require('iconv-lite');

var private = [
    ["fa40","\ue000", 62],
    ["faa1","\ue03f", 93],
    ["fb40","\ue09d", 62],
    ["fba1","\ue0dc", 93],
    ["fc40","\ue13a", 62],
    ["fca1","\ue179", 93],
    ["fd40","\ue1d7", 62],
    ["fda1","\ue216", 93],
    ["fe40","\ue274", 62],
    ["fea1","\ue2b3", 93],
];

try {
    iconv.getCodec(); // throws without a name, but loads iconv.encodings as a side effect; passing ANY valid encoding name avoids the throw.
} catch(e) {
    // ignore
    console.log('ignored:', e);
}
var big5pua = {
    type: '_dbcs',
    table: function() {
        var tab = iconv.encodings['cp950'].table();  
        return tab.concat(private);
    },
    encodeSkipVals: [0xa2cc, 0xa2ce],
};

iconv.encodings['big5pua'] = big5pua;

// test our two duplicate characters and the first PUA character
const buf = Buffer.from('fa4020fa7efaa120fafefb4020fb7efba120fbfe20fefe20a2cca451a2cea4ca', 'hex');
const str = iconv.decode(buf, 'big5pua');
const buf2 = iconv.encode(str, 'big5pua');
console.log('src:',buf);
console.log('string:['+str+']');
var be = Buffer.from(str, 'utf16le').swap16();
console.log('string in utf16be:', be);
console.log('back to big5:',buf2);
/////////////////////////////////////////////////////////////////////////

Result is:

src: <Buffer fa 40 20 fa 7e fa a1 20 fa fe fb 40 20 fb 7e fb a1 20 fb fe 20 fe fe 20 a2 cc a4 51 a2 ce a4 ca>
string:[      十十卅卅]
string in utf16be: <Buffer e0 00 00 20 e0 3e e0 3f 00 20 e0 9c e0 9d 00 20 e0 db e0 dc 00 20 e1 39 00 20 e3 10 00 20 53 41 53 41 53 45 53 45>
back to big5: <Buffer fa 40 20 fa 7e fa a1 20 fa fe fb 40 20 fb 7e fb a1 20 fb fe 20 fe fe 20 a4 51 a4 51 a4 ca a4 ca>
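
The row format used in the private table above can be checked by hand against this output. The following is a minimal sketch, assuming each row is [start code in hex, string of characters, count], where the trailing number means "continue for that many more codes, incrementing the last character each time". expandRow is a hypothetical helper written for illustration, not part of iconv-lite, and it ignores the special cases (surrogates, sequences) that the real codec handles.

```javascript
// Hypothetical helper (not an iconv-lite API): expand one DBCS table row
// into [byteCode, charCode] pairs under the assumed row format.
function expandRow(row) {
    let code = parseInt(row[0], 16); // first element: starting double-byte code
    let last = -1;                   // char code of the most recently mapped character
    const pairs = [];
    for (const part of row.slice(1)) {
        if (typeof part === 'string') {
            // A string maps its characters to consecutive codes.
            for (let i = 0; i < part.length; i++) {
                last = part.charCodeAt(i);
                pairs.push([code++, last]);
            }
        } else {
            // A number extends the run: `part` more codes, each one char code higher.
            for (let i = 0; i < part; i++) {
                pairs.push([code++, ++last]);
            }
        }
    }
    return pairs;
}

const hex = (n) => n.toString(16);
const first = expandRow(['fa40', '\ue000', 62]);
const second = expandRow(['faa1', '\ue03f', 93]);

// Last entry of each row: byte code and the character it maps to.
console.log(first.length, hex(first[62][0]), hex(first[62][1]));    // 63 fa7e e03e
console.log(second.length, hex(second[93][0]), hex(second[93][1])); // 94 fafe e09c
```

Under that assumption, fa7e maps to U+E03E and fafe to U+E09C, which matches the "e0 3e" and "e0 9c" values in the utf16be dump above.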

Note: I am not suggesting this is a good rendition of plain old Big5 - more work to do to analyse that, but it illustrates how to abuse iconv-lite to do 'special' encodings without forking.

Note 2: If (like me) you have a large and complex project where this code COULD exist more than once with different values, or the values are dynamic in some way, be aware of caching inside iconv-lite.

It may be a nice mod to make getCodec(undefined) not throw, or to provide another function which loads the encodings, but it is not necessary. Calling encodingExists(undefined) would actually do this. A cache reset function would also be nice, although I'm sure setting iconv._codecDataCache = {} would work fine.
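
For illustration, the two ideas in that paragraph can be combined into one small helper. This is only a sketch: registerEncoding is a hypothetical name, not an iconv-lite function, and it relies on two assumptions taken from the discussion here, namely that encodingExists() loads iconv.encodings without throwing and that replacing iconv._codecDataCache (an internal, undocumented property) discards cached codecs.

```javascript
// Hypothetical helper (not part of iconv-lite): register or replace a custom
// encoding definition and drop cached codecs so the new definition is used.
function registerEncoding(iconv, name, definition) {
    if (!iconv.encodings) {
        // Assumption: encodingExists() swallows lookup errors internally,
        // so this populates iconv.encodings without a try/catch.
        iconv.encodingExists('utf8');
    }
    iconv.encodings[name] = definition;
    // Assumption: codecs are cached per encoding name in this internal object;
    // resetting it forces the next encode/decode to rebuild from the definition.
    iconv._codecDataCache = {};
}

// Usage with the definition from the snippet above (requires iconv-lite):
// const iconv = require('iconv-lite');
// registerEncoding(iconv, 'big5pua', big5pua);
```

Because it touches internals, a helper like this may break between iconv-lite versions.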

Maybe this could go in the wiki?

Simon

ashtuchkin commented 4 years ago

Thanks for writing this; I've added your description at https://github.com/ashtuchkin/iconv-lite/wiki/Modifying-encoding-tables.

As for making getCodec(undefined) not raise exceptions - I think you're right, it's better to provide a separate function to load the encodings. I'll think about how to implement that.

btsimonh commented 4 years ago

My pleasure, and thanks for a nice repo :).