Closed Vanuan closed 10 years ago
Of course, DBCS needs separate treatment. I just meant that single-byte encodings allow a more efficient way of storing the same data. Compare this:
var table = {
"1251":[1026,1027,8218,1107,8222,8230,8224,8225,8364,8240,1033,8249,1034,1036,1035,1039,1106,
8216,8217,8220,8221,8226,8211,8212,152,8482,1113,8250,1114,1116,1115,1119,160,1038,1118,1032,
164,1168,166,167,1025,169,1028,171,172,173,174,1031,176,177,1030,1110,1169,181,182,183,1105,
8470,1108,187,1112,1029,1109,1111,1040,1041,1042,1043,1044,1045,1046,1047,1048,1049,1050,
1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1063,1064,1065,1066,1067,1068,
1069,1070,1071,1072,1073,1074,1075,1076,1077,1078,1079,1080,1081,1082,1083,1084,1085,1086,
1087,1088,1089,1090,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100,1101,1102,1103],
...
}
to the equivalent object mapping. It is at least 4 times smaller, and it just needs a bit of code:
function decode(string, codepage) {
  var indexes = table[codepage], decoded = "";
  for (var i = 0; i < string.length; ++i) {
    var code = string.charCodeAt(i);
    if (code < 128) {
      // ASCII range maps to itself
      decoded += string[i];
    } else {
      // high bytes are looked up in the 128-entry codepage table
      decoded += String.fromCharCode(indexes[code - 128]);
    }
  }
  return decoded;
}
Expanding on @Vanuan's suggestion, encode/decode should accept and return either Node.js Buffers or Strings.
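A minimal sketch of what that could look like (the wrapper shape and the tiny stand-in table are mine, not from the library):

```javascript
// Sketch only: a decode that accepts either a Node.js Buffer or a string.
// `table` here is a tiny stand-in for the real codepage tables; only the
// first four entries of the CP1251 high range (0x80..0x83) are included.
var table = {
  "1251": [1026, 1027, 8218, 1107]
};

function decode(input, codepage) {
  var indexes = table[codepage], decoded = "";
  for (var i = 0; i < input.length; ++i) {
    // Buffer indexing yields a byte value directly; strings need charCodeAt
    var code = typeof input === "string" ? input.charCodeAt(i) : input[i];
    decoded += code < 128
      ? String.fromCharCode(code)                 // ASCII maps to itself
      : String.fromCharCode(indexes[code - 128]); // table lookup for high bytes
  }
  return decoded;
}
```

Both `decode(Buffer.from([0x41, 0x80]), "1251")` and `decode("A\u0080", "1251")` would then take the same path through the table.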
@Vanuan separate source files are available for individual codepages. For a tight solution, you can string them together. The utils functions from cputils.js work just as well with the individual scripts as with the monster script.
Is there a way to reduce the disk footprint? For example, adding separate scripts for encoding/decoding, using raw characters instead of number strings, using a minifier, etc.
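On the "raw characters instead of number strings" idea, one possible shape (names are illustrative; a real generator script would emit the string literal directly into the source file):

```javascript
// Sketch: store the high half of a codepage as one 128-character string
// instead of an array of numeric literals; each character's code point IS
// the mapping. Built here from a tiny 4-entry stand-in array for CP1251.
var cp1251High = String.fromCharCode.apply(null, [1026, 1027, 8218, 1107]);

function decodeByte(byte) {
  // bytes below 128 map to themselves; others index into the string
  return byte < 128 ? String.fromCharCode(byte) : cp1251High.charAt(byte - 128);
}
```

On disk that is one character per entry (up to three bytes in UTF-8) versus up to five bytes for a comma-separated decimal literal, so it should shrink the tables noticeably even before minification.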
As a side note, it might also be useful to introduce an efficient decoding function, e.g.:
It might even be possible to just use arrays, with the character code as the index. That would waste space on the first 128 entries, though, so the size is comparable to using objects.
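A minimal sketch of that full-array layout (the names and the tiny stand-in table are mine): the first 128 slots just repeat ASCII, trading space for a branch-free lookup.

```javascript
// Sketch: a flat 256-entry array indexed directly by byte value.
var full = new Array(256);
for (var i = 0; i < 128; ++i) full[i] = i;  // identity for the ASCII range
var high = [1026, 1027, 8218, 1107];        // stand-in for the real CP1251 table
for (var j = 0; j < high.length; ++j) full[128 + j] = high[j];

function decodeFull(bytes) {
  var out = "";
  for (var k = 0; k < bytes.length; ++k) {
    out += String.fromCharCode(full[bytes[k]]); // single lookup, no branch
  }
  return out;
}
```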