ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

add a method to split a string into an array of encodable and non-encodable substrings #73

Open Mithgol opened 10 years ago

Mithgol commented 10 years ago

I'd like to propose an alternative to #53.

It is supposed in #53 that “invalid” characters (i.e. characters that cannot be encoded with the given encoding) should be dealt with individually. Sometimes, however, it is more useful to deal with whole substrings of such characters. For such cases I propose a method that would split any given string into an array of alternating encodable and non-encodable substrings.

Example:

var iconvLite = require('iconv-lite');
console.log(
   iconvLite.split('Хлѣбъ です。', 'cp866')
); // output: ['Хл', 'ѣ', 'бъ ', 'です。']

The method suggested above is inspired by the behaviour of String.prototype.split when it is given a regular expression enclosed in a single set of capturing parentheses:

console.log(
   'foo-bar'.split(/(-+)/)
); // output: [ 'foo', '-', 'bar' ]
console.log(
   '--foo-bar'.split(/(-+)/)
); // output: [ '', '--', 'foo', '-', 'bar' ]

The proposed method is meant to remind its users of String.prototype.split (hence the name .split) and thus be understood by analogy.

To make the analogy complete, it should also behave similarly: even array indices (0, 2, 4…) should always correspond to encodable substrings, while odd array indices (1, 3, 5…) should always correspond to non-encodable substrings. (To achieve that, the first substring in the returned array could sometimes be intentionally left blank, as String.prototype.split does in the [ '', '--', 'foo', '-', 'bar' ] example above, to preserve the meaning of odd and even indices.)
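
For illustration, here is a minimal userland sketch of the proposed behaviour (not the suggested library method itself; it simply assumes that a character is non-encodable whenever an encode/decode round trip through iconv-lite fails to reproduce it, and it ignores surrogate pairs):

var iconvLite = require('iconv-lite');

function split(str, encoding) {
   var parts = [''], encodable = true; // parts[0] is always encodable
   for (var i = 0; i < str.length; i++) {
      var ch = str[i];
      // Unencodable characters come back as the replacement char:
      var ok = iconvLite.decode(iconvLite.encode(ch, encoding), encoding) === ch;
      if (ok !== encodable) { parts.push(''); encodable = ok; }
      parts[parts.length - 1] += ch;
   }
   return parts;
}

console.log(split('Хлѣбъ です。', 'cp866'));
// output: [ 'Хл', 'ѣ', 'бъ ', 'です。' ]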

Mithgol commented 10 years ago

Examples of possible usage:

// check whether the whole string is encodable:
iconvLite.split(someString, someEncoding).length <= 1

// transliterate the non-encodable parts, keep the rest:
iconvLite.split(someString, someEncoding).map(function(substring, index){
   if( index % 2 === 0 ){ // encodable substring: 0th, 2nd, 4th…
      return substring;
   } else { // non-encodable substring: 1st, 3rd, 5th…
      // TODO: use a real transliteration here instead of a mere slugification
      return require('underscore.string').slugify(substring);
   }
}).join('');
ashtuchkin commented 10 years ago

I don't quite understand when it could be useful compared to a callback. Do you think several invalid characters could be handled in a smarter way if handled together?

The analogy to String#split seems a bit odd to me, especially when you need to check index % 2 to tell whether a substring is valid or invalid. Moreover, invalid portions would need to be Buffers, as we couldn't convert them to js strings.

If there really is a smarter way to handle multiple invalid chars as opposed to single ones, then I'd suggest using the same callback, but passing multiple invalid bytes to it as a Buffer. What do you think?
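
For context, a #53-style callback with that change might look roughly like this when decoding (the option name invalidByteHandler is made up for this sketch; it is not an actual iconv-lite API):

var iconv = require('iconv-lite');

var str = iconv.decode(new Buffer([0xc3, 0x28]), 'utf8', {
   // Hypothetical: the handler gets a Buffer with the whole run of invalid bytes.
   invalidByteHandler: function(bytes) {
      return '\uFFFD'; // e.g. collapse the run into a single replacement char
   }
});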

Mithgol commented 10 years ago

I have to admit I could have misunderstood what a callback in #53 actually meant.

What I proposed here is a method to deal with the non-encodable parts of a Unicode string, not with the non-decodable parts of a Buffer.

Therefore the even parts (index % 2 === 0) here are not decoded parts of some Buffer, but rather Unicode substrings that fit into the specified encoding (they aren't touched yet, just split out of the original string).

Likewise, the odd parts (index % 2 !== 0) are not Buffers that could not be decoded, but rather Unicode substrings containing characters that are not supported by the specified encoding and thus cannot be encoded.

Does it explain the absence of Buffers and the analogy to String#split?

Mithgol commented 10 years ago

The thought that several invalid characters could be handled in a smarter way if handled together is one I borrowed from the “Encoding” section of Wikipedia's “UTF-7” article, which says:

A simple encoder may encode all characters it considers safe for direct encoding directly. However, the cost of ending a Unicode sequence, outputting a single character directly in ASCII and then starting another Unicode sequence is 3 to 3⅔ bytes. This is more than the 2⅔ bytes needed to represent the character as part of a Unicode sequence.

Imagine a medium where most characters are encoded with some default encoding (probably one byte per character, such as CP866 or KOI8-R), but the rest are converted to UTF-7. Such a method (.split) would facilitate smarter handling of “invalid” characters and even a smarter way of dealing with single “valid” characters that appear surrounded by “invalid” ones.

ashtuchkin commented 10 years ago

Ok, now I got you, thanks for the explanation!

Well, this would require additional code that does just this in all codecs. It's a rather big time investment for me, plus a burden on all future codecs, so I'm a bit reluctant to implement it without a hard use case.

The use case you described above is mostly solved by a callback, with the exception of a single valid char surrounded by invalid ones, which I believe is a rare case (even if the format supports and recommends handling it).

Mithgol commented 10 years ago

What if I wrote a pull request?

I may actually be willing to write it, but that depends on how many codecs iconv-lite has: sbcs-codec.js, dbcs-codec.js, internal.js, utf16.js, anything else?

Preliminary note 1. For the Unicode encodings (such as UTF-16) any JavaScript character is encodable, and thus .split(inputString, encodingUnicode) would simply return [inputString]. I guess implementing .split() for the last of the above four codecs (utf16.js) is a piece of cake.

Preliminary note 2. One of the remaining codecs (internal.js) supports eight encodings. Five of them ('utf8', 'cesu8', 'unicode11utf8', 'ucs2', 'utf16le') are Unicode encodings and thus .split(inputString, encodingUnicode) would also simply return [inputString] for them. The remaining three encodings ('binary', 'base64', 'hex') have simple regexp character classes that define the supported characters (/[\x00-\xFF]/, /[A-Za-z0-9+\/]/, /[0-9A-Fa-f]/) and thus String#split could be used as iconvLite#split for them (after these classes are negated to produce encodable substrings in even positions instead of odd).
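
For instance, here is note 2's trick applied to the 'hex' encoding (the other two classes work the same way; note the leading empty string keeping encodable runs at even positions):

'ДВА cafe 123'.split(/([^0-9A-Fa-f]+)/);
// output: [ '', 'ДВА ', 'cafe', ' ', '123' ]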

ashtuchkin commented 10 years ago

I'm still not convinced of the usefulness of this functionality. It would not only increase code complexity/size and the API surface, but also increase the burden of adding each new encoding, even if you write all the code for now. IMHO it's just not worth it until we find a compelling use case.

Mithgol commented 10 years ago

Well, I hope I have some (more or less compelling) use case.

You know how UTF-7 was first invented, right? There was (and somewhere still is) a medium (MIME headers) where Unicode was not permitted, and thus any characters of a Unicode string either would fit in the other (ASCII) encoding or would have to be encoded (using ASCII characters) and escaped (so as not to be confused with real ASCII characters).

I have to face another similar medium now. That medium is Fidonet, where the design of the most popular message editor (GoldED+) makes any support of multibyte encodings impossible. Therefore it is also not possible to simply write Unicode messages to Fidonet and expect them ever to be read by the users of GoldED+; however, if the text is mostly in Russian, it becomes possible to write most of the message in a single-byte encoding (such as CP866), split out the substrings that won't fit, encode them differently and escape them (so as not to be confused with the rest of the message).

If a standard arises for such encoding and escaping, then the other message editors (and mere Fidonet browsers and WebBBS) could collectively embrace and extend GoldED+.

Speaking of the different encoding of such substrings, at first I suggested (there, last paragraph) that it could be Punycode; then Serguei E. Leontiev pointed out that UTF-7 is more compact.

However, before they are encoded differently, these substrings (outside of the CP866 range) have to be isolated, i.e. split out of the string containing the original (Unicode) text of the message. That's my use case for the .split method suggested above.

Mithgol commented 10 years ago

In a nutshell this use case is a generalization of the UTF-7's use case: the Unicode characters are forced into some 8-bit medium (defined by one of the supported single-byte encodings) instead of UTF-7's original 7-bit medium.

The whole implementation of the use case would look like the following:

var iconvLite = require('iconv-lite');

iconvLite.extendNodeEncodings();

var UnicodeTo8bit = function(sourceString, targetEncoding){
   var buffers = iconvLite.split(
      sourceString, targetEncoding
   ).map(function(substring, index){
      if( index % 2 === 0 ){ // encodable substring: 0th, 2nd, 4th…
         return new Buffer(substring, targetEncoding);
      } else { // non-encodable substring: 1st, 3rd, 5th…
         // TODO: define an escaping function
         var escapedString = escapingFunction(
            new Buffer(substring, 'utf7').toString('utf8')
         );
         return new Buffer(escapedString, targetEncoding);
      }
   });
   return Buffer.concat(buffers);
};
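
With the example string from the top of the thread, the call would look like this (still assuming the escapingFunction from the TODO above is defined):

UnicodeTo8bit('Хлѣбъ です。', 'cp866');
// → a Buffer: 'Хл' and 'бъ ' in CP866, with the escaped
//   UTF-7 runs for 'ѣ' and 'です。' in between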
ashtuchkin commented 10 years ago

Ok, but why can't you do this using callbacks? Each range of non-encodable characters would be given to the callback, and it could either throw or return a string to replace them, similar to .replace(regexp, function) in javascript. This would also allow transliteration and checking that all characters are encodable. A much more flexible mechanism, which also fits nicely with streaming mode.

When the scheme stabilizes, you can also make it into its own codec. Or I can generalize it and make a meta-codec that takes a base encoding, escaping rules and an escaping encoding as parameters.
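
For concreteness, such a meta-codec might eventually be used like this (every name below is hypothetical; nothing of the sort exists in iconv-lite yet):

var iconv = require('iconv-lite');

var buf = iconv.encode('Хлѣбъ です。', 'escaped', {
   baseEncoding: 'cp866',   // characters encodable here pass through
   escapeEncoding: 'utf7',  // the rest get re-encoded with this
   escapeStart: '+',        // escaping rules
   escapeEnd: '-'
});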

What do you think?

Mithgol commented 10 years ago

Would such a callback receive only one character (the next non-encodable character), or a whole substring of such characters collected until the next encodable character is encountered (or the original string ends)?

ashtuchkin commented 10 years ago

I'm thinking about the latter case: it would be more convenient and probably faster too. There is one issue though: as iconv-lite is stream-oriented and works chunk-by-chunk, I would not want to accumulate ranges of invalid characters, because we could run out of memory on very long invalid streams.

So I'm thinking about the following interface:

function unencodableHandler(str, offset, context, rangeStarted, rangeFinished) {
  // str is a string of unencodable characters.
  // offset - index of str in context.
  // context is the whole string (chunk) currently encoded.
  // rangeStarted flag is true when str starts a range of contiguous invalid characters.
  // rangeFinished flag is true when str completes the range.
  // Flags are always true in non-streaming usage.

  // Return characters that can be translated thus far.
  return (rangeStarted ? start_escape : "") + escape(str) + (rangeFinished ? end_escape : "");
}

// Convert a string - all unencodable strings will be complete.
buf = iconv.encode(str, "cp866", {unencodableHandler: unencodableHandler});

// Or a stream
inputStream.pipe(iconv.encodeStream("cp866", {unencodableHandler: unencodableHandler})).pipe(outStream);

Mithgol commented 10 years ago

LGTM.

That context thing would seem even more helpful if it were guaranteed to contain some context both before and after the str (e.g. to decide whether the default start_escape / end_escape values are appropriate and efficient for that context).

For example, as Wikipedia says about UTF-7,

The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.

Thus end_escape could be '' by default and '-' if context contains its own '-' after the str.

However, in practice end_escape is going to be '-' by default and become '' only if context does contain some other character (not a minus) after the str. (If context just ends there, it might mean a mere end of the current chunk, and there's no guarantee that the next chunk won't start with its own '-'.)

However, that's still better than not having context at all. Good enough.
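
Under that convention, the handler from the interface above might derive end_escape from context roughly like this (utf7Encode here is an assumed helper, and the '+'/'-' framing follows UTF-7):

function unencodableHandler(str, offset, context, rangeStarted, rangeFinished) {
  var out = (rangeStarted ? '+' : '') + utf7Encode(str); // utf7Encode: assumed helper
  if (!rangeFinished) return out;
  var next = context[offset + str.length];
  // Default to '-'; omit it only when the next character is known and could
  // not be mistaken for more modified Base64 (or for a literal '-').
  if (next !== undefined && !/[A-Za-z0-9+\/-]/.test(next)) return out;
  return out + '-';
}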

Mithgol commented 9 years ago

Just a nudge because ≈nine months have passed.

ashtuchkin commented 9 years ago

Sorry, not much free time recently. I do remember it and will get to it eventually.