[Question] How to decode more than one encoding

Kikobeats commented 6 years ago

Hello,

Thanks for the library, it's very helpful 🙏 .

I'm afraid I'm doing something wrong with it, so I want to ask openly for advice.

I'm using iconv-lite for decoding HTML. I created html-encode for that purpose, and normally we are interested in getting a UTF-8 string.

My concern is about Base64-encoded HTML entities.

Let's say we have a simple HTML document like this:

<html><head></head><body><pre style="word-wrap:break-word;white-space:pre-wrap;">&lt;!DOCTYPE html&gt;
&lt;html lang="en"&gt;
&lt;head&gt;
  &lt;meta charset="UTF-8"&gt;
  &lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&gt;
  &lt;meta http-equiv="X-UA-Compatible" content="ie=edge"&gt;
  &lt;title&gt;Document&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;a href="https://httpbin-org.herokuapp.com/redirect/3"&gt;&lt;/a&gt;
&lt;/body&gt;
&lt;/html&gt;</pre></body></html>

You can see two things in this markup:

The HTML charset is UTF-8.
The HTML entities are encoded using Base64.

Because Base64 is ASCII, I expected to get the decoded HTML from the library by doing something like:

const decodedHtml = Buffer.from(iconv.decode(buffer, 'ascii'))

and then I use the output as input for decoding again into the target encoding, in this case UTF-8:

iconv.decode(decodedHtml, 'utf-8')

The order is important; if I do the ASCII conversion as the final step, the output is not what I expect.

What worries me is that doing the ASCII conversion first could also decode something that belongs to the target charset.

So I want to ask: do you think this is a good workflow, or should I delegate to a library dedicated to HTML entities, such as he?

ashtuchkin commented 6 years ago

Hey Kiko, glad you like the library. I'm actually not really sure what "base64 html entities" you're talking about. There are regular HTML entities, but they are not base64 encoded; they use either a name or a number between "&" and ";", e.g.:

HTML Entity	What it means	Decoded value
`&lt;`	"less than" character	`<`
`&gt;`	"greater than" character	`>`
`&copy;`	"copy" character	`©`
`&#8224;`	Unicode character number 8224	`†`

All HTML entities are ASCII, which is a proper subset of UTF-8, so there shouldn't be a problem decoding in either order. I'd still recommend decoding UTF-8 first to keep a clear process and mental model:

You usually start with bytes that you download from the website, plus encoding. Bytes in JS are represented by Buffer or Uint8Array. If you're getting a string from the download process - it has already been decoded and you need to fix that (likely by providing "encoding: null" or something).
Then, you decode these bytes using iconv.decode and get a JS string. A JS string contains Unicode characters, and you can work with it using all the usual JS operations like search/replace, regex, etc.
HTML Entities are a higher-level concept, so if you want to "decode" them, you'll have to use a different library or regexps on the JS string you got in the previous step. Note, decoding them would likely mess up the HTML structure, so I'd recommend doing this after fully parsing HTML and creating a DOM tree.
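
The three steps above can be sketched roughly as follows. This is a minimal illustration, not iconv-lite's own code: Node's built-in TextDecoder stands in for iconv.decode so the snippet runs without dependencies, and decodeEntities is a hypothetical helper that knows only a handful of named entities. In real code a dedicated library such as he, applied to the text of parsed DOM nodes, would be the safer choice.

```javascript
// Step 1: raw bytes, as they would arrive from a download (built by hand here).
// Assume this is the text content of one DOM node, sent as UTF-8.
const bytes = Buffer.from('Gr\u00fc\u00dfe &lt;b&gt; &#8224; &copy; &amp;lt;', 'utf8');

// Step 2: charset decoding, bytes -> JS string.
// TextDecoder is a stand-in; with iconv-lite this would be iconv.decode(bytes, 'utf-8').
const text = new TextDecoder('utf-8').decode(bytes);

// Step 3: entity decoding, string -> string.
// Hypothetical minimal table; a real library covers the full list of named entities.
const NAMED = { lt: '<', gt: '>', amp: '&', quot: '"', copy: '\u00a9' };

function decodeEntities(input) {
  // A single pass, so "&amp;lt;" becomes "&lt;" and is not decoded a second time.
  return input.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown names are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

const decoded = decodeEntities(text);
console.log(decoded);
```

Note that the charset step runs once, on bytes, and the entity step runs afterwards on the resulting string; neither step needs an extra ASCII pass.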

rejas commented 3 years ago

Hi there, we have a possibly similar issue open in the MagicMirror repo: https://github.com/MichMich/MagicMirror/issues/2712

The problem is that the input contains HTML entities like "&ouml;" for the German "ö", which aren't decoded by iconv-lite. If I understand this issue and your reasoning correctly, @ashtuchkin, then the MagicMirror code should handle those characters, since they are not part of the encoding itself?

ashtuchkin commented 3 years ago

Yeah decoding html entities is outside of iconv-lite scope. Maybe there's another library that can do that?
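
A small sketch of why a charset decoder cannot resolve an entity, assuming plain Node.js built-ins (Buffer and TextDecoder) rather than iconv-lite itself: the entity is just ASCII text, so its bytes are identical in every ASCII-compatible charset and decode back to the same six characters, while the literal character is what actually depends on the charset.

```javascript
// The entity is six ASCII characters, so its bytes do not depend on the charset.
const entity = '&ouml;';
const entityUtf8 = Buffer.from(entity, 'utf8');
const entityLatin1 = Buffer.from(entity, 'latin1');
const sameBytes = entityUtf8.equals(entityLatin1);

// Charset decoding hands the entity text back unchanged.
const decodedEntity = new TextDecoder('utf-8').decode(entityUtf8);

// The literal character is different: its bytes change with the charset.
const literalUtf8 = Buffer.from('\u00f6', 'utf8');     // two bytes: c3 b6
const literalLatin1 = Buffer.from('\u00f6', 'latin1'); // one byte: f6

console.log(sameBytes, decodedEntity);
console.log(literalUtf8.toString('hex'), literalLatin1.toString('hex'));
```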

rejas commented 3 years ago

Thx for the clarification (and of course your library). As it turned out, the Nunjucks templating we use was the culprit for our issue :-)

ashtuchkin / iconv-lite

[Question] How to decode more than one encoding #207