ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

This library doesn't decode everything properly #206

Closed lusiaold closed 6 years ago

lusiaold commented 6 years ago

I'm using this piece of code to download a webpage (using the request library) and decode it (using your iconv-lite library). The loader function finds some elements in the body of the page and returns them as a JavaScript object.

    request.get({url: url, encoding: null}, function(error, response, body) {
      // if the request succeeded and the page exists, process it; otherwise send a 'not found' error
      if (!error && response.statusCode === 200) {
        // decode the raw bytes using the charset the page declares
        body = iconv.decode(body, "iso-8859-1");
        const $ = cheerio.load(body);
        async function show() {
          var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
          res.send(JSON.stringify(data));
        }
        show();
      } else {
        res.status(404);
        res.send(JSON.stringify({"error":"No content for this date."}));
      }
    });

The pages are declared as ISO-8859-1, and in the browser the content looks normal, with no bad characters. When I wasn't using iconv-lite, some characters, e.g. ü, looked like this: �. Now that I'm using the library as in the code above, most characters look fine, but some, e.g. š, show up as an empty box, even though they're displayed without any problems on the website.

I'm sure it's not a cheerio issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still there. In case it matters, when I pasted the empty box character into Google, it changed to š. I also tried changing the charset of the Express response using res.charset, but that didn't help.

ashtuchkin commented 6 years ago

It's hard to understand what exactly is going on here without looking at the actual strings/buffers. Could you provide a (preferably short) example of what exactly iconv-lite translates incorrectly? E.g. the input buffer, what's being returned, and what you expect to be returned.

lusiaold commented 6 years ago

Hello, thank you for the response, and sorry for being imprecise.

So here's one of the webpages that returns a bad character: https://apod.nasa.gov/apod/ap170813.html

Take a look at the copyright section, "Vojtech Rušin".

When I scrape the site through my API, it returns this:

{
   "apod_site":"https://apod.nasa.gov/apod/ap170813.html",
   "copyright":"Miloslav Druckmüller (Brno U. of Tech.), Martin Dietzel, Peter Aniol, Vojtech Rušin",
   "date":"2017-08-13",
   "description":"Only in the fleeting darkness of a total solar eclipse is the light of the solar corona easily visible. Normally overwhelmed by the bright solar disk, the expansive corona, the sun's outer atmosphere, is an alluring sight. But the subtle details and extreme ranges in the corona's brightness, although discernible to the eye, are notoriously difficult to photograph. Pictured here, however, using multiple images and digital processing, is a detailed image of the Sun's corona taken during the 2008 August total solar eclipse from Mongolia. Clearly visible are intricate layers and glowing caustics of an ever changing mixture of hot gas and magnetic fields. Bright looping prominences appear pink just above the Sun's limb. A similar solar corona might be visible through clear skies in a thin swath across the USA during a total solar eclipse that occurs just one week from tomorrow.",
   "hdurl":"https://apod.nasa.gov/apod/image/1708/corona_druckmuller_1600.jpg",
   "media_type":"image",
   "title":"Detailed View of a Solar Eclipse Corona",
   "url":"https://apod.nasa.gov/apod/image/1708/corona_druckmuller_960.jpg"
}

Notice the empty or box character in place of the š.

ashtuchkin commented 6 years ago

Hmm, AFAIK there's no š character in ISO-8859-1 encoding. I think what happens here is that the web page actually uses utf-8.

lusiaold commented 6 years ago

Well, that's weird, because the webpage itself is encoded in ISO-8859-1 🤨 (at least that's what Chrome tells me)

lusiaold commented 6 years ago

https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html

This validator gave me some pretty interesting results about the real charset used on the NASA website:

Warning: Using windows-1252 instead of the declared encoding iso-8859-1.

When I get back home, I'll check whether it works if I change the iconv-lite decoding to windows-1252.

lusiaold commented 6 years ago

Yep, it's working now, so it was an issue with the website itself, not this library. Thanks for the help! 😃
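
For anyone landing here with the same symptom, the difference between the two decodings can be seen at the byte level. The snippet below is only a rough sketch using Node's built-in Buffer, not iconv-lite itself. It assumes the server sends š as the single byte 0x9a, which is consistent with the validator warning above. `CP1252_SAMPLE` and `decodeCp1252Sample` are hypothetical names, and the table covers only four bytes for illustration. In real code, the fix confirmed in this thread is simply to pass "windows-1252" to iconv.decode.

```javascript
// Bytes for "Rušin", assuming the page is really windows-1252 (0x9a is "š" there).
const bytes = Buffer.from([0x52, 0x75, 0x9a, 0x69, 0x6e]);

// Strict ISO-8859-1 maps every byte to the code point with the same value,
// so 0x9a becomes U+009A, an invisible C1 control character (the "empty box").
const asLatin1 = bytes.toString('latin1');

// Hypothetical, deliberately partial table: only four of the bytes in the
// 0x80-0x9f range, where windows-1252 differs from ISO-8859-1.
const CP1252_SAMPLE = { 0x8a: '\u0160', 0x9a: '\u0161', 0x8e: '\u017d', 0x9e: '\u017e' };

// Hypothetical helper: decode byte by byte, using the sample table where it
// applies and falling back to the ISO-8859-1 mapping everywhere else.
function decodeCp1252Sample(buf) {
  return Array.from(buf, (b) => CP1252_SAMPLE[b] ?? String.fromCharCode(b)).join('');
}

const asCp1252 = decodeCp1252Sample(bytes);

// JSON.stringify makes the control character's position visible in the output.
console.log(JSON.stringify(asLatin1));
console.log(asCp1252);
```

Browsers follow the same rule the validator reports: a page labelled iso-8859-1 is decoded as windows-1252, which is why the name looks correct on the website but not after a strict ISO-8859-1 decode.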