lusiaold closed this issue 6 years ago
It's hard to understand what exactly is going on here without looking at the actual strings/buffers. Could you provide a (preferably short) example of what exactly iconv-lite translates incorrectly? E.g. the input buffer, what's being returned, and what you expect to be returned.
Hello, thank you for the response, and sorry for the imprecision.
Here's one of the webpages that returns a bad character: https://apod.nasa.gov/apod/ap170813.html
Take a look at the copyright section, "Vojtech Rušin".
When I scrape the site using my API, it returns this:
{
"apod_site":"https://apod.nasa.gov/apod/ap170813.html",
"copyright":"Miloslav Druckmüller (Brno U. of Tech.), Martin Dietzel, Peter Aniol, Vojtech Ruin",
"date":"2017-08-13",
"description":"Only in the fleeting darkness of a total solar eclipse is the light of the solar corona easily visible. Normally overwhelmed by the bright solar disk, the expansive corona, the sun's outer atmosphere, is an alluring sight. But the subtle details and extreme ranges in the corona's brightness, although discernible to the eye, are notoriously difficult to photograph. Pictured here, however, using multiple images and digital processing, is a detailed image of the Sun's corona taken during the 2008 August total solar eclipse from Mongolia. Clearly visible are intricate layers and glowing caustics of an ever changing mixture of hot gas and magnetic fields. Bright looping prominences appear pink just above the Sun's limb. A similar solar corona might be visible through clear skies in a thin swath across the USA during a total solar eclipse that occurs just one week from tomorrow.",
"hdurl":"https://apod.nasa.gov/apod/image/1708/corona_druckmuller_1600.jpg",
"media_type":"image",
"title":"Detailed View of a Solar Eclipse Corona",
"url":"https://apod.nasa.gov/apod/image/1708/corona_druckmuller_960.jpg"
}
Notice the empty character or a box character instead of the š.
Hmm, AFAIK there's no š character in the ISO-8859-1 encoding. I think what happens here is that the web page actually uses UTF-8.
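The first claim can be checked in Node.js without any library: ISO-8859-1 maps each byte 0x00 to 0xFF straight to the code point with the same number, so nothing above U+00FF can come out of it. A minimal sketch, assuming Node's built-in 'latin1' Buffer encoding as a stand-in for a strict ISO-8859-1 decoder (latin1Repertoire is a made-up helper name, not part of iconv-lite):

```javascript
// Hypothetical helper: collect every character a strict ISO-8859-1
// decoder can produce, using Node's built-in 'latin1' Buffer encoding.
function latin1Repertoire() {
  const chars = new Set();
  for (let byte = 0; byte <= 0xff; byte++) {
    chars.add(Buffer.from([byte]).toString("latin1"));
  }
  return chars;
}

const repertoire = latin1Repertoire();
console.log(repertoire.has("\u00fc")); // ü (U+00FC): true
console.log(repertoire.has("\u0161")); // š (U+0161): false
```

So if a page displays š correctly, its bytes cannot really be ISO-8859-1, whatever the declared charset says.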
Well, that's weird, because the webpage itself is encoded in ISO-8859-1 🤨 (at least that's what Chrome tells me)
https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html
This validator gave me pretty interesting results about the real charset used on the NASA website:
Warning: Using windows-1252 instead of the declared encoding iso-8859-1.
When I get back home I'll check if it's working when I change iconv's decoding to windows-1252.
Yep, it's working now, so it was an issue with the website itself, not this library. Thanks for the help! 😃
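For anyone landing here with the same symptom: windows-1252 agrees with ISO-8859-1 everywhere except bytes 0x80 to 0x9F, where ISO-8859-1 has invisible C1 control characters and windows-1252 has printable ones such as š (0x9A). The sketch below illustrates that difference with a hand-written table for that range. decodeWindows1252 is an illustrative helper, not iconv-lite's implementation, and the sample bytes are an assumption about how the page stores the name. With iconv-lite itself, the fix in this thread amounts to asking it to decode as windows-1252 instead of ISO-8859-1.

```javascript
// Windows-1252 characters for bytes 0x80..0x9F, in order. The five bytes
// the code page leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
// filled with U+FFFD here; that choice is an assumption of this sketch.
const CP1252_HIGH =
  "\u20ac\ufffd\u201a\u0192\u201e\u2026\u2020\u2021" + // 0x80..0x87
  "\u02c6\u2030\u0160\u2039\u0152\ufffd\u017d\ufffd" + // 0x88..0x8F
  "\ufffd\u2018\u2019\u201c\u201d\u2022\u2013\u2014" + // 0x90..0x97
  "\u02dc\u2122\u0161\u203a\u0153\ufffd\u017e\u0178"; // 0x98..0x9F

// Illustrative helper (not iconv-lite's code): decode bytes as windows-1252.
function decodeWindows1252(bytes) {
  let out = "";
  for (const byte of bytes) {
    out += byte >= 0x80 && byte <= 0x9f
      ? CP1252_HIGH[byte - 0x80]
      : String.fromCharCode(byte); // identical to ISO-8859-1 elsewhere
  }
  return out;
}

// Assumed bytes for "Rušin": š stored as the single byte 0x9A.
const raw = Buffer.from([0x52, 0x75, 0x9a, 0x69, 0x6e]);

const asLatin1 = raw.toString("latin1"); // 0x9A becomes control char U+009A
const asCp1252 = decodeWindows1252(raw); // 0x9A becomes š (U+0161)

console.log(asLatin1.codePointAt(2).toString(16)); // 9a
console.log(asCp1252); // Rušin
```

The ü in "Druckmüller" is byte 0xFC in both encodings, which is why it decoded correctly all along while š did not.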
I'm using this piece of code to download a webpage (using the request library) and decode everything (using your iconv-lite library). The loader function finds some elements in the body of the website, then returns them as a JavaScript object.
The pages are encoded in ISO-8859-1, and the content looks normal on the website; there are no bad chars. When I wasn't using iconv-lite, some characters, e.g. ü, looked like this: �. Now that I'm using the library as in the code provided above, most of the chars look good, but some, e.g. š, are an empty box, even though they're displayed without any problems on the website.
I'm sure it's not cheerio's issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still present. If that's important: when I copied the empty box character into Google, it changed to š. Also, I tried to change the output charset of Express using res.charset, but that didn't help.
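A generic way to investigate an "empty box" like this is to print code points instead of glyphs. The sketch below uses a hypothetical helper, findC1Controls, that reports characters in the range U+0080 to U+009F. Those are invisible control characters, and finding them in scraped text is a common hint that bytes meant as windows-1252 were decoded as ISO-8859-1. The sample string is an assumption about what the API returned, not captured output.

```javascript
// Hypothetical helper: list C1 control characters (U+0080..U+009F) in a string.
function findC1Controls(text) {
  const hits = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x80 && code <= 0x9f) {
      hits.push({
        index: i,
        codePoint: "U+" + code.toString(16).toUpperCase().padStart(4, "0"),
      });
    }
  }
  return hits;
}

// Assumed sample: the name with U+009A where the empty box appears.
const suspicious = findC1Controls("Vojtech Ru\u009ain");
console.log(suspicious); // one hit: index 10, codePoint "U+009A"
```

A non-empty result does not prove the source encoding, but it narrows the problem to the decoding step, not to cheerio or Express.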