dankito / Readability4J

A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.
Apache License 2.0

[Bug] Characters like äüö are output incorrectly #19

Open jamal2362 opened 3 years ago

jamal2362 commented 3 years ago

Characters like äüö are output incorrectly on some websites. These characters are common in German. The problem does not occur with English text.

Here is a screenshot of how this looks on Google: Screenshot_20210627-012434

Here is a screenshot of a page where äüö are displayed without problems: Screenshot_20210627-013445

dankito commented 3 years ago

I don't think it's a Readability4J issue. I think you have to wrap the output in a structure like this to set the encoding to UTF-8 (see https://github.com/dankito/Readability4J/issues/2):

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

This is exactly what article.getContentWithUtf8Encoding() does. Does it work for you?
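For illustration, the wrapping described above could be sketched in Kotlin roughly like this. `wrapInUtf8Html` is a hypothetical helper written for this example, not Readability4J's actual implementation; it only shows the idea of declaring the charset around the extracted content.

```kotlin
// Hypothetical helper (not part of Readability4J): wraps extracted content
// in a minimal HTML document that declares UTF-8 as its charset.
fun wrapInUtf8Html(content: String): String =
    "<html>\n" +
        " <head>\n" +
        "  <meta charset=\"utf-8\" />\n" +
        " </head>\n" +
        " <body>\n" +
        content + "\n" +
        " </body>\n" +
        "</html>"

fun main() {
    // Prints the wrapped document; the content itself is passed through unchanged.
    println(wrapInUtf8Html("<p>äöü</p>"))
}
```

The `<meta>` tag only tells the renderer how to interpret the document. It cannot repair text that was already decoded with the wrong charset before it reached the library.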

jamal2362 commented 3 years ago

Hi, yes, I'm using article.getContentWithUtf8Encoding() in my code.

I have only noticed this strange issue with Google so far. Other pages work fine with äöü and similar characters.

Screenshot_20210627-234916__01

michaldvorak79 commented 3 years ago

@jamal2362 Is it possible the website uses a charset other than UTF-8 and you don't take that into account when creating your stringBuffer?

dankito commented 3 years ago

You're right, article.getContentWithUtf8Encoding() didn't take the document's charset into account.

I have now added the method article.getContentWithDocumentsCharsetOrUtf8(), which does exactly that.

But I don't think that will resolve @jamal2362's issue, as the document above, google.de, already has its charset set to UTF-8.

Try version 1.0.8 to see if it solves your issue, but I think the issue lies somewhere else.

michaldvorak79 commented 3 years ago

@dankito My apologies, my question was aimed at @jamal2362, sorry if that wasn't clear. I don't think your library does anything wrong. I think the String that's being passed to your library is already wrong, because the code creating the String doesn't check the website encoding.

The same thing actually happened to me and I thought for a while that Readability4J was malfunctioning before realizing it was my own fault :-)

jamal2362 commented 3 years ago

@dankito Thank you for your work! Unfortunately, this did not help. Am I doing something wrong in my code? Do you also see the problem with "google.de"?

@michaldvorak79 What does that mean exactly? What should I change?

michaldvorak79 commented 3 years ago

@jamal2362 What I mean is this: when you download a web page, you get a byte array, right? But Readability4J requires a String, so you have to convert the byte array to a String. For that you need to know the web page's character encoding (or "charset"): UTF-8, Windows-1252, ISO-8859-1, or whatever it is. You have to tell Java which character encoding the byte array uses, otherwise the String will not be created correctly. For example, if a web page uses ISO-8859-1 and you convert it into a String using UTF-8, regular English characters survive (they are encoded identically in both), but special characters get mangled.
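A minimal sketch of that effect, using only the standard library. The `decode` function is a hypothetical name for this example; the bytes are created in memory to simulate a downloaded page.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: turns raw bytes into a String using the given charset.
fun decode(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

fun main() {
    val original = "äöü"

    // Simulate a page served as ISO-8859-1: one byte per character.
    val isoBytes = original.toByteArray(Charsets.ISO_8859_1)

    // Decoding with the matching charset restores the text.
    println(decode(isoBytes, Charsets.ISO_8859_1) == original) // true

    // Decoding the same bytes as UTF-8 does not: they are not valid UTF-8,
    // so the JVM substitutes the replacement character U+FFFD.
    println(decode(isoBytes, Charsets.UTF_8) == original) // false

    // The reverse mistake: UTF-8 bytes read as ISO-8859-1 turn every
    // umlaut into two characters.
    val utf8Bytes = original.toByteArray(Charsets.UTF_8)
    println(decode(utf8Bytes, Charsets.ISO_8859_1).length) // 6
}
```

Plain ASCII text passes through all of these conversions unchanged, which would explain why English pages never show the problem.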

Charset can normally be obtained from the response HTTP headers or it's included in a <meta> tag in the HTML code.
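As a rough sketch of the header route, the function below pulls the charset out of a Content-Type header value. `charsetFromContentType` is a hypothetical helper, and falling back to UTF-8 when nothing usable is declared is an assumption made for this example.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: extracts the charset from a Content-Type header value
// such as "text/html; charset=ISO-8859-1".
// Assumption: fall back to UTF-8 when no charset is declared or the JVM
// does not know the declared name.
fun charsetFromContentType(contentType: String?): Charset {
    val pattern = Regex("""charset\s*=\s*["']?([\w\-.:+]+)""", RegexOption.IGNORE_CASE)
    val name = contentType
        ?.let { pattern.find(it) }
        ?.groupValues?.get(1)
        ?: return Charsets.UTF_8
    return try {
        Charset.forName(name)
    } catch (e: IllegalArgumentException) {
        // Covers both illegal and unsupported charset names.
        Charsets.UTF_8
    }
}

fun main() {
    println(charsetFromContentType("text/html; charset=ISO-8859-1")) // ISO-8859-1
    println(charsetFromContentType("text/html"))                     // UTF-8
}
```

The resulting `Charset` would then be passed to the `String` constructor when converting the downloaded bytes. A page that declares its charset only in a `<meta>` tag is not covered by this sketch.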

I don't know exactly what your code looks like or how you obtain the data in your stringBuffer, but my theory is that you always create the data in the stringBuffer as UTF-8 and the websites that give you trouble actually use a different character encoding.

You can check your htmlData variable after you create it and see whether it contains the proper special characters or whether they are already mangled. If the special characters are good in your htmlData and bad in Readability4J's output, then the library is doing something wrong. If the characters are already mangled in htmlData, then you are using the wrong character encoding when turning the byte array into a String.
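One way to automate that check is a small heuristic like the one below. `looksMangled` is a hypothetical helper, and the two patterns it looks for are assumptions about what typical decoding damage looks like; it can produce false positives and will miss other kinds of damage.

```kotlin
// Hypothetical heuristic: returns true if the text shows typical signs of
// having been decoded with the wrong charset.
fun looksMangled(text: String): Boolean =
    // U+FFFD appears when bytes were not valid in the charset used for decoding.
    text.contains('\uFFFD') ||
        // "Ã" (U+00C3) followed by a character in U+0080..U+00BF is the usual
        // result of reading UTF-8 umlauts as ISO-8859-1.
        Regex("\u00C3[\u0080-\u00BF]").containsMatchIn(text)

fun main() {
    println(looksMangled("äöü"))                                  // false
    println(looksMangled("\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC")) // true
}
```

Running this on htmlData before handing it to Readability4J would show on which side of the library the characters go wrong.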

codinux-gmbh commented 3 years ago

Can you post the code you use to download the web page's HTML, Jamal?

Maybe this code helps you:

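    import java.net.URL
    import net.dankito.readability4j.extended.Readability4JExtended
    import org.jsoup.Jsoup

    val uri = "https://google.de" // set your url here
    // Jsoup downloads the page (10 s timeout) and decodes it using the charset
    // declared in the HTTP headers or the <meta> tag
    val document = Jsoup.parse(URL(uri), 10000)
    val readability = Readability4JExtended(uri, document.outerHtml())

    val article = readability.parse()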