Open jamal2362 opened 3 years ago
I don't think it's a Readability4J issue; rather, you have to wrap the output in a structure like this to set the encoding to UTF-8 (see https://github.com/dankito/Readability4J/issues/2):
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<!-- output here -->
</body>
</html>
This is exactly what article.getContentWithUtf8Encoding() does. Does it work for you?
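For illustration, the wrapping described above can be sketched by hand. This is a minimal sketch, not Readability4J's actual implementation; `wrapInUtf8Html` is a hypothetical helper name.

```kotlin
// Hypothetical helper (not part of Readability4J): wraps extracted article
// content in a minimal HTML document that declares UTF-8 as its encoding.
fun wrapInUtf8Html(content: String): String =
    listOf(
        "<html>",
        "<head>",
        "<meta charset=\"utf-8\" />",
        "</head>",
        "<body>",
        content,
        "</body>",
        "</html>"
    ).joinToString("\n")
```

The meta tag only tells the viewer how to interpret the document; it cannot repair characters that were already decoded incorrectly before they reached the library.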
Hi, yes, I'm using article.getContentWithUtf8Encoding() in my code.
I have only noticed this strange issue with Google so far. Other pages work fine with äöü & co.
@jamal2362 Is it possible that the website uses a charset other than UTF-8 and you don't take that into account when creating your stringBuffer?
You're right, article.getContentWithUtf8Encoding() didn't take the document's charset into account.
I have now added the method article.getContentWithDocumentsCharsetOrUtf8(), which does exactly that.
But I don't think that will resolve @jamal2362's issue, as the document above, google.de, already has its charset set to UTF-8.
Try version 1.0.8 to see whether it solves your issue, but I think the issue lies somewhere else.
@dankito My apologies, my question was aimed at @jamal2362, sorry if that wasn't clear. I don't think your library does anything wrong. I think the String that's being passed to your library is already wrong, because the code creating the String doesn't check the website encoding.
The same thing actually happened to me and I thought for a while that Readability4J was malfunctioning before realizing it was my own fault :-)
@dankito Thank you for your work! Unfortunately, this did not help. Am I doing something wrong in my code? Do you also see the problem with "google.de"?
@michaldvorak79 What does that mean exactly? What should I change?
@jamal2362 What I mean is this: when you download a web page, you get a byte array, right? But Readability4J requires a String, so you have to convert the byte array to a String. For that you need to know the web page's character encoding (or "charset"): UTF-8, Windows-1252, ISO-8859-1, or something else. You have to tell Java which character encoding the byte array uses, otherwise the String will not be created correctly. For example, if a web page uses the ISO encoding and you convert it into a String using the UTF-8 encoding, regular English characters survive (they are encoded the same way in both), but special characters get mangled.
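The effect can be reproduced in a few lines. This is a minimal sketch: `decodeHtml` and `demoMangling` are hypothetical names, and the ISO-8859-1 bytes stand in for a page downloaded from a site that uses that encoding.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: turn downloaded bytes into a String using an
// explicitly chosen charset instead of relying on a default.
fun decodeHtml(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

// Returns the same bytes decoded once correctly and once incorrectly.
fun demoMangling(): Pair<String, String> {
    // Simulate a page body that the server sent as ISO-8859-1.
    val isoBytes = "Grüße".toByteArray(Charsets.ISO_8859_1)
    val correct = decodeHtml(isoBytes, Charsets.ISO_8859_1)
    // Wrong charset: the ASCII letters survive, the umlaut and ß do not.
    val mangled = decodeHtml(isoBytes, Charsets.UTF_8)
    return correct to mangled
}
```

In the mangled result, the bytes that are not valid UTF-8 are replaced with the Unicode replacement character U+FFFD, which is why the damage cannot be undone afterwards.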
The charset can normally be obtained from the HTTP response headers, or it is included in a <meta> tag in the HTML code.
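As a rough sketch of that lookup, the following tries the Content-Type header first, then a charset declaration near the top of the HTML, and only then falls back to UTF-8. `findCharset` and `detectCharset` are hypothetical names, and the regular expression is a simplification of what browsers actually do.

```kotlin
import java.nio.charset.Charset

// Matches e.g. `charset=ISO-8859-1` in a header or `charset="utf-8"` in a <meta> tag.
private val charsetPattern = Regex("""charset\s*=\s*["']?([\w.:-]+)""", RegexOption.IGNORE_CASE)

// Hypothetical helper: returns the first charset named in the text, or null
// if none is declared or the JVM does not know the name.
fun findCharset(text: String?): Charset? {
    val name = text?.let { charsetPattern.find(it) }?.groupValues?.get(1) ?: return null
    return runCatching { Charset.forName(name) }.getOrNull()
}

// Hypothetical helper: header first, then the start of the markup, then UTF-8.
fun detectCharset(contentTypeHeader: String?, bytes: ByteArray): Charset {
    // ISO-8859-1 maps every byte to a character, so it is safe for peeking
    // at the ASCII markup before the real charset is known.
    val head = String(bytes, 0, minOf(bytes.size, 1024), Charsets.ISO_8859_1)
    return findCharset(contentTypeHeader) ?: findCharset(head) ?: Charsets.UTF_8
}
```

The detected charset would then be passed to the byte-array-to-String conversion before the result is handed to Readability4J.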
I don't know exactly what your code looks like or how you obtain the data in your stringBuffer, but my theory is that you may always be creating the data in the stringBuffer as UTF-8, while the websites that give you trouble actually use a different character encoding.
You can check your htmlData variable after you create it and see whether it contains the proper special characters or whether they are already mangled. If the special characters are good in your htmlData and bad in Readability4J's output, then the library is doing something wrong. If the characters are already mangled in htmlData, then you are using the wrong character encoding when turning the byte array into a String.
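One way to run that check in code is a quick heuristic like the one below. This is only a sketch: `looksMangled` is a hypothetical helper, and it only knows the two most common symptoms for German umlauts.

```kotlin
// Hypothetical diagnostic: returns true if the text shows typical signs of
// having been decoded with the wrong charset.
fun looksMangled(text: String): Boolean {
    // U+FFFD appears when bytes were not valid in the charset used for decoding
    // (for example ISO-8859-1 bytes decoded as UTF-8).
    val replacementChar = '\uFFFD'
    // These pairs appear when UTF-8 bytes for ä, ö, ü were decoded as
    // ISO-8859-1 or Windows-1252.
    val mojibakeMarkers = listOf("Ã¤", "Ã¶", "Ã¼")
    return replacementChar in text || mojibakeMarkers.any { it in text }
}
```

Calling it on htmlData before passing it to Readability4J, and again on the library's output, would show at which step the characters break.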
Can you post the code you use to download the web page's HTML, Jamal?
Maybe this code helps you:
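val uri = "https://google.de" // set your URL here
// Jsoup downloads the page and decodes it with the charset it detects (falling back to UTF-8)
val document = Jsoup.parse(URL(uri), 10000) // timeout in milliseconds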
val readability = Readability4JExtended(uri, document.outerHtml())
val article = readability.parse()
Characters like äüö are output incorrectly on some websites. These characters are common in German. They do not occur in English, so the problem does not show up there.
Here is a picture of how this looks on Google.
Here is a screenshot where äüö are displayed without problems.