antoine-tran / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Encoding problem (input is interpreted as Latin-1) #23

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Apply boilerpipe-1.1.0 (ArticleExtractor) to a page that has no explicit 
'charset=' meta declaration, e.g. 
http://www.slobodnadalmacija.hr/Zadar/tabid/73/articleType/ArticleView/articleId/140666/Default.aspx

What is the expected output? What do you see instead?
Expected: When the input provides no encoding information, non-ASCII 
characters are read and written as UTF-8, the most general and most widely 
used character set. 
Instead: Non-ASCII characters are misinterpreted as Latin-1 when read in and 
are then written out as UTF-8.
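
The symptom described above can be reproduced without boilerpipe at all. The following is a minimal sketch using only the Java standard library; the class name `MojibakeDemo` and the sample word are made up for illustration. It encodes a string as UTF-8, decodes the bytes as ISO-8859-1 (Latin-1), and shows that the result grows when written back out as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decodes UTF-8 bytes with the wrong charset (Latin-1), as in the report.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u010D" is the letter c with caron; escaped so the source file
        // encoding cannot interfere with the demonstration.
        String original = "\u010Dlanak";
        String garbled = misread(original);

        System.out.println(original.length());                                // 6
        System.out.println(garbled.length());                                 // 7
        System.out.println(garbled.getBytes(StandardCharsets.UTF_8).length);  // 9
    }
}
```

The single non-ASCII letter occupies two bytes in UTF-8; read as Latin-1 it becomes two separate characters, each of which is again two bytes when re-encoded as UTF-8.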

What version of the product are you using? On what operating system?
boilerpipe 1.1.0 on Ubuntu Linux 10.04 (locale: en_US.utf8)

Please provide any additional information below.
The web interface version appears to handle this page correctly (cf. the URL 
above), so it should be an easy thing to fix.

Original issue reported on code.google.com by tonio.wa...@gmail.com on 14 Jun 2011 at 4:14

GoogleCodeExporter commented 9 years ago
Relying on UTF-8 as the default would be plain wrong.

According to RFC 2616 (HTTP/1.1, section 3.7.1), ISO-8859-1 is the default 
charset for text media types when the sender gives no charset parameter.
We're already relaxing it to Windows-1252 (Cp1252). 

If you need to change the default encoding for your setup, simply adjust the 
following line in the HTMLFetcher class:
        Charset cs = Charset.forName("Cp1252");
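
As an illustration of the kind of fallback being discussed (this is not the HTMLFetcher source, and the class and method names below are hypothetical), a charset could be taken from the Content-Type header when one is declared and from a caller-supplied default otherwise:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetResolver {
    // Matches a charset parameter such as: text/html; charset=UTF-8
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or the fallback if none is declared
    // or the declared name is illegal or unsupported on this JVM.
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Declared charset wins over the fallback.
        System.out.println(resolve("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        // No charset declared: the caller's default is used.
        System.out.println(resolve("text/html", StandardCharsets.UTF_8));
    }
}
```

Making the fallback a parameter keeps the RFC default available while letting a setup that mostly sees UTF-8 pages choose differently.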

Original comment by ckkohl79 on 7 Jul 2011 at 1:45

GoogleCodeExporter commented 9 years ago
I understand that defaulting to UTF-8 could be wrong. 
However, when the source of 
http://www.buddymedia.com/newsroom/2011/06/hearst-magazines-digital-media-partners-with-buddy-media-to-launch-a-scalable-social-platform-on-facebook-for-thirteen-hearst-brands/#more-10378 
is passed as an HTML string 'a' to ast.INSTANCE.getText(a) (where ast is an 
ArticleExtractor object), the same problem occurs: the input seems to be 
interpreted as Latin-1. How can that be fixed, or how can I make it default 
to UTF-8? 

Thanks.

Original comment by amita...@gmail.com on 2 Aug 2011 at 3:28
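
On the last question: a Java String is already decoded text, so when HTML is passed in as a String, the charset is chosen wherever the raw bytes are turned into that String. Assuming the extractor does no further charset handling on String input, one common heuristic is to try a strict UTF-8 decode first and fall back to a single-byte charset only if the bytes are not valid UTF-8. The sketch below uses only the standard library; `Utf8FirstDecoder` is a hypothetical helper, not part of boilerpipe.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8FirstDecoder {
    // Decodes raw bytes as UTF-8 if they are well-formed UTF-8,
    // otherwise decodes them with the given fallback charset.
    static String decode(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume a legacy single-byte encoding.
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        // Charset.forName("Cp1252") could be passed instead to mirror the
        // default mentioned earlier in this thread.
        Charset fallback = StandardCharsets.ISO_8859_1;

        byte[] utf8Bytes = "\u0160ibenik".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        String html1 = decode(utf8Bytes, fallback);   // decoded as UTF-8
        String html2 = decode(latin1Bytes, fallback); // falls back to Latin-1

        System.out.println(html1.equals("\u0160ibenik"));
        System.out.println(html2.equals("caf\u00E9"));
    }
}
```

The decoded string can then be handed to the extractor. The heuristic works because non-ASCII text in a single-byte encoding is rarely well-formed UTF-8, but it is still a guess, and pure-ASCII input decodes identically either way.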