abhishekbhalani / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Encoding issue #123

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl a site that uses Windows-1251 encoding (for example, applico.ru)
2. In the ShouldCrawlPageLinks event, read crawledPage.RawContent or crawledPage.HtmlDocument.DocumentNode.OuterHtml
3. Strings contain many "??????" substrings

What is the expected output? What do you see instead?
The strings should contain the correctly decoded characters. Instead, they 
contain "?" characters.

What version of the product are you using? On what operating system?
Version Abot v1.1.1.0, 2012

Please provide any additional information below.
The issue can be resolved by changing the PageRequester.GetRawHtml method:
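            try
            {
                // Decode the body with the charset declared in the HTTP response
                // (for example windows-1251).
                Encoding encoding = Encoding.GetEncoding(response.CharacterSet);
                using (StreamReader sr = new StreamReader(response.GetResponseStream(), encoding))
                {
                    // The using block disposes the reader, so no explicit Close() is needed.
                    rawHtml = sr.ReadToEnd();
                }
            }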
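
One caveat with the change above: Encoding.GetEncoding throws an ArgumentException when the charset name is not recognized by the runtime. A minimal sketch of a more defensive variant follows. The class CharsetDecoding and the methods ResolveEncoding and ReadAll are hypothetical names used for illustration only and are not part of Abot, and falling back to UTF-8 is an assumption, not something the fix in issue 112 is known to do.

```csharp
using System;
using System.IO;
using System.Text;

public static class CharsetDecoding
{
    // Hypothetical helper, not part of Abot: map a charset name to an Encoding,
    // falling back to UTF-8 (an assumed default) when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            // Some servers quote the value, e.g. charset="utf-8", so strip quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Thrown when the runtime does not recognize the charset name.
            return Encoding.UTF8;
        }
    }

    // Hypothetical helper: read the whole stream as text using the resolved encoding.
    public static string ReadAll(Stream body, string charset)
    {
        using (var reader = new StreamReader(body, ResolveEncoding(charset)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Inside GetRawHtml this would be called as CharsetDecoding.ReadAll(response.GetResponseStream(), response.CharacterSet). Note that on .NET Core and later, windows-1251 is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without that, this sketch silently falls back to UTF-8 for such pages.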

Original issue reported on code.google.com by elisy....@gmail.com on 3 Jan 2014 at 7:34

GoogleCodeExporter commented 9 years ago
This was fixed as part of issue 112 in abot v1.2.3, which has not been 
released yet. Merging with fixed issue 112. 

Commit: 
https://github.com/sjdirect/abot/commit/693274ced34b20b6ba7d3642734df6852a546ecd

File Added:
https://github.com/sjdirect/abot/blob/693274ced34b20b6ba7d3642734df6852a546ecd/Abot/Core/WebContentExtracter.cs

However, thank you for reporting this bug.

Original comment by sjdir...@gmail.com on 3 Jan 2014 at 3:06