Letractively / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Error crawling URLs containing non-Latin-1 characters: reported as having fatal errors #21

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. svn up
2. install harvestman
3. run: "harvestman -C config.xml"

What is the expected output? What do you see instead?

Expected: all links should be parsed.

Actual: URLs containing non-Latin-1 characters are reported as "broken links":

[21:45:44] HarvestMan mirror oprea completed in 59.18 seconds.
[21:45:44] 300 links scanned in 1 server .
[21:45:44] 167 files written.
[21:45:44] 18 links had fatal errors and failed to download.
[21:45:44] 12693119  bytes received at the rate of 214.483 KB/sec .
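
To illustrate the kind of failure that non-Latin-1 characters can trigger, here is a minimal sketch written for Python 3. The URL and the helper name can_encode are hypothetical and are not taken from HarvestMan; the sketch only shows that such a URL cannot be represented in ASCII or Latin-1 but can be in UTF-8, and it does not claim to reproduce the exact code path inside the crawler.

```python
# Hypothetical URL for illustration; U+0219 (s with comma below) is
# outside the Latin-1 range (U+0000 to U+00FF).
url = "http://example.com/\u0219tiri.html"


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("ascii", "latin-1", "utf-8"):
    print(encoding, can_encode(url, encoding))
```

Only the UTF-8 attempt succeeds, which is consistent with the report that switching the default encoding makes the links downloadable.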

What version of the product are you using? On what operating system?

version 83, Ubuntu 8.04, x86_64

Please provide any additional information below.

The links are not reported as broken if I use a separate Python file
that begins with:

import sys
reload(sys)  # Python 2: restores sys.setdefaultencoding
sys.setdefaultencoding("utf8")

The file is a modified version of htmlcrawler.py from apps/samples.

I suggest using "utf-8" as the default encoding for the entire crawler to
address all the problems at once.
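
As a point of comparison with changing the process-wide default, a crawler could also percent-encode each URL explicitly as UTF-8 before requesting it. The sketch below is an assumption about how that might look in Python 3, not what HarvestMan does; percent_encode_url is a hypothetical helper, the example URL is made up, and the host name is assumed to be plain ASCII (internationalized domain names would need separate handling).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url):
    """Percent-encode non-ASCII characters in the path and query as UTF-8.

    Hypothetical helper: assumes an ASCII host name and leaves
    existing %XX escapes untouched.
    """
    parts = urlsplit(url)
    # quote() encodes to UTF-8 by default; keep "/" and "%" literal in
    # the path, and "=", "&" and "%" literal in the query string.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(percent_encode_url("http://example.com/\u0219tiri.html?q=\u0103"))
# prints http://example.com/%C8%99tiri.html?q=%C4%83
```

This keeps the encoding decision local to URL handling, whereas a global default affects every implicit string conversion in the process.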

Original issue reported on code.google.com by andrei.p...@gmail.com on 22 Jul 2008 at 6:49

GoogleCodeExporter commented 9 years ago
I forgot to attach the config.xml file; see the attachment.

Original comment by andrei.p...@gmail.com on 22 Jul 2008 at 6:51

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:25

GoogleCodeExporter commented 9 years ago
Which website is causing the problem? Could you provide a sample URL that is
being marked as broken?
Thanks,
Lucas

Original comment by szybal...@gmail.com on 7 Oct 2008 at 2:36

GoogleCodeExporter commented 9 years ago
This is fixed in revision 146. I have added the following lines to the
set_system_params function in config.py:

reload(sys)
sys.setdefaultencoding('utf8')

The config file for this issue has been added as
harvestman/bugs/config-issue21.xml.

Apparently the site module removes the "setdefaultencoding" attribute from "sys"
during Python startup, so the sys module has to be reloaded before the function
can be called.
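
The reload workaround can be wrapped in a guard so that it only runs when needed. The following is a sketch under stated assumptions, not the code committed in revision 146: ensure_utf8_default_encoding is a hypothetical name, and the reload branch applies to Python 2 only. Python 3 has no sys.setdefaultencoding and already defaults to UTF-8, so there the function returns early.

```python
import sys


def ensure_utf8_default_encoding():
    """Make UTF-8 the default string encoding if it is not already.

    Hypothetical helper. On Python 3 the default is always UTF-8, so
    the Python 2 branch below is never reached.
    """
    current = sys.getdefaultencoding()
    if current.lower().replace("-", "") == "utf8":
        return current

    # Python 2 only: site.py deletes sys.setdefaultencoding at startup,
    # and reloading the sys module brings the attribute back.
    if not hasattr(sys, "setdefaultencoding"):
        reload(sys)  # builtin in Python 2
    sys.setdefaultencoding("utf8")
    return sys.getdefaultencoding()


print(ensure_utf8_default_encoding())
```

Calling it once at startup, before any URL handling, mirrors where the fix was placed in the configuration setup.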

Thanks for reporting the bug, andrei!

Original comment by abpil...@gmail.com on 11 Oct 2008 at 9:08