SpiderCron is still failing

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run the cron job that harvests links from various BBSes.
2. The failure does not reproduce every time, because some BBS systems go down intermittently.
3. Most failures are "deadline exceeded" errors; a few are aborted requests.

What is the expected output? What do you see instead?
No more cron failures; links are updated on time.
Currently 29% of cron runs fail.

Please use labels and text to provide additional information.
Find an algorithm to test the health of the target server, and harvest links 
accordingly.
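
One possible approach, sketched here under assumptions rather than taken from the project: track consecutive failures per school and skip a site for an exponentially growing interval after each failure, so a BBS that is down does not eat the cron deadline on every run. The class name `SiteHealth`, its method names, and the delay values are all hypothetical. The state is held in memory here; on App Engine it would have to be persisted somewhere (datastore or memcache, for example).

```python
import time


class SiteHealth(object):
    """Hypothetical helper: per-site failure counts with exponential backoff."""

    def __init__(self, base_delay=600, max_delay=86400):
        self.base_delay = base_delay  # seconds to wait after the first failure
        self.max_delay = max_delay    # upper bound on the wait
        self.failures = {}            # school -> consecutive failure count
        self.next_try = {}            # school -> earliest time of next attempt

    def should_fetch(self, school, now=None):
        """Return True if the site is not currently being skipped."""
        now = time.time() if now is None else now
        return now >= self.next_try.get(school, 0)

    def record_success(self, school):
        """A successful fetch clears the failure history for the site."""
        self.failures.pop(school, None)
        self.next_try.pop(school, None)

    def record_failure(self, school, now=None):
        """Double the wait on each consecutive failure, capped at max_delay."""
        now = time.time() if now is None else now
        count = self.failures.get(school, 0) + 1
        self.failures[school] = count
        delay = min(self.base_delay * 2 ** (count - 1), self.max_delay)
        self.next_try[school] = now + delay
        return delay
```

The cron handler would call `should_fetch(school)` before opening each URL, then `record_success` or `record_failure` depending on the outcome. Sites that stay skipped for a long time are also candidates for the "unavailable" report.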

Original issue reported on code.google.com by zinking3 on 14 Oct 2010 at 2:01

GoogleCodeExporter commented 9 years ago

Failed to open following url http://www.lilacbbs.com/rssi.php?h=1 of school: hit
Failed to open following url http://bbs.ruc.edu.cn/wForum/topten.php of school: ruc
failed to parse bbs by RE parser; schoolname= fudan
failed to parse required content SITE structure changed; schoolname= fudan
Failed to open following url http://bbs.tju.edu.cn/TJUBBSFPKEHPUMNSALVFGWTYHVMRLBXCBIYPKFA_A/bbstop10 of school: tju
Failed to open following url http://bbs.whu.edu.cn/rssi.php?h=1 of school: whu
DEADLINE EXCEEDED

Failed to open following url http://www.lilacbbs.com/rssi.php?h=1 of school: hit
Failed to open following url http://bbs.ruc.edu.cn/wForum/topten.php of school: ruc
failed to parse bbs by RE parser; schoolname= fudan
failed to parse required content SITE structure changed; schoolname= fudan
Failed to open following url http://bbs.tju.edu.cn/TJUBBSFPKEHPUMNSALVFGWTYHVMRLBXCBIYPKFA_A/bbstop10 of school: tju
Failed to open following url http://proxy3.zju88.net/agent/top10.do of school: zju
Failed to open following url http://bbs.whu.edu.cn/rssi.php?h=1 of school: whu
DEADLINE EXCEEDED

Original comment by zinking3 on 19 Oct 2010 at 5:35

GoogleCodeExporter commented 9 years ago

Fixed the DOM changes for FDU and HIT.
# New problems this brought up
1. The scraped information seems to be getting more structured; the info 
structure should be improved, restructured, or abstracted.
--> Brainstormed: storing this information in a serialized dict would be a 
better choice; missing keys could be restored from default values.
--> To be integrated when the site framework is redone, maybe in the next two 
versions.
2. The parser engine should be improved to report site structure changes 
(structure changes only) and sites that are unavailable for a long time.
!important.
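
A minimal sketch of the serialized-dict idea from point 1, assuming JSON as the serialization format. The field names in `DEFAULTS` and the helper names `dump_item` and `load_item` are hypothetical, not taken from the project. Loading starts from a copy of the defaults and overlays whatever was stored, so records written before a field existed still come back complete.

```python
import json

# Hypothetical schema: every key a scraped item may carry, with its default.
DEFAULTS = {
    "title": "",
    "url": "",
    "author": "",
    "board": "",
    "replies": 0,
}


def dump_item(item):
    """Serialize a scraped item (a plain dict) to a JSON string."""
    return json.dumps(item, sort_keys=True)


def load_item(blob):
    """Deserialize an item, filling any missing keys from DEFAULTS."""
    item = dict(DEFAULTS)          # copy, so DEFAULTS itself is never mutated
    item.update(json.loads(blob))  # stored values override the defaults
    return item
```

With this layout, adding a new field for one BBS only needs a new entry in `DEFAULTS`; parsers for the other sites can keep emitting the keys they already know.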

Original comment by zinking3 on 19 Oct 2010 at 12:38

Changed state: Started

Attachments:

FIX-2010-10-19.txt
