asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Can't crawl some websites? #209

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. controller.addSeed("http://forums.sgclub.com/");

What is the expected output? What do you see instead?
It returns only one result. How can I make it return more links? The default 
seed URL, http://www.ics.uci.edu/, works fine, but crawling a forum or another 
website does not seem to work.

What version of the product are you using?
3.5

Please provide any additional information below.
The only output I see is:

Docid: 1
URL: http://forums.sgclub.com/
Domain: 'sgclub.com'
Sub-domain: 'forums'
Path: '/'
Parent page: null
Anchor text: null
Text length: 14193
Html length: 149321
Number of outgoing links: 686
Response headers:
    Date: Thu, 21 Mar 2013 02:36:07 GMT
    Server: Apache/2.2.17 (Unix) mod_ssl/2.2.17 OpenSSL/0.9.8e-fips-rhel5 mod_bwlimited/1.4 mod_fcgid/2.3.6
    X-Powered-By: PHP/5.2.17
    Cache-Control: private
    Pragma: private
    X-UA-Compatible: IE=7
    Content-Type: text/html; charset=ISO-8859-1
    Transfer-Encoding: chunked
    Connection: Keep-Alive
    Set-Cookie: sgflastvisit=1363833392; expires=Fri, 21-Mar-2014 02:36:32 GMT; path=/; domain=.sgclub.com
    Set-Cookie: sgflastactivity=0; expires=Fri, 21-Mar-2014 02:36:32 GMT; path=/; domain=.sgclub.com
=============

Original issue reported on code.google.com by tanamo...@gmail.com on 21 Mar 2013 at 2:37

GoogleCodeExporter commented 9 years ago
The page you are trying to crawl is rather slow. Therefore you may see timeouts 
during crawling.

Original comment by acrocraw...@gmail.com on 26 Mar 2013 at 8:43

GoogleCodeExporter commented 9 years ago
Is there any method that can make it faster?

Original comment by careless...@gmail.com on 2 Dec 2013 at 8:44

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:36

GoogleCodeExporter commented 9 years ago
No method can make it faster; the speed depends on the server and on the 
geographic distance between the client and the server.

But leaving that aside, I am crawling those forums successfully.

No error - everything goes fine.

Please try again, and if the problem recurs we will try to solve it together.

Original comment by avrah...@gmail.com on 20 Aug 2014 at 1:48
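
Since the same forum crawls fine for one person and yields a single page for another, one configuration difference worth ruling out is the link filter. The report shows 686 outgoing links on the seed page but only one visited page, and the default seed http://www.ics.uci.edu/ works. That pattern would also appear if the crawler's shouldVisit check still only accepts links under the default seed's prefix. This is an assumption, not something confirmed in this thread. The sketch below is plain Java, not crawler4j code: `VisitFilterSketch`, `shouldVisit(String, String)`, `countVisitable`, the extension list, and the example forum paths are all hypothetical stand-ins that show how a prefix check of this kind behaves.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class VisitFilterSketch {
    // Hypothetical extension filter: skip links that point at static assets.
    private static final Pattern SKIPPED_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|gz)$");

    // Stand-in for a crawler's link filter: accept a link only if it is not a
    // static asset and starts with the allowed prefix (expected in lowercase).
    static boolean shouldVisit(String href, String allowedPrefix) {
        String lower = href.toLowerCase();
        return !SKIPPED_EXTENSIONS.matcher(lower).matches()
                && lower.startsWith(allowedPrefix);
    }

    // Counts how many of the given links pass the filter.
    static long countVisitable(List<String> links, String allowedPrefix) {
        return links.stream()
                .filter(link -> shouldVisit(link, allowedPrefix))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical outgoing links; the paths are made up for illustration.
        List<String> outgoing = Arrays.asList(
                "http://forums.sgclub.com/example-board",
                "http://forums.sgclub.com/example-thread",
                "http://forums.sgclub.com/logo.png");

        // Prefix left at the default seed: no forum link is accepted.
        System.out.println(countVisitable(outgoing, "http://www.ics.uci.edu/"));

        // Prefix matching the new seed: the two page links are accepted.
        System.out.println(countVisitable(outgoing, "http://forums.sgclub.com/"));
    }
}
```

If the filter in use does contain a hard-coded prefix, changing it to match the new seed (here http://forums.sgclub.com/) would be the first thing to try before looking at timeouts.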

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:07