blabber / grawler

a gopherspace crawler

Strange result on some gophers #1

Open g0pherzilla opened 4 years ago

g0pherzilla commented 4 years ago

Hello. Why, when the entry point is zaledia.com, does the crawler not find all the links and get stuck on zenalio.ch? Maybe it depends on the number of threads? This is how the crawler was launched:

./grawler -ilogfile ilog.txt -crawlers 50

The grawler.dot content after the scan is complete:

strict strict digraph {
    "zaledia.com:70" [alive=true] 
    "zaledia.com:70" -> "zaledia.com:70" 
    "zaledia.com:70" -> "zenalio.ch:9999" 
    "zenalio.ch:9999" [alive=true] 
    "zenalio.ch:9999" -> "zenalio.ch:9999." 
}

Would it be more efficient to modify main.go so that it reads the entry points from an array?

Example:

// Bootstrap the crawling from several entry points.
lnks1 := [...]string{
	"sdf.org",
	"gopher.quux.org.",
	"gopher.floodgap.com.",
	"bitreich.org.",
	"uninformativ.de.",
	"gopher.viste.fr.",
}

.....

What if there are a lot of links?

Update after 2 hours: crawling without parameters took longer, but the content of the grawler.dot file remained unchanged.

g0pherzilla commented 4 years ago

Some background: in 2018, an enthusiast scanned Gopherspace with your tool and created a map: https://ibb.co/m8LWWr3 It's more extensive than yours because there were multiple entry points.

blabber commented 4 years ago

Sorry for the late answer. grawler uses a very naive and not very robust approach to crawling, and it is possible for it to get caught in an endless loop. It's difficult to identify the issue without further debugging. And to be honest, I am not that interested in this project anymore, because of this:
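One common guard against endless loops in a crawler is a visited set keyed on a normalized address, so that spellings such as `gopher.floodgap.com.` and `Gopher.Floodgap.com:70` count as the same host. The following is only a minimal sketch of that idea, not grawler's actual code: the names `normalizeAddr`, `visited`, and `firstVisit` are hypothetical, and defaulting to port 70 for bare hosts is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"strings"
	"sync"
)

// normalizeAddr (hypothetical helper) lowercases the host, strips one
// trailing root dot, and assumes port 70 when the address has no port.
func normalizeAddr(addr string) string {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		// Assumption: addr was a bare host name without a port.
		host, port = addr, "70"
	}
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	return net.JoinHostPort(host, port)
}

// visited is a set of normalized addresses that is safe to share
// between several crawler goroutines.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisited() *visited {
	return &visited{seen: make(map[string]bool)}
}

// firstVisit reports whether addr has not been seen before and marks it
// as seen, so each host is queued at most once.
func (v *visited) firstVisit(addr string) bool {
	key := normalizeAddr(addr)
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[key] {
		return false
	}
	v.seen[key] = true
	return true
}

func main() {
	seen := newVisited()
	for _, a := range []string{"gopher.floodgap.com.", "Gopher.Floodgap.com:70", "sdf.org"} {
		fmt.Println(a, seen.firstVisit(a))
	}
}
```

In this sketch the second spelling of the Floodgap host is reported as already visited, so a crawler using it would not queue that host twice.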

I don't think grawler should be used in its current state. It is not a well-behaved crawler. It is not formalized, but gopher holes may restrict crawlers via a robots.txt, and grawler ignores that. It would also be nice to prevent bursts of requests to a single gopher hole by delaying additional requests in order to spread out the load.
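The delay between requests to the same gopher hole could be sketched roughly as below. This is an illustration under assumptions, not part of grawler: `hostLimiter`, `reserve`, and `wait` are hypothetical names, and the two-second gap is an arbitrary value. robots.txt handling is not covered here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostLimiter (hypothetical) spaces out requests to the same host by at
// least minGap, while requests to different hosts are not delayed.
type hostLimiter struct {
	mu     sync.Mutex
	minGap time.Duration
	next   map[string]time.Time // earliest time the next request to a host may start
}

func newHostLimiter(minGap time.Duration) *hostLimiter {
	return &hostLimiter{minGap: minGap, next: make(map[string]time.Time)}
}

// reserve books the next free slot for host and returns how long the
// caller should wait, measured from now, before sending its request.
func (l *hostLimiter) reserve(host string, now time.Time) time.Duration {
	l.mu.Lock()
	defer l.mu.Unlock()
	slot := l.next[host]
	if slot.Before(now) {
		slot = now
	}
	l.next[host] = slot.Add(l.minGap)
	return slot.Sub(now)
}

// wait blocks the calling crawler goroutine until it may contact host.
func (l *hostLimiter) wait(host string) {
	time.Sleep(l.reserve(host, time.Now()))
}

func main() {
	lim := newHostLimiter(2 * time.Second) // assumed gap, tune as needed
	now := time.Now()
	fmt.Println(lim.reserve("sdf.org:70", now))      // first request to this host: no delay
	fmt.Println(lim.reserve("sdf.org:70", now))      // second request to the same host: delayed by minGap
	fmt.Println(lim.reserve("bitreich.org:70", now)) // different host: no delay
}
```

Each crawler goroutine would call `wait(host)` just before opening a connection. Because slots are booked under a mutex, many concurrent crawlers hitting one host queue up one gap apart instead of arriving in a burst.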

Without these issues being fixed, I would say: Don't use this software.

If someone is interested in fixing these issues, I will happily transfer maintainership/ownership of this project.