VIDA-NYU / domain_discovery_tool

This repository contains the Domain Discovery Tool (DDT) project. DDT is an interactive system that helps users explore and better understand a domain (or topic) as it is represented on the Web.
http://domain-discovery-tool.readthedocs.io/en/latest/index.html
GNU General Public License v3.0

Start crawl sends wrong seed to the crawler #53

Open aecio opened 6 years ago

aecio commented 6 years ago

When DDT sends the seed URL to the crawler, it appends the string ",1" to the end of the URL. That string may be the count of URLs shown in the recommendations box.

yamsgithub commented 6 years ago

This does not seem to be the case. The ACHE crawler log below, printed when the URLs are added, shows the seed URLs without any appended string:

[2017-08-03 15:50:34,238] INFO [qtp597874846-15] (FrontierManager.java:236) - Adding 3 seed URL(s)...
[2017-08-03 15:50:34,320] INFO [qtp597874846-15] (FrontierManager.java:248) - Added seed URL: http://answers.yahoo.com/dir/index/discover?sid=396545327
[2017-08-03 15:50:34,320] INFO [qtp597874846-15] (FrontierManager.java:248) - Added seed URL: http://answers.yahoo.com/dir/index/discover?sid=396545433
[2017-08-03 15:50:34,321] INFO [qtp597874846-15] (FrontierManager.java:248) - Added seed URL: http://answers.yahoo.com/

aecio commented 6 years ago

This issue is still happening, though it is not always appending ",1". Right now I'm seeing that it appended "1" to the URLs shown in "Crawling View" -> "Deep Crawling" -> "Domains for crawling".
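
To narrow down where the suffix is introduced, it may help to check the seed list just before it is handed to the crawler. The following is a minimal sketch, not DDT code: the helper names `strip_trailing_count` and `find_suspicious_seeds` are hypothetical, and it assumes the unwanted suffix always has the form of a comma followed by digits (as in ",1"). It would not catch the case where a bare "1" is appended, and a legitimate URL could in principle end in ",1", so flagged URLs should be reviewed rather than rewritten blindly.

```python
import re

# Hypothetical pattern: a comma followed by one or more digits at the end of the string.
_TRAILING_COUNT = re.compile(r",\d+$")


def strip_trailing_count(url: str) -> str:
    """Return the URL without a trailing ",<digits>" suffix, if one is present."""
    return _TRAILING_COUNT.sub("", url)


def find_suspicious_seeds(urls):
    """Return the URLs that end with a ",<digits>" suffix, for manual inspection."""
    return [u for u in urls if _TRAILING_COUNT.search(u)]


if __name__ == "__main__":
    seeds = [
        "http://answers.yahoo.com/",
        "http://answers.yahoo.com/,1",
    ]
    for seed in find_suspicious_seeds(seeds):
        print("suspicious seed:", seed, "->", strip_trailing_count(seed))
```

Logging the flagged seeds on the DDT side and comparing them with the "Added seed URL" lines in the ACHE log would indicate whether the suffix is added before or after the request reaches the crawler.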