asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

WebURL cannot parse the domain if the URL is an IP address or includes a port number #223

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Parse a URL whose host is an IP address or that includes a port number.

What is the expected output? What do you see instead?

What version of the product are you using?

Please provide any additional information below.
At a minimum, the domain should be parsed correctly when the URL includes a port number. The code below
should be added to WebURL.setURL:
// if a port exists in the URL
if (domain.indexOf(":") > 0) domain = domain.substring(0, domain.indexOf(":"));

If the URL is an IP address, the domain can be set to "Unknown":
if (TLDList.getInstance().contains(domain)) {
    *****
} else if (parts[parts.length - 1].matches("\\d+")) { // URL is an IP address or numeric, e.g. 222.121.2.32:8080
    domain = "Unknown";
}
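
For illustration, the two suggestions above can be combined into a small standalone sketch. It is not the actual WebURL.setURL code: the class name DomainSketch, the method extractDomain, and the UNKNOWN_DOMAIN placeholder are hypothetical. It also does not use crawler4j's TLDList, so the "keep the last two labels" step is a simplifying assumption that is wrong for suffixes such as co.uk. IPv6 literals are not handled.

```java
class DomainSketch {
    // Placeholder returned for IP-address hosts; the exact value is an assumption.
    static final String UNKNOWN_DOMAIN = "Unknown";

    static String extractDomain(String url) {
        // Isolate the host[:port] part: the text between "//" and the next "/".
        int start = url.indexOf("//");
        start = (start < 0) ? 0 : start + 2;
        int end = url.indexOf('/', start);
        String domain = (end < 0) ? url.substring(start) : url.substring(start, end);

        // Strip the port if one is present, as proposed in the issue.
        int colon = domain.indexOf(':');
        if (colon > 0) {
            domain = domain.substring(0, colon);
        }

        String[] parts = domain.split("\\.");

        // If the last label is all digits, treat the host as an IP address.
        if (parts[parts.length - 1].matches("\\d+")) {
            return UNKNOWN_DOMAIN;
        }

        // Simplification (assumption): keep only the last two labels
        // instead of consulting a TLD list.
        if (parts.length > 2) {
            domain = parts[parts.length - 2] + "." + parts[parts.length - 1];
        }
        return domain;
    }
}
```

With this sketch, extractDomain("http://www.example.com:8080/index.html") returns "example.com", and extractDomain("http://222.121.2.32:8080/") returns the placeholder instead of a port-polluted or numeric domain.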

Original issue reported on code.google.com by zuyuan.l...@gmail.com on 2 Jun 2013 at 6:22