-
```
What steps will reproduce the problem?
1. Put your robots.txt at http://localhost/robots.txt with these lines:
User-agent: *
Disallow: /
2. Crawl some page of localhost.
3. You will get the contents…
```
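The behaviour described above — fetching a page despite a blanket `Disallow: /` — is what a compliant crawler must not do. A minimal sketch of the expected check, using Python's standard-library `urllib.robotparser` rather than Heritrix's own implementation (the user-agent name `MyCrawler` is illustrative):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the report: every path is disallowed
# for every user agent.
rules = "User-agent: *\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler must skip every page on this host.
allowed = parser.can_fetch("MyCrawler", "http://localhost/some/page.html")
print(allowed)  # → False
```

Any fetch of `http://localhost/...` should be refused before the HTTP request is made; the bug report suggests the check was skipped or its result ignored.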
-
```
What steps will reproduce the problem?
1. If the URL contains '\', for example:
http://www.lngs.gov.cn/newFormsFolders\LNGS_FORMS_633800715869843750XQJ.doc
the browser can recognize the URL.
What i…
```
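Browsers accept such URLs because the WHATWG URL parsing rules treat `\` as `/` in the path of http(s) URLs. One plausible fix is to apply the same normalization before fetching; a sketch with a hypothetical helper (not Heritrix code, and assuming the normalization should apply only to http/https paths):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_backslashes(url: str) -> str:
    """Mimic browser behaviour: in http(s) URLs, treat backslashes
    in the path as forward slashes. Hypothetical helper for
    illustration only."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if scheme in ("http", "https"):
        path = path.replace("\\", "/")
    return urlunsplit((scheme, netloc, path, query, fragment))

url = "http://www.lngs.gov.cn/newFormsFolders\\LNGS_FORMS_633800715869843750XQJ.doc"
print(normalize_backslashes(url))
# → http://www.lngs.gov.cn/newFormsFolders/LNGS_FORMS_633800715869843750XQJ.doc
```

Whether the crawler should normalize (as browsers do) or percent-encode the backslash is a policy choice; either way the raw `\` must not reach the HTTP request line unhandled.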
-
**S'sheet line:** 10
**For whom?** BNF, DN
**Notes:** CDX/indexing consequences? Need a test case. Heritrix issues, maybe just H1, so need H1 and H3 test cases.
**Est. Milestone:** 2.x.x
-
```
http://item.taobao.com/item.htm?id=13619643886
When crawling this item's information, the following mojibake appears and the page cannot be crawled; however, when the page is imported into a Java project for testing, the test passes.
The error log is as follows:
2012-06-03T15:23:06.025Z -5 112753
http://item.taobao.com/item.htm?id=13…
```
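The garbled characters in this report are typical of bytes decoded with the wrong charset. Taobao pages have historically been served as GBK, so one plausible cause (an assumption, not confirmed by the report) is the crawler decoding GBK bytes as UTF-8. A sketch of how the mismatch produces exactly this kind of mojibake:

```python
# A Chinese string ("product information") encoded as GBK,
# standing in for the page bytes.
raw = "商品信息".encode("gbk")

# Decoding GBK bytes as UTF-8 yields replacement characters,
# matching the mojibake in the error report.
garbled = raw.decode("utf-8", errors="replace")
print(garbled)  # replacement-character output, e.g. '����'-style

# Decoding with the charset the server actually used recovers the text.
print(raw.decode("gbk"))  # → 商品信息
```

Checking the `Content-Type` charset (and any `<meta charset>` declaration) before decoding would distinguish this from a genuine parser bug.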