-
```
What steps will reproduce the problem?
1. Put your robots.txt at http://localhost/robots.txt with these lines:
User-agent: *
Disallow: /
2. Crawl some page of localhost
3. You will get the contents…
```
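For context, this is the check a compliant crawler should make before fetching; a minimal sketch using Python's standard urllib.robotparser, with the localhost URLs taken from the report (the sample page path is illustrative):

```
# Minimal sketch of the robots.txt check a compliant crawler performs.
# Assumes the rules above are being served at http://localhost/robots.txt.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://localhost/robots.txt")
rp.read()

# With "User-agent: *" and "Disallow: /", no path may be fetched.
print(rp.can_fetch("*", "http://localhost/some/page.html"))  # expected: False
```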
-
```
What steps will reproduce the problem?
1. Try giving arc2warc.sh an input directory (-d) that contains
'arc', e.g. arc2warc.sh -d /home/crawler/heritrix-1.12.1/jobs/test/arcs/
What is the expec…
```
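If the -d directory form is what fails, one possible workaround is to drive the script once per ARC file; a hedged sketch, assuming arc2warc.sh accepts a single file argument (its actual CLI may differ):

```
# Hypothetical workaround: convert each ARC individually instead of
# passing the whole directory via -d. The directory path is from the
# report; whether arc2warc.sh takes a bare file argument is an assumption.
import pathlib
import subprocess

arc_dir = pathlib.Path("/home/crawler/heritrix-1.12.1/jobs/test/arcs")
for arc in sorted(arc_dir.glob("*.arc*")):
    subprocess.run(["arc2warc.sh", str(arc)], check=True)
```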
-
```
What steps will reproduce the problem?
1. Spider a website
2. Start a new session
What is the expected output? What do you see instead?
Would it be possible to get an option to clear it or som…
```
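The requested behavior amounts to scoping spider results to a session so that starting a new one drops the old crawl's URLs; a minimal illustrative sketch (class and method names are hypothetical, not the tool's actual API):

```
# Hypothetical sketch of the requested "clear" option: tie spider results
# to a session object so a new session starts empty.
class SpiderSession:
    def __init__(self) -> None:
        self.found_urls: set[str] = set()

    def record(self, url: str) -> None:
        self.found_urls.add(url)

    def clear(self) -> None:
        """The requested option: drop results left over from prior spidering."""
        self.found_urls.clear()
```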
-
```
What steps will reproduce the problem?
1. If the URLs contain '\'
Example:
http://www.lngs.gov.cn/newFormsFolders\LNGS_FORMS_633800715869843750XQJ.doc
The browser can recognize the URL
What i…
```
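Browsers generally recover such links by treating a backslash in the path as a forward slash; a minimal sketch of that normalization, which a crawler could apply before parsing the URL (the replace-all approach is an assumption, not the project's actual fix):

```
# Hypothetical normalization: rewrite '\' in the URL path to '/',
# mirroring how browsers recover such links. The sample URL is from
# the report; splitting out the path keeps the scheme and host intact.
from urllib.parse import urlsplit, urlunsplit

raw = "http://www.lngs.gov.cn/newFormsFolders\\LNGS_FORMS_633800715869843750XQJ.doc"

parts = urlsplit(raw)
fixed = urlunsplit(parts._replace(path=parts.path.replace("\\", "/")))
print(fixed)
# http://www.lngs.gov.cn/newFormsFolders/LNGS_FORMS_633800715869843750XQJ.doc
```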
-
Observe the log when WAIL first starts: there are errors from phantomjs, which I believe are caused by Heritrix not yet being accessible.
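One way to avoid that race is to delay launching phantomjs until Heritrix's web interface answers; a minimal sketch, assuming the Heritrix UI listens on port 8443 (the port and retry budget are assumptions about this setup, not WAIL's actual configuration):

```
# Hypothetical startup guard: poll the Heritrix port before spawning
# phantomjs, instead of starting both processes at once.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection succeeds, False after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(0.5)
    return False

if wait_for_port("localhost", 8443):  # assumed Heritrix web UI port
    pass  # safe to launch phantomjs here
```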
-
Looking at a Heritrix request with tcpdump, you can see that a separate TCP packet is sent for each of the characters 'G' 'E' 'T' ' ' '/' ' ' 'H' 'T' 'T'… at the beginning of each HTTP request. We r…
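That packet pattern is the classic signature of writing the request line one byte at a time to an unbuffered socket (made visible on the wire when TCP_NODELAY disables Nagle coalescing). A minimal sketch of the buggy versus buffered write patterns, not Heritrix's actual I/O code:

```
# Illustrative only: one send() per byte yields one TCP segment per
# character; assembling the request and writing it once avoids that.
# Host and port are placeholders.
import socket

request = b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n"

with socket.create_connection(("localhost", 80)) as s:
    # Buggy pattern: packets for 'G', 'E', 'T', ' ', '/', ...
    # for byte in request:
    #     s.send(bytes([byte]))

    # Fixed pattern: a single buffered write for the whole request.
    s.sendall(request)
```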