-
e.g., leaving both the FIX and KILL buttons clickable.
``` sh
Trying to access Heritrix service at https://127.0.0.1:8443
Failed to access Heritrix service at https://127.0.0.1:8443
Trying to access …
-
An alert appears when the app is first installed, saying "To open 'java', you need a Java SE runtime. Would you like to install one now?"
[Not Now] [Install]
Clicking Install instructs the system's S…
-
```
What steps will reproduce the problem?
1. Put your robots.txt at http://localhost/robots.txt with these lines:
User-agent: *
Disallow: /
2. Crawl some page on localhost
3. You will get the contents…
-
```
What steps will reproduce the problem?
1. if the urls contain '\'
example:
http://www.lngs.gov.cn/newFormsFolders\LNGS_FORMS_633800715869843750XQJ.doc
the browser can recognize the URL
What i…
-
@aliceranzhou I noticed you have in a few places this magical incantation that allows the records to be serialized:
```
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
…
-
```
What steps will reproduce the problem?
1. try giving arc2warc.sh an input directory (-d ) that contains
'arc' - e.g. arc2warc.sh -d /home/crawler/heritrix-1.12.1/jobs/test/arcs/
What is the expec…
-
`RobotRule.blocksPathForUA(String, String)` returns `false` for any paths with this robots.txt:
```
User-agent: *
Disallow:
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow…