-
```
See "Order of precedence for group-member records" section at the end of
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
```
Original issue reported on code.google…
-
Hi,
I want to modify the IndexingMapReduce.java file from nutch-indexing, but I'm not able to recompile it back into the nutch-1.2.jar file. When I ran the provided build.xml file, it complained that it …
-
```
What steps will reproduce the problem?
Running the crawler sometimes crashes the JVM. I crawl around 10 web sites
regularly with pages between 1K and 50K. This happens randomly but happens very …
-
Hi.
I am using CDH5 5.0.2, which is the latest,
and I have downloaded the latest HiBench source.
All benchmark suites (wordcount, terasort, kmeans, hive bench, etc.) run fine except nutchindexing.…
Jaeki updated
9 years ago
-
I am trying to use extractor as an HTML/index filter, but I am getting an NPE when it tries to load the config file, even though I have an extractors.xml file in the conf directory. Here is th…
-
I'm using the Apache Nutch crawler to index websites. The default behaviour is to use the URL as the unique identifier, which seemed like a good idea until now.
If the exported index contains fields …
-
Hi, I am trying to run the Nutchindexing job (https://github.com/hibench/HiBench-2.1/tree/f1d43780f5ae813ccd4e891e353429e7871c9c41). It reports "Total input paths to process is 0".
Can anyone help m…