-
For a long time, AGLS metadata has been described as a 'mandatory requirement' for government websites.
To the best of my knowledge, this requirement originated in the [AGIMO Web Guide](http://webguide.go…
-
I recently attempted to process the sitemap located at https://www.autotrader.com/sitemap.xml
As you can see, the XML represents a `sitemapindex` as follows...
```
https://www.aut…
-
The WARC response record header field [WARC-Identified-Payload-Type](https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#warc-identified-payload-type…
-
I have ported the file protocol implementation from Apache Nutch to StormCrawler. See the implementation here: (https://gist.github.com/isspek/32e9d762666593b4781ef3a0155dd74b) It works but needs revision. …
-
Hi,
Am I missing the URL filter? How can I tell the Sparkler app to apply URL filter rules?
In general, or per domain?
Thanks
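
For reference, Nutch-style regex URL filters live in a plain-text file where each line starts with `+` (include) or `-` (exclude) followed by a regular expression, and the first matching rule wins. Assuming Sparkler's regex URL filter plugin follows the same conventions (an assumption; check the plugin's own documentation and config keys), a minimal filter file covering both the general and the per-domain case might look like:

```
# Hypothetical Nutch-style regex URL filter file; first matching rule wins.
# Skip common binary/media extensions (general rule).
-\.(gif|jpg|png|zip|gz)$
# Restrict the crawl to a single domain (per-domain rule).
+^https?://([a-z0-9-]+\.)*example\.com/
# Exclude everything else.
-.
```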
-
I got this error while scanning data from HBase (Apache Nutch data):
```
CompilerException clojure.lang.ExceptionInfo: :auto not supported on headerless data. {}, compiling
```
-
Ran this command: `java -jar sparkler-app/target/sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1488053397220 -i 2`
Got a NullPointerException:
2017-02-27 14:20:25 WARN ParseFunction$:63 [Execut…
-
If the round-trip conversion String → java.net.URL → String yields a different URL string, the crawl topology fails to update the status of fetched items properly. This happens if injected URLs contain trailing…
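
As an illustration of the kind of round trip meant here (the URL and the trailing whitespace below are hypothetical examples, not taken from the report): `java.net.URL`'s string constructor trims leading and trailing whitespace while parsing, so `toString()` can return a different string than the one injected:

```java
import java.net.URL;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical injected URL with a trailing space.
        String injected = "http://example.com/page ";
        // new URL(String) strips leading/trailing whitespace while parsing,
        // so the round trip String -> URL -> String changes the value.
        String roundTripped = new URL(injected).toString();
        System.out.println(roundTripped.equals(injected)); // prints "false"
    }
}
```

If status updates are keyed on the original injected string, the mismatch above would explain why fetched items are never matched back to their entries.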
-
What would be the recommended approach to exclude outgoing links (to other domains) for subsequent crawls in Sparkler by default?
-
Parsing works fine in the local environment but produces errors when running in a Spark environment.
**Cause:** https://github.com/USCDataScience/nutch-analytics/blob/master/src/main/scala/g…