-
For a long time, AGLS metadata has been described as a 'mandatory requirement' for government websites.
To the best of my knowledge, this requirement originated in the [AGIMO Web Guide](http://webguide.go…
-
I recently attempted to process the sitemap located at https://www.autotrader.com/sitemap.xml
As you can see, the XML represents a `sitemapindex` as follows...
```
https://www.aut…
-
The WARC response record header field [WARC-Identified-Payload-Type](https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#warc-identified-payload-type…
-
I have ported the file protocol implementation from Apache Nutch to StormCrawler. See the implementation here: (https://gist.github.com/isspek/32e9d762666593b4781ef3a0155dd74b) It works but needs revision. …
-
Hi,
Am I missing the URL filter? How can I tell the Sparkler app to apply URL filter rules?
In general, or per domain?
Thanks
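
For reference, Nutch-style regex URL filters live in a plain-text file where each line starts with `+` (include) or `-` (exclude) followed by a regular expression, and the first matching rule wins. Assuming Sparkler's regex URL filter plugin follows the same conventions (an assumption; check the plugin's own documentation and config keys), a minimal filter file covering both the general and the per-domain case might look like:

```
# Hypothetical Nutch-style regex URL filter file; first matching rule wins.
# Skip common binary/media extensions (general rule).
-\.(gif|jpg|png|zip|gz)$
# Restrict the crawl to a single domain (per-domain rule).
+^https?://([a-z0-9-]+\.)*example\.com/
# Exclude everything else.
-.
```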
-
I got this error while scanning data from HBase (Apache Nutch data):
```
CompilerException clojure.lang.ExceptionInfo: :auto not supported on headerless data. {}, compiling
```
-
Ran this command: `java -jar sparkler-app/target/sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1488053397220 -i 2`
Got a NullPointerException:
2017-02-27 14:20:25 WARN ParseFunction$:63 [Execut…
-
If the round-trip conversion String → java.net.URL → String yields a different URL string, the crawl topology fails to update the status of fetched items properly. This happens if injected URLs contain trailing…
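
As an illustration of the kind of round trip meant here (the URL and the trailing whitespace below are hypothetical examples, not taken from the report): `java.net.URL`'s string constructor trims leading and trailing whitespace while parsing, so `toString()` can return a different string than the one injected:

```java
import java.net.URL;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical injected URL with a trailing space.
        String injected = "http://example.com/page ";
        // new URL(String) strips leading/trailing whitespace while parsing,
        // so the round trip String -> URL -> String changes the value.
        String roundTripped = new URL(injected).toString();
        System.out.println(roundTripped.equals(injected)); // prints "false"
    }
}
```

If status updates are keyed on the original injected string, the mismatch above would explain why fetched items are never matched back to their entries.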
-
What would be the recommended approach to exclude outgoing links (to other domains) for subsequent crawls in Sparkler by default?
-
Parsing works fine in the local environment but produces errors when running in a Spark environment.
**Cause:** https://github.com/USCDataScience/nutch-analytics/blob/master/src/main/scala/g…