-
I am seeing a lot of FETCH_ERROR statuses (about ten percent of one million French URLs).
In the debug output I can see this error: org.apache.http.NoHttpResponseException: The target server failed to respond
Sometimes, it'…
-
I followed the tutorial at http://stormcrawler.net/getting-started/ along with the YouTube video.
After typing the following line on my command line,
I got an error from my spout:
storm jar target/tutorial-1.0-SNA…
-
(see [NUTCH-2729](https://issues.apache.org/jira/browse/NUTCH-2729) and commoncrawl/nutch#10 for the same issue in Nutch)
The marking of trimmed content (by content limit) is not reliable and repro…
-
Hi there!
Thank you for creating this project! It was just what I was looking for to test upgrading to Storm 1.0.1.
I've copied your docker-compose configuration and it seems to be running, but I a…
-
**The used environment:**
- Default ES cluster with Kibana deployed on Docker Swarm, both at version 7.0.1
- Step by step creation of the topology based on your [guide](https://github.com/DigitalPeb…
-
The news crawler uses the domain name to manage fetch queues. The domain name is also used to route URLs to Elasticsearch shards. When a URL is re-fetched the existing routing key isn't reused, instea…
-
When a (local) topology is killed and no tuples have been passed to the WARCHdfsBolt, cleanup() raises an NPE:
```
68227 [Thread-91-warc-executor[36 36]] INFO o.a.s.util - Async loop interru…
-
Hi,
I just followed the README.
I created the uber-jar using `mvn clean package`,
but I am getting this error:
Error: Could not find or load main class com.digitalpebble.stormcrawler.elasticsear…
-
Hi,
this might be a silly question, but still.
I noticed that `SiteMapParser.parseSiteMap()` returns `AbstractSiteMap`, can you give me some examples of how this is intended to be used?
Thank…
-
#645 was a good idea in theory but needs fixing. The idea was to prevent pages from having their outlinks followed unless they had been flagged as being a sitemap (or not); basically, we have sitemaps…