-
The news crawler is configured to be polite, with a guaranteed fetch delay of a few seconds. However, some robots.txt rules define a crawl-delay below one second, which then overrides the configured …
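For reference, the politeness settings involved look roughly like the sketch below (key names follow StormCrawler's configuration conventions; the values are illustrative assumptions, not recommendations):

```yaml
# crawler-conf.yaml (sketch; values are assumptions)
fetcher.server.delay: 5.0     # guaranteed delay in seconds between requests to the same host
fetcher.max.crawl.delay: 30   # ignore robots.txt crawl-delays longer than this many seconds
```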
-
If the crawl topology dies or is killed, the WARC file is not properly closed. This causes an error when decompressing the WARC file: `gzip: CC-NEWS-20160926233041-00001.warc.gz: unexpected end of fil…
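Until the topology closes WARC files cleanly on shutdown, the intact leading records of a truncated file can usually still be salvaged. A minimal sketch in Python (`salvage_gzip` is a hypothetical helper, not part of StormCrawler):

```python
import gzip

def salvage_gzip(path, chunk_size=4096):
    """Read as much as possible from a possibly-truncated gzip file.

    A WARC file whose writer was killed mid-record typically ends in an
    incomplete gzip member; plain `gzip -d` aborts with "unexpected end
    of file", but the complete records before the cut are recoverable.
    """
    data = bytearray()
    with gzip.open(path, "rb") as f:
        try:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                data.extend(chunk)
        except EOFError:
            pass  # truncated final member: keep everything read so far
    return bytes(data)
```

The salvaged bytes are a clean prefix of the original WARC stream, so any complete records they contain can be reprocessed normally.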
-
-
When crawling with a high value for the **http.content.limit** configuration property, we get this exception:
`java.lang.StackOverflowError at org.jsoup.helper.StringUtil.stringBuilder(StringUti…
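Two mitigations are commonly tried for deep-recursion failures like this; both values below are illustrative assumptions, not tested recommendations:

```yaml
# crawler-conf.yaml (sketch; values are assumptions)
http.content.limit: 1048576            # cap fetched content at ~1 MB instead of unlimited
# or give worker JVM threads a deeper stack so the parser's recursion can complete
topology.worker.childopts: "-Xss10m"
```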
-
Upgrade StormCrawler ES version to 6.5.1
The latest ES release 6.5.1 contains several bug fixes targeting ES 6.5.0 related to "aggregation", see https://www.elastic.co/guide/en/elasticsearch/reference…
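Assuming the ES client is pulled in via Maven, the bump would look something like this (the exact artifact coordinates used by the project are an assumption here):

```xml
<dependency>
  <groupId>org.elasticsearch.client</groupId>
  <artifactId>elasticsearch-rest-high-level-client</artifactId>
  <version>6.5.1</version>
</dependency>
```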
-
From looking at my scraped results for webmd.com, it seems it may not, and I guess it's too much to expect that it would, since that would be very complicated. But I figured I'd ask anyway to double…
-
This is a bug caused by the changes introduced in #653
-
The value of the `If-Modified-Since` header must follow the date format defined in [RFC 7231](https://tools.ietf.org/html/rfc7231#section-7.1.1.1) (e.g., "Thu, 22 Dec 2017 07:51:19 UTC", see also [RFC…
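For illustration, RFC 7231's IMF-fixdate can be produced with Python's standard library (a sketch, independent of StormCrawler's own Java implementation):

```python
from email.utils import formatdate

def imf_fixdate(epoch_seconds: float) -> str:
    """Format a timestamp as an RFC 7231 IMF-fixdate, suitable for If-Modified-Since.

    usegmt=True emits the required literal "GMT" zone name rather than a
    numeric offset; "UTC" is not a valid zone token in that grammar.
    """
    return formatdate(epoch_seconds, usegmt=True)

# imf_fixdate(0) -> "Thu, 01 Jan 1970 00:00:00 GMT"
```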
-
I have a feeling I'm misunderstanding something. I assumed StormCrawler worked similarly to Nutch and that it would store the parsed contents of each fetched page directly in Elasticsearch as it crawls.…
-
I'm using everything out of the box - the latest StormCrawler with the latest Elasticsearch components on GitHub and the latest Elasticsearch (though I'm currently trying an older version of ES but getting…