-
The [URLFrontier Spout](https://github.com/apache/incubator-stormcrawler/blob/main/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/Spout.java) doesn't take into account the cr…
-
340 WARC files of the news crawl data set, from 2020-09-12 until 2020-10-04, were captured using [HTTP/2](https://en.wikipedia.org/wiki/HTTP/2) after a [Java security upgrade](https://mai…
-
Upgrade Apache Storm, Elasticsearch and Kibana
This way the NewsCrawler will benefit from the many bugfixes and improvements provided by these components, and it will be easier to add new functionaliti…
-
What kind of issue is this?
- [ ] Question. This issue tracker is not the best place for questions. If you want to ask how to do
something, or to understand why something isn't working the…
-
See <https://github.com/DigitalPebble/storm-crawler/issues/401>
-
This is a common task for all crawlers; see, for instance, [this discussion in StormCrawler](https://github.com/DigitalPebble/storm-crawler/issues/438).
There is code for that in [Tika](https://github…
-
The [website](https://stormcrawler.apache.org/getting-started/) also needs fixing.
-
The request records in the CC-NEWS WARC files lack the HTTP protocol version:
```
GET /path
```
instead of
```
GET /path HTTP/1.1
```
This makes some WARC parsers fail to process the WARC fil…
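
As a rough illustration of the difference between the two request lines above, the following Python sketch checks whether a request line carries a protocol version and, if not, appends one. The function names and the `HTTP/1.1` default are assumptions made for this example, not part of any crawler or WARC library; the protocol actually used for a given capture may differ, so a real fix would need to take the version from the fetcher rather than guess it.

```python
import re

# Loose shape of an HTTP request line: METHOD SP request-target [SP HTTP-version].
# The version part is optional here so that truncated lines can be detected.
_REQUEST_LINE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<target>\S+)(?: (?P<version>HTTP/\d(?:\.\d)?))?$"
)


def has_protocol_version(request_line: str) -> bool:
    """Return True if the request line ends with an HTTP version token."""
    match = _REQUEST_LINE.match(request_line.strip())
    return bool(match and match.group("version"))


def with_protocol_version(request_line: str, default_version: str = "HTTP/1.1") -> str:
    """Return the request line, appending default_version if none is present.

    default_version is an assumption for illustration only.
    """
    line = request_line.strip()
    match = _REQUEST_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognisable request line: {request_line!r}")
    if match.group("version"):
        return line
    return f"{line} {default_version}"


if __name__ == "__main__":
    for line in ("GET /path", "GET /path HTTP/1.1"):
        print(repr(line), has_protocol_version(line), repr(with_protocol_version(line)))
```

A check like this could be used to flag affected request records before handing a file to a strict WARC parser.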