-
The LICENSE and NOTICE are in the RC jars, but the DISCLAIMER is not. I'm not sure if it is 100% required, but many podlings include it.
I checked stormcrawler-core-3.1.0.jar.
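For anyone who wants to repeat the check on other jars, here is a minimal sketch using only the Python standard library. It assumes the legal files live under `META-INF/` (optionally with an extension such as `.txt`); the function name `missing_legal_files` is made up for illustration and is not part of the project.

```python
import zipfile

# File names we expect to find under META-INF/ (assumption for this sketch).
REQUIRED = ("LICENSE", "NOTICE", "DISCLAIMER")


def missing_legal_files(jar_path, required=REQUIRED):
    """Return the required legal file names not found under META-INF/ in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        names = jar.namelist()

    present = set()
    for name in names:
        if name.startswith("META-INF/"):
            # "META-INF/NOTICE.txt" -> "NOTICE"
            base = name[len("META-INF/"):].split(".")[0].upper()
            present.add(base)

    return [r for r in required if r not in present]
```

Calling `missing_legal_files("stormcrawler-core-3.1.0.jar")` on a local copy of the jar lists whichever of the three names is absent.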
-
This is a big one, but it's possible that most of this crawler should be replaced with Apache Nutch or similar. I originally hacked this out as a proof-of-concept but as usual, it grew a bit from the…
-
The [maven archetype pom](https://github.com/apache/incubator-stormcrawler/blob/main/archetype/src/main/resources/archetype-resources/pom.xml#L35) has `storm.version 2.6.2`. When using this with `stor…
-
I checked the 3.1.0 RC jars and the files in stormcrawler-core-3.1.0.jar all had this date.
07-04-2024
I am in favour of reproducible builds, meaning that files in the jar may not have today's date b…
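A quick way to see which dates the entries of a jar carry is sketched below, again with the Python standard library only. The helper name `entry_dates` is hypothetical. A single date shared by every entry is what a fixed build timestamp would produce.

```python
import zipfile
from collections import Counter


def entry_dates(jar_path):
    """Count jar entries per last-modified date, formatted as YYYY-MM-DD."""
    with zipfile.ZipFile(jar_path) as jar:
        # ZipInfo.date_time is (year, month, day, hour, minute, second).
        return Counter(
            "%04d-%02d-%02d" % info.date_time[:3] for info in jar.infolist()
        )
```

If the returned counter has exactly one key, all files in the jar share the same date.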
-
The methods of the initial version of the metrics are marked as deprecated. We should port the code to use the new mechanism.
-
This is a reminder that we need to address:
> Are 3rd parties respecting and correctly using the podlings
name and brand? If not what actions has the PPMC taken to
correct this? Has the VP, Brand app…
-
I have a few specific questions regarding the usage and features of StormCrawler that I hope you could clarify:
Storage of Textual Information: When using StormCrawler with Elasticsearch as outline…
-
Nutch's protocol-okhttp has supported HTTP/2 since its introduction in 2018. However, the WARC writer does not.
The following points need to be addressed:
- [x] protocol-okhttp: record HTTP and SSL/TLS ve…
-
Scripts and software for automated scraping must follow robots.txt rules; otherwise the user may be liable for unauthorised use of data.
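To illustrate what following robots.txt means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how this crawler handles robots.txt; the helper name `is_allowed`, the user agent `mybot` and the example URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the robots.txt body directly, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "mybot", "https://example.com/index.html"))
print(is_allowed(rules, "mybot", "https://example.com/private/data"))
```

A scraper would run a check like this before every request and skip any URL for which it returns `False`.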
-
I currently own the domain and will get it redirected to the Apache version of the site.