apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Inquiry About StormCrawler Features and Capabilities #1253

Closed alikaz3mi closed 2 months ago

alikaz3mi commented 2 months ago

I have a few specific questions regarding the usage and features of StormCrawler that I hope you could clarify:

Storage of Textual Information: When using StormCrawler with Elasticsearch as outlined in your documentation, will all the textual information from the crawled websites be stored directly in Elasticsearch?

Handling Multimedia Content: How does StormCrawler manage images and other multimedia content found on websites? Are these types of content also stored in Elasticsearch, or do they require a different approach or storage solution?

Crawling Authenticated Websites: Is StormCrawler capable of crawling websites that require user authentication? If so, how can I provide authentication details (e.g., usernames and passwords) to enable StormCrawler to access and crawl these sites?

Your insights and guidance on these questions would be immensely helpful for the successful implementation of my project. I am excited about the potential of using StormCrawler and look forward to understanding its full capabilities.

jnioche commented 2 months ago

Hi @alikaz3mi,

Storage of Textual Information: When using StormCrawler with Elasticsearch as outlined in your documentation, will all the textual information from the crawled websites be stored directly in Elasticsearch?

The Elasticsearch module has been removed from StormCrawler due to licensing issues. You can use OpenSearch as an alternative. As explained in the documentation, the textual content of the pages is stored in the content index. A number of fields are configured to be indexed by default, but the set of indexed fields can be extended.

Handling Multimedia Content: How does StormCrawler manage images and other multimedia content found on websites? Are these types of content also stored in Elasticsearch, or do they require a different approach or storage solution?

By default, StormCrawler does not crawl or index multimedia files, but it can be done (in fact, several organisations do that with StormCrawler on a large scale). You will have to use a custom bolt to store the content. You could put it in OpenSearch, but other forms of storage are probably more appropriate, depending on your use case.
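As a rough illustration of the storage step such a custom bolt might perform, here is a minimal sketch in plain Java. It assumes the bolt has already pulled the page URL and the raw bytes out of the incoming tuple; the Storm and StormCrawler wiring is deliberately left out. `BinaryContentStore`, `keyFor` and `store` are hypothetical names, not part of StormCrawler, and the local filesystem only stands in for whatever storage suits your use case.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not a StormCrawler class): writes fetched binary content to disk. */
final class BinaryContentStore {

    private final Path root;

    BinaryContentStore(Path root) {
        this.root = root;
    }

    /** Derives a stable, filesystem-safe key from the URL (hex-encoded SHA-256). */
    static String keyFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Stores the bytes under root/<key> and returns the path that was written. */
    Path store(String url, byte[] content) throws IOException {
        Files.createDirectories(root);
        Path target = root.resolve(keyFor(url));
        return Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        // Assumed inputs: in a real bolt these would come from the tuple.
        Path dir = Files.createTempDirectory("media-store");
        BinaryContentStore store = new BinaryContentStore(dir);
        Path written = store.store("https://example.com/logo.png", new byte[] {1, 2, 3});
        System.out.println("Stored " + Files.size(written) + " bytes at " + written);
    }
}
```

Keying files by a hash of the URL avoids problems with characters that are not valid in file names and makes repeated fetches of the same URL overwrite the earlier copy. An object store or a distributed filesystem could replace the local directory without changing the overall shape.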

Crawling Authenticated Websites: Is StormCrawler capable of crawling websites that require user authentication? If so, how can I provide authentication details (e.g., usernames and passwords) to enable StormCrawler to access and crawl these sites?

See the Protocols wiki page: https://github.com/apache/incubator-stormcrawler/wiki/Protocols

Basic authentication is currently supported; see https://github.com/apache/incubator-stormcrawler/blob/701999eb56c5ebe5632b012a2f0771d6538425aa/core/src/main/java/com/digitalpebble/stormcrawler/protocol/okhttp/HttpProtocol.java#L157

Please note that it is not currently possible to handle authentication per hostname or domain - only a single pair of username / password can be set.
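For context, HTTP Basic authentication (RFC 7617) boils down to one request header built from the username and password. The sketch below shows how that header value is formed in plain Java. It is not StormCrawler's code: `BasicAuthHeader` and `build` are hypothetical names used only for illustration, and the configuration keys that carry the credentials are defined in the linked HttpProtocol source rather than reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper (not a StormCrawler class): builds an HTTP Basic auth header value. */
final class BasicAuthHeader {

    /** Returns "Basic " followed by base64("user:password"), as described in RFC 7617. */
    static String build(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Example credentials from RFC 7617; replace with your own.
        System.out.println("Authorization: " + build("Aladdin", "open sesame"));
        // prints "Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    }
}
```

Note that base64 is an encoding, not encryption, so Basic credentials are only protected in transit when the target site is served over HTTPS.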

jnioche commented 2 months ago

@alikaz3mi I will close this issue for now. For questions like these, it is probably best to use the Discussions section instead of issues.