Closed MyraBaba closed 7 years ago
I seeded 400 URLs, but only 80 of them ended up in ES; I don't know why.
Try with a larger TTL value if you use one
Can you describe your topology + share your full config including the ES one?
Here are the config files from the elasticsearch example folder:
crawler-config:
# Default configuration for StormCrawler
# This is used to make the default values explicit and list the most common configurations.
# Do not modify this file but instead provide a custom one with the parameter -config
# when launching your extension of ConfigurableTopology.
config:
fetcher.server.delay: 0.2
fetcher.server.min.delay: 0.0
fetcher.queue.mode: "byHost"
fetcher.threads.per.queue: 2
fetcher.threads.number: 200
fetcher.max.urls.in.queues: -1
# time bucket to use for the metrics sent by the Fetcher
fetcher.metrics.time.bucket.secs: 10
# alternative values are "byIP" and "byDomain"
partition.url.mode: "byHost"
# metadata to transfer to the outlinks
# used by Fetcher for redirections, sitemapparser, etc...
# these are also persisted for the parent document (see below)
# metadata.transfer:
# - customMetadataName
# lists the metadata to persist to storage
# these are not transferred to the outlinks
metadata.persist:
- _redirTo
- error.cause
- error.source
- isSitemap
- isFeed
metadata.track.path: true
metadata.track.depth: true
http.agent.name: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
http.agent.version: "1.0"
http.agent.description: "A StormCrawler-based crawler"
http.agent.url: "http://someorganization.com/"
http.agent.email: "someone@someorganization.com"
http.accept.language: "en-us,en-gb,en;q=0.7,*;q=0.3"
http.accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
http.content.limit: -1
http.store.responsetime: true
http.store.headers: false
http.timeout: 10000
http.robots.403.allow: true
parsefilters.config.file: "parsefilters.json"
urlfilters.config.file: "urlfilters.json"
# should the URLs be removed when a page is marked as noFollow
robots.noFollow.strict: false
protocols: "http,https"
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol"
# no url or parsefilters by default
# parsefilters.config.file: "parsefilters.json"
# urlfilters.config.file: "urlfilters.json"
# JSoupParserBolt
jsoup.treat.non.html.as.error: true
parser.emitOutlinks: true
track.anchors: true
detect.mimetype: true
detect.charset.maxlength: 2048
# whether the sitemap parser should try to
# determine whether a page is a sitemap based
# on its content if it is missing the K/V in the metadata
sitemap.sniffContent: false
# filters URLs in sitemaps based on their modified Date (if any)
sitemap.filter.hours.since.modified: -1
# whether to add any sitemaps found in the robots.txt to the status stream
# used by fetcher bolts. sitemap.sniffContent must be set to true if the
# discovery is enabled
sitemap.discovery: false
# Default implementation of Scheduler
scheduler.class: "com.digitalpebble.stormcrawler.persistence.DefaultScheduler"
# revisit a page daily (value in minutes)
# set it to -1 to never refetch a page
fetchInterval.default: -1
# revisit a page with a fetch error after 2 hours (value in minutes)
# set it to -1 to never refetch a page
fetchInterval.fetch.error: 120
# never revisit a page with an error (or set a value in minutes)
fetchInterval.error: -1
# custom fetch interval to be used when a document has the key/value in its metadata
# and has been fetched successfully (value in minutes)
# fetchInterval.isFeed=true: 10
# max number of successive fetch errors before changing status to ERROR
max.fetch.errors: 3
# Guava cache used by AbstractStatusUpdaterBolt for DISCOVERED URLs
status.updater.use.cache: true
status.updater.cache.spec: "maximumSize=10000,expireAfterAccess=1h"
# configuration for the classes extending AbstractIndexerBolt
# indexer.md.filter: "someKey=aValue"
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
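With fetcher.queue.mode and partition.url.mode both set to "byHost", URLs are bucketed by hostname, so all URLs for one site end up in the same fetch queue. A toy sketch of that kind of bucketing, using only the JDK (this is an illustration, not StormCrawler's actual partitioner code; the class name is made up):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ByHostBuckets {
    // Group URLs by hostname, as a "byHost" partitioner would.
    static Map<String, List<String>> group(List<String> urls) {
        Map<String, List<String>> buckets = new HashMap<>();
        for (String u : urls) {
            String host = URI.create(u).getHost();
            buckets.computeIfAbsent(host, k -> new ArrayList<>()).add(u);
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> b = group(Arrays.asList(
                "http://www.sporx.com/a", "http://www.sporx.com/b",
                "http://www.star.com.tr/c"));
        System.out.println(b.get("www.sporx.com").size()); // prints 2: both URLs share one queue
    }
}
```

Politeness settings like fetcher.server.delay and fetcher.threads.per.queue then apply per bucket, i.e. per host.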
and es-conf:
# configuration for Elasticsearch resources
config:
# ES indexer bolt
es.indexer.addresses: "localhost:9300"
es.indexer.index.name: "index"
es.indexer.doc.type: "doc"
es.indexer.create: false
es.indexer.settings:
cluster.name: "elasticsearch"
# ES metricsConsumer
es.metrics.addresses: "localhost:9300"
es.metrics.index.name: "metrics"
es.metrics.doc.type: "datapoint"
es.metrics.settings:
cluster.name: "elasticsearch"
# ES metrics whitelist. Only metrics in this list will be written to ES
# es.metrics.whitelist:
# - fetcher_counter
# - fetcher_average.bytes_fetched
# ES metrics blacklist. Never write these metrics to ES
# es.metrics.blacklist:
# - __receive.capacity
# - __receive.read_pos
# ES spout and persistence bolt
es.status.addresses: "localhost:9300"
es.status.index.name: "status"
es.status.doc.type: "status"
# the routing is done on the value of 'partition.url.mode'
es.status.routing: true
# stores the value used for the routing as a separate field
es.status.routing.fieldname: "metadata.hostname"
es.status.bulkActions: 500
es.status.flushInterval: "5s"
es.status.concurrentRequests: 1
es.status.settings:
cluster.name: "elasticsearch"
# used by spouts - time in secs for which the URLs will be considered for fetching after an ack or fail
es.status.ttl.purgatory: 30
# Min time (in msecs) to allow between 2 successive queries to ES
es.status.min.delay.queries: 2000
# ElasticSearchSpout
# ES Spout throttling. Uses the value of 'partition.url.mode' for the bucket key.
es.status.max.inflight.urls.per.bucket: -1
es.status.sort.field: "nextFetchDate"
# limits the deep paging by resetting the start offset for the ES query
es.status.max.secs.date: 100
# AggregationSpout
es.status.max.buckets: 50
es.status.max.urls.per.bucket: 2
# field to group the URLs into buckets
es.status.bucket.field: "_routing"
# field to sort the URLs within a bucket
es.status.bucket.sort.field: "nextFetchDate"
# field to sort the buckets
es.status.global.sort.field: "nextFetchDate"
topology.metrics.consumer.register:
- class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
parallelism.hint: 1
topology.workers: 1
topology.message.timeout.secs: 300
topology.max.spout.pending: 10
topology.debug: false
fetcher.max.urls.in.queues: -1
# mandatory when using Flux
topology.kryo.register:
- com.digitalpebble.stormcrawler.Metadata
# metadata to transfer to the outlinks
# used by Fetcher for redirections, sitemapparser, etc...
# these are also persisted for the parent document (see below)
# metadata.transfer:
# - customMetadataName
# lists the metadata to persist to storage
# these are not transferred to the outlinks
metadata.persist:
- _redirTo
- error.cause
- error.source
- isSitemap
- isFeed
#
# http.agent.name: "Anonymous Coward"
# http.agent.version: "1.0"
# http.agent.description: "A StormCrawler-based crawler"
# http.agent.url: "http://someorganization.com/"
# http.agent.email: "someone@someorganization.com"
#
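The es.status.min.delay.queries setting above enforces a minimum gap between two successive ES queries from a spout. A minimal sketch of that kind of throttle (illustration only under my own naming, not the actual spout code):

```java
public class QueryThrottle {
    private final long minDelayMs;   // e.g. es.status.min.delay.queries = 2000
    private Long lastQueryTime = null;

    QueryThrottle(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    // Returns true if enough time has passed since the last allowed query,
    // and records the query time; otherwise the caller must wait.
    synchronized boolean tryQuery(long nowMs) {
        if (lastQueryTime != null && nowMs - lastQueryTime < minDelayMs) {
            return false;
        }
        lastQueryTime = nowMs;
        return true;
    }

    public static void main(String[] args) {
        QueryThrottle t = new QueryThrottle(2000);
        System.out.println(t.tryQuery(0));    // true: first query is allowed
        System.out.println(t.tryQuery(1500)); // false: only 1.5s since last query
    }
}
```

A low value makes the spout hammer ES; a high value can starve the fetchers when the topology drains URLs faster than the spout refills them.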
Using https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/ESCrawlTopology.java? Make sure the number of shards is the same as what the ES init script specified.
topology.max.spout.pending: 10
is likely to be too low for 200 threads.
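A rough back-of-the-envelope check of why (the numbers are the ones from this thread, the helper is mine): max.spout.pending is applied per spout instance, and with routing enabled there is one spout instance per shard, so the total number of in-flight tuples is bounded by shards x pending.

```java
public class PendingCheck {
    // Upper bound on in-flight tuples:
    // spout instances (one per shard when routing) * max.spout.pending per instance.
    static int maxInFlight(int spoutInstances, int maxSpoutPending) {
        return spoutInstances * maxSpoutPending;
    }

    public static void main(String[] args) {
        int fetcherThreads = 200;
        // 10 shards * pending 10 = 100 tuples, not enough to keep 200 threads busy
        System.out.println(maxInFlight(10, 10) >= fetcherThreads);  // prints false
        // raising pending to 100 gives 1000 tuples in flight
        System.out.println(maxInFlight(10, 100) >= fetcherThreads); // prints true
    }
}
```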
The AggregationSpout or SamplerAggregationSpout is likely to give you better performance, especially as the index starts growing. I might change the example topology to use one of those.
Number of shards: in ESCrawlTopology.java:
TopologyBuilder builder = new TopologyBuilder();
int numWorkers = ConfUtils.getInt(getConf(), "topology.workers", 2);
int numFetchers = ConfUtils.getInt(getConf(), "fetcher.threads.number", 50);
// set to the real number of shards ONLY if es.status.routing is set to
// true in the configuration
int numShards = 10; // was 1; I changed it to 10
In the ES init script, the shard numbers are different:
# deletes and recreates a status index with a bespoke schema
curl -s -XDELETE 'http://localhost:9200/status/' > /dev/null
echo "Deleted status index"
# http://localhost:9200/status/_mapping/status?pretty
echo "Creating status index with mapping"
curl -s -XPOST localhost:9200/status -d '
{
"settings": {
"index": {
"number_of_shards": 10,
"number_of_replicas": 1,
"refresh_interval": "5s"
}
},
"mappings": {
"status": {
"dynamic_templates": [{
"metadata": {
"path_match": "metadata.*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}],
"_source": {
"enabled": true
},
"_all": {
"enabled": false
},
"properties": {
"nextFetchDate": {
"type": "date",
"format": "dateOptionalTime"
},
"status": {
"type": "string",
"index": "not_analyzed"
},
"url": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
# deletes and recreates the metrics index with a bespoke schema
curl -s -XDELETE 'http://localhost:9200/metrics*/' > /dev/null
echo ""
echo "Deleted metrics index"
echo "Creating metrics index with mapping"
# http://localhost:9200/metrics/_mapping/status?pretty
curl -s -XPOST localhost:9200/_template/storm-metrics-template -d '
{
"template": "metrics*",
"settings": {
"index": {
"number_of_shards": 1,
"refresh_interval": "5s"
},
"number_of_replicas" : 0
},
"mappings": {
"datapoint": {
"_all": { "enabled": false },
"_source": { "enabled": true },
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"srcComponentId": {
"type": "string",
"index": "not_analyzed"
},
"srcTaskId": {
"type": "long"
},
"srcWorkerHost": {
"type": "string",
"index": "not_analyzed"
},
"srcWorkerPort": {
"type": "long"
},
"timestamp": {
"type": "date",
"format": "dateOptionalTime"
},
"value": {
"type": "double"
}
}
}
}
}'
echo ""
For topology.max.spout.pending: I changed it to 100.
Basically, we currently crawl almost 30M URLs daily (including all parsing, metadata extraction, indexing, etc.).
First of all, we are trying to understand the advantages of storm-crawler, especially for speeding up the process.
the max.spout.pending value is per spout instance. You are using 10 which is also the # of shards, so it's all good.
Note that the screenshots you took are probably an average over the fetcher bolt instances - you have 50 of them. A better way of assessing the speed and bottlenecks is by looking at the Storm UI on port 8080 and in the resulting index.
Using the FetcherBolt (a single instance will do) will give you multithreading per host - which the SimpleFetcherBolt won't do - plus more intelligible metrics. As I pointed out earlier, you'll get better perfs with the AggregationSpout.
In any case, check the Storm UI and logs to get a better understanding of the perfs. SC gives you plenty of options for fine-tuning, and it takes a bit of time to get used to the way it works.
For the sake of comparison: I'll be publishing a blog post next week with a comparison with Apache Nutch on a single machine over 1K seed URLs. Don't want to spoil the suspense, but StormCrawler comes out on top ;-)
Closing for now - feel free to open a new issue if you find something which looks like a bug or want a new feature. For general questions, the mailing list or Stack Overflow would probably be better. Thanks!
Hi again,
As you suggested, I changed the fetcher to the FetcherBolt instead of the SimpleFetcherBolt in the Elasticsearch config. Now I get a lot of errors:
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
68648 [Thread-24-spout-executor[33 33]] INFO c.d.s.e.p.ElasticSearchSpout - ES query returned 100 hits in 67 msec
68653 [Thread-66-spout-executor[27 27]] INFO c.d.s.e.p.ElasticSearchSpout - ES query returned 100 hits in 72 msec
68653 [Thread-48-fetch-executor[13 13]] INFO c.d.s.b.FetcherBolt - [Fetcher #13] Threads : 5 queues : 4 in_queues : 1090
68654 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 207
68654 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1045
68654 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1046
68655 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 208
68660 [FetcherThread] ERROR c.d.s.b.FetcherBolt - Exception while fetching http://www.ajansspor.com/futbol/takim/bate_borisov/
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[httpclient-4.4.1.jar:4.4.1]
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:148) ~[classes/:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [classes/:?]
68660 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 209
68662 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1047
68662 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1048
68663 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 210
68663 [Thread-48-fetch-executor[13 13]] INFO c.d.s.b.FetcherBolt - [Fetcher #13] Threads : 5 queues : 4 in_queues : 1091
68663 [Thread-48-fetch-executor[13 13]] INFO c.d.s.b.FetcherBolt - [Fetcher #13] Threads : 5 queues : 4 in_queues : 1092
68666 [Thread-68-spout-executor[29 29]] INFO c.d.s.e.p.ElasticSearchSpout - ES query returned 100 hits in 82 msec
68672 [FetcherThread] INFO c.d.s.b.FetcherBolt - [Fetcher #15] Fetched http://www.sporx.com/motorsporlari/diger/ with status 200 in msec 290
68672 [FetcherThread] ERROR c.d.s.b.FetcherBolt - Exception while fetching http://www.sporx.com/rio2016/_assets/ajax/branslar.php?id=35
org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[httpclient-4.4.1.jar:4.4.1]
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:148) ~[classes/:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [classes/:?]
68673 [FetcherThread] ERROR c.d.s.b.FetcherBolt - Exception while fetching http://www.star.com.tr/teog-sinav-bilgisi/
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[httpclient-4.4.1.jar:4.4.1]
    ... (same stack trace as the ConnectionPoolTimeoutException above)
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [classes/:?]
You should use only 1 instance of the FetcherBolt per worker - they all compete for connections from the protocol. The difference between the SimpleFetcherBolt and the FetcherBolt is that the latter is a single instance creating sub-fetching threads, whereas the SFB instances are themselves the fetching threads.
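The point above is that the FetcherBolt keeps internal per-host queues and caps concurrent fetches per queue (fetcher.threads.per.queue), so one instance can still be polite per host while running many threads. A toy model of that admission check (class and method names are made up for illustration; this is not the real FetcherBolt code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerHostLimiter {
    private final int maxPerHost; // e.g. fetcher.threads.per.queue = 2
    private final Map<String, AtomicInteger> inUse = new ConcurrentHashMap<>();

    PerHostLimiter(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    // A fetcher thread asks permission before fetching from a host.
    boolean acquire(String host) {
        AtomicInteger n = inUse.computeIfAbsent(host, k -> new AtomicInteger());
        if (n.incrementAndGet() > maxPerHost) {
            n.decrementAndGet();
            return false; // cap reached: pick a different host's queue
        }
        return true;
    }

    // Called when the fetch for that host completes.
    void release(String host) {
        inUse.get(host).decrementAndGet();
    }

    public static void main(String[] args) {
        PerHostLimiter limiter = new PerHostLimiter(2);
        System.out.println(limiter.acquire("www.sporx.com")); // true
        System.out.println(limiter.acquire("www.sporx.com")); // true
        System.out.println(limiter.acquire("www.sporx.com")); // false: cap of 2 reached
    }
}
```

Running several FetcherBolt instances in the same worker defeats this: each instance has its own queues and its own view of the limit, and they all share one HTTP connection pool, which is consistent with the pool-timeout errors above.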
Hi,
This is probably a configuration issue, but I couldn't find anywhere else to ask.
I can't get full throughput out of storm-crawler, even though I have plenty of bandwidth and resources.
I seeded 400 URLs (only 80 of them ended up in ES; I don't know why), and my settings are:
fetcher.server.delay: 0.2
fetcher.server.min.delay: 0.0
fetcher.queue.mode: "byHost"
fetcher.threads.per.queue: 2
fetcher.threads.number: 200
fetcher.max.urls.in.queues: -1
The depth is 3 as well.
When I look, I don't see much bandwidth usage. What other options are there to get 100% of the speed and power of storm-crawler? I am testing locally for now, with more than enough resources.
Is there any config that I missed?