apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Slow fetching - #396

Closed MyraBaba closed 7 years ago

MyraBaba commented 7 years ago

Hi ,

This is obviously a configuration issue, but I couldn't find anywhere else to ask.

I can't get full throttle out of StormCrawler, even though I have plenty of bandwidth and resources.

I seeded 400 URLs (only 80 of them ended up in ES, I don't know why), with:

fetcher.server.delay: 0.2
fetcher.server.min.delay: 0.0
fetcher.queue.mode: "byHost"
fetcher.threads.per.queue: 2
fetcher.threads.number: 200
fetcher.max.urls.in.queues: -1

The depth is 3 as well.

When I look, I don't see much bandwidth usage. What other options are there to get 100% of the speed and power of StormCrawler? I am testing locally for now, with more than enough resources.

Is there any config option that I missed?

jnioche commented 7 years ago

> I seeded 400 URLs (only 80 of them ended up in ES, I don't know why)

Try with a larger TTL value if you use one
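A minimal sketch, assuming the TTL referred to here is the spout purgatory TTL that appears in the ES config later in this thread (the value is illustrative only, not a recommendation):

  # time in secs for which URLs are ignored by the spout after an ack or a fail
  es.status.ttl.purgatory: 60   # assumption: raised from the 30 shown later in this thread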

Can you describe your topology and share your full config, including the ES one?

MyraBaba commented 7 years ago

Here are the config files from the elasticsearch example folder:

crawler-config:

# Default configuration for StormCrawler
# This is used to make the default values explicit and list the most common configurations.
# Do not modify this file but instead provide a custom one with the parameter -config
# when launching your extension of ConfigurableTopology.

config:
  fetcher.server.delay: 0.2
  fetcher.server.min.delay: 0.0
  fetcher.queue.mode: "byHost"
  fetcher.threads.per.queue: 2
  fetcher.threads.number: 200
  fetcher.max.urls.in.queues: -1

  # time bucket to use for the metrics sent by the Fetcher
  fetcher.metrics.time.bucket.secs: 10

  # alternative values are "byIP" and "byDomain"
  partition.url.mode: "byHost"

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - customMetadataName

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed

  metadata.track.path: true
  metadata.track.depth: true

  http.agent.name: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
  http.agent.version: "1.0"
  http.agent.description: "A StormCrawler-based crawler"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "someone@someorganization.com"

  http.accept.language: "en-us,en-gb,en;q=0.7,*;q=0.3"
  http.accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  http.content.limit: -1
  http.store.responsetime: true
  http.store.headers: false
  http.timeout: 10000

  http.robots.403.allow: true

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # should the URLs be removed when a page is marked as noFollow
  robots.noFollow.strict: false

  protocols: "http,https"
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol"

  # JSoupParserBolt
  jsoup.treat.non.html.as.error: true
  parser.emitOutlinks: true
  track.anchors: true
  detect.mimetype: true
  detect.charset.maxlength: 2048

  # whether the sitemap parser should try to
  # determine whether a page is a sitemap based
  # on its content if it is missing the K/V in the metadata
  sitemap.sniffContent: false

  # filters URLs in sitemaps based on their modified Date (if any)
  sitemap.filter.hours.since.modified: -1

  # whether to add any sitemaps found in the robots.txt to the status stream
  # used by fetcher bolts. sitemap.sniffContent must be set to true if the
  # discovery is enabled
  sitemap.discovery: false

  # Default implementation of Scheduler
  scheduler.class: "com.digitalpebble.stormcrawler.persistence.DefaultScheduler"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: -1

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1
  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.isFeed=true: 10

  # max number of successive fetch errors before changing status to ERROR
  max.fetch.errors: 3

  # Guava cache used by AbstractStatusUpdaterBolt for DISCOVERED URLs
  status.updater.use.cache: true
  status.updater.cache.spec: "maximumSize=10000,expireAfterAccess=1h"

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description

And the es-conf:


# configuration for Elasticsearch resources

config:
  # ES indexer bolt
  es.indexer.addresses: "localhost:9300"
  es.indexer.index.name: "index"
  es.indexer.doc.type: "doc"
  es.indexer.create: false
  es.indexer.settings:
    cluster.name: "elasticsearch"

  # ES metricsConsumer
  es.metrics.addresses: "localhost:9300"
  es.metrics.index.name: "metrics"
  es.metrics.doc.type: "datapoint"
  es.metrics.settings:
    cluster.name: "elasticsearch"

  # ES metrics whitelist. Only metrics in this list will be written to ES
  # es.metrics.whitelist:
  # - fetcher_counter
  # - fetcher_average.bytes_fetched

  # ES metrics blacklist. Never write these metrics to ES
  # es.metrics.blacklist:
  # - __receive.capacity
  # - __receive.read_pos

  # ES spout and persistence bolt
  es.status.addresses: "localhost:9300"
  es.status.index.name: "status"
  es.status.doc.type: "status"
  # the routing is done on the value of 'partition.url.mode'
  es.status.routing: true
  # stores the value used for the routing as a separate field
  es.status.routing.fieldname: "metadata.hostname"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  es.status.settings:
    cluster.name: "elasticsearch"

  # used by spouts - time in secs for which the URLs will be considered for fetching after an ack or a fail
  es.status.ttl.purgatory: 30

  # Min time (in msecs) to allow between 2 successive queries to ES
  es.status.min.delay.queries: 2000

  # ElasticSearchSpout
  # ES Spout throttling. Uses the value of 'partition.url.mode' for the bucket key.
  es.status.max.inflight.urls.per.bucket: -1
  es.status.sort.field: "nextFetchDate"
  # limits the deep paging by resetting the start offset for the ES query 
  es.status.max.secs.date: 100

  # AggregationSpout
  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "_routing"
  # field to sort the URLs within a bucket
  es.status.bucket.sort.field: "nextFetchDate"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
         parallelism.hint: 1
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 10
  topology.debug: false
  fetcher.max.urls.in.queues: -1

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - customMetadataName

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed

#
#  http.agent.name: "Anonymous Coward"
#  http.agent.version: "1.0"
#  http.agent.description: "A StormCrawler-based crawler"
#  http.agent.url: "http://someorganization.com/"
#  http.agent.email: "someone@someorganization.com"
#
jnioche commented 7 years ago

Are you using https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/ESCrawlTopology.java? Make sure the number of shards is the same as what the ES init script specifies.

topology.max.spout.pending: 10 is likely to be too low for 200 threads.

The AggregationSpout or SamplerAggregationSpout is likely to give you better performance, especially as the index starts growing. I might change the example topology to that.
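For illustration, a minimal sketch of both suggestions, assuming a Flux-based topology definition (the class name comes from the elasticsearch module of this repo; the parallelism and pending values are illustrative, not recommendations):

  # crawler config: raise the per-spout cap on in-flight tuples
  topology.max.spout.pending: 100

  # Flux topology snippet: swap the spout implementation for the AggregationSpout
  spouts:
    - id: "spout"
      className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
      parallelism: 10   # typically one instance per shard when es.status.routing is true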

MyraBaba commented 7 years ago

Number of shards:

In ESCrawlTopology.java:

TopologyBuilder builder = new TopologyBuilder();

int numWorkers = ConfUtils.getInt(getConf(), "topology.workers", 2);

int numFetchers = ConfUtils.getInt(getConf(), "fetcher.threads.number",
        50);

// set to the real number of shards ONLY if es.status.routing is set to
// true in the configuration
int numShards = 10; // it was 1, I changed it to 10

In the ES init script there are different shard numbers:

# deletes and recreates a status index with a bespoke schema

curl -s -XDELETE 'http://localhost:9200/status/' >  /dev/null

echo "Deleted status index"

# http://localhost:9200/status/_mapping/status?pretty

echo "Creating status index with mapping"

curl -s -XPOST localhost:9200/status -d '
{
   "settings": {
      "index": {
         "number_of_shards": 10,
         "number_of_replicas": 1,
         "refresh_interval": "5s"
      }
   },
   "mappings": {
      "status": {
         "dynamic_templates": [{
            "metadata": {
               "path_match": "metadata.*",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "index": "not_analyzed"
               }
            }
         }],
         "_source": {
            "enabled": true
         },
         "_all": {
            "enabled": false
         },
         "properties": {
            "nextFetchDate": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "status": {
               "type": "string",
               "index": "not_analyzed"
            },
            "url": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}'

# deletes the metrics indices and recreates the metrics template

curl -s -XDELETE 'http://localhost:9200/metrics*/' >  /dev/null

echo ""
echo "Deleted metrics index"

echo "Creating metrics index with mapping"

# http://localhost:9200/metrics/_mapping/status?pretty
curl -s -XPOST localhost:9200/_template/storm-metrics-template -d '
{
  "template": "metrics*",
  "settings": {
    "index": {
      "number_of_shards": 1,
      "refresh_interval": "5s"
    },
    "number_of_replicas" : 0
  },
  "mappings": {
    "datapoint": {
      "_all":            { "enabled": false },
      "_source":         { "enabled": true },
      "properties": {
          "name": {
            "type": "string",
            "index": "not_analyzed"
          },
          "srcComponentId": {
            "type": "string",
            "index": "not_analyzed"
          },
          "srcTaskId": {
            "type": "long"
          },
          "srcWorkerHost": {
            "type": "string",
            "index": "not_analyzed"
          },
          "srcWorkerPort": {
            "type": "long"
          },
          "timestamp": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "value": {
            "type": "double"
          }
      }
    }
  }
}'

echo ""

For topology.max.spout.pending: I changed it to 100.

Basically, we currently crawl almost 30M URLs daily (including all the parsing, metadata extraction, indexing, etc.).

First of all, we are trying to understand the advantages of storm-crawler, especially for speeding up the process.


jnioche commented 7 years ago

The max.spout.pending value is per spout instance. You are using 10, which is also the number of shards, so it's all good.
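(As a quick worked example: since the setting is per spout task, the topology-wide ceiling on unacked tuples is max.spout.pending multiplied by the number of spout instances; with one spout per shard here, that is 10 instances.)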

Note that the screenshots you took are probably an average over the fetcher bolt instances - you have 50 of them. A better way of assessing the speed and bottlenecks is by looking at the Storm UI on port 8080 and in the resulting index.

Using the FetcherBolt (a single instance will do) will give you multithreading per host, which the SimpleFetcherBolt won't do, plus more intelligible metrics. As I pointed out earlier, you'll get better performance with the AggregationSpout.
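A minimal sketch of that wiring in Flux terms (assuming a Flux topology; the parallelism value follows the advice above, and the thread count comes from the existing fetcher.threads.number setting):

  bolts:
    - id: "fetch"
      className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
      parallelism: 1   # a single instance; it spawns fetcher.threads.number internal threads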

In any case, check the Storm UI and logs to get a better understanding of the performance. SC gives you plenty of options for fine-tuning, and it takes a bit of time to get used to the way it works.

For the sake of comparison: I'll be publishing a blog post next week with a comparison with Apache Nutch on a single machine over 1K seed URLs. I don't want to spoil the suspense, but StormCrawler comes out on top ;-)

jnioche commented 7 years ago

Closing for now - feel free to open a new issue if you find something which looks like a bug or want a new feature. For general questions, the mailing list or Stack Overflow would probably be better. Thanks!

MyraBaba commented 7 years ago

Hi Again,

As you suggested, I changed the fetcher to the FetcherBolt instead of the SimpleFetcherBolt in the Elasticsearch example. Now I get a lot of errors:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool


68648 [Thread-24-spout-executor[33 33]] INFO c.d.s.e.p.ElasticSearchSpout - ES query returned 100 hits in 67 msec
68653 [Thread-66-spout-executor[27 27]] INFO c.d.s.e.p.ElasticSearchSpout - ES query returned 100 hits in 72 msec
68653 [Thread-48-fetch-executor[13 13]] INFO c.d.s.b.FetcherBolt - [Fetcher #13] Threads : 5 queues : 4 in_queues : 1090
68654 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 207
68654 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1045
68654 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1046
68655 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 208
68660 [FetcherThread] ERROR c.d.s.b.FetcherBolt - Exception while fetching http://www.ajansspor.com/futbol/takim/bate_borisov/
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[httpclient-4.4.1.jar:4.4.1]
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:148) ~[classes/:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [classes/:?]
68660 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 209
68662 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1047
68662 [Thread-50-fetch-executor[6 6]] INFO c.d.s.b.FetcherBolt - [Fetcher #6] Threads : 5 queues : 9 in_queues : 1048
68663 [Thread-15-fetch-executor[7 7]] INFO c.d.s.b.FetcherBolt - [Fetcher #7] Threads : 3 queues : 2 in_queues : 210
68663 [Thread-48-fetch-executor[13 13]] INFO c.d.s.b.FetcherBolt - [Fetcher #13] Threads : 5 queues : 4 in_queues : 1091
68663 [Thread-48-fetch-executor[13 13]] INFO c.d.s.b.FetcherBolt - [Fetcher #13] Threads : 5 queues : 4 in_queues : 1092
68666 [Thread-68-spout-executor[29 29]] INFO c.d.s.e.p.ElasticSearchSpout - ES query returned 100 hits in 82 msec
68672 [FetcherThread] INFO c.d.s.b.FetcherBolt - [Fetcher #15] Fetched http://www.sporx.com/motorsporlari/diger/ with status 200 in msec 290
68672 [FetcherThread] ERROR c.d.s.b.FetcherBolt - Exception while fetching http://www.sporx.com/rio2016/_assets/ajax/branslar.php?id=35
org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124) ~[httpcore-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[httpclient-4.4.1.jar:4.4.1]
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:148) ~[classes/:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [classes/:?]
68673 [FetcherThread] ERROR c.d.s.b.FetcherBolt - Exception while fetching http://www.star.com.tr/teog-sinav-bilgisi/
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[httpclient-4.4.1.jar:4.4.1]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[httpclient-4.4.1.jar:4.4.1]
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:148) ~[classes/:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [classes/:?]



jnioche commented 7 years ago

You should use only 1 instance of the FetcherBolt per worker - otherwise they are all competing for connections from the protocol. The difference between the SimpleFetcherBolt and the FetcherBolt is that the latter is a single instance creating sub-fetching threads, whereas the SFB instances are themselves the fetching threads.
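To make the contrast concrete, a hedged Flux-style sketch (values illustrative):

  # FetcherBolt: one instance per worker, threading handled internally
  bolts:
    - id: "fetch"
      className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
      parallelism: 1   # internal threads come from fetcher.threads.number

  # SimpleFetcherBolt: the parallelism itself is the number of fetching threads
  # bolts:
  #   - id: "fetch"
  #     className: "com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt"
  #     parallelism: 50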