amazon-archives / kinesis-storm-spout

Kinesis spout for Storm

How to increase the number of actual working spouts? #3

Open Sunoyon opened 10 years ago

Sunoyon commented 10 years ago

Hi, I've used the following configuration, but the throughput is lower than expected.

Kinesis side:
  Number of shards: 5
  Around 500,000 rows in the Kinesis stream.

My topology:
  builder.setSpout("kinesis_spout", spout, 10);
  BoltDeclarer bolt = builder.setBolt("redis", new KinesisSplitLog(), 50).setNumTasks(200);
  bolt.shuffleGrouping("kinesis_spout");

and the configuration:
  conf.setNumAckers(0);
  conf.setDebug(false);
  conf.setNumWorkers(5);
  conf.setMaxSpoutPending(5000);

I used TRIM_HORIZON

Result:

  1. Only 3 (sometimes 2) spouts fetch records from Kinesis, though the stream has 5 shards.
  2. The processing rate is ~2000 records/sec during the first minute, then gradually drops to 200-300 records/sec after 2-3 minutes.

My questions are:

  1. How can all 5 spouts fetch records concurrently?
  2. I need to process 10,000 records/sec. What topology configuration, and how many spout and bolt threads, would I need to reach this rate?

Thanks in advance Sunoyon

Sunoyon commented 10 years ago

All 5 spouts in my topology can now fetch records concurrently. I had misunderstood how partition keys relate to the shards of a Kinesis stream.

The following link says that, "From a design standpoint, to ensure that all your shards are well utilized, the number of shards (specified by the setShardCount() method of CreateStreamRequest) should be substantially less than the number of unique partition keys, and the amount of data flowing to a single partition key should be substantially less than the capacity of a shard."

http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-api-java.html#kinesis-using-api-defn-partition-key

So I uploaded data with 30 partition keys, and records are now available in 9 shards, so 9 spouts can fetch records concurrently. The current rate is quite good (5000 records/sec).
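To illustrate why the number of distinct partition keys limits how many shards receive data, here is a minimal sketch. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The class and method names, the `key-N` partition keys, and the assumption that the shards split the hash space into equal contiguous ranges are all illustrative and are not taken from this thread.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: estimates which shards a set of partition keys would reach.
class PartitionKeySketch {

    // MD5 of the partition key, read as an unsigned 128-bit integer.
    static BigInteger hashKey(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption: shardCount shards own equal-sized, contiguous hash key ranges.
    static int shardIndex(String partitionKey, int shardCount) {
        BigInteger hashSpace = BigInteger.ONE.shiftLeft(128);
        return hashKey(partitionKey)
                .multiply(BigInteger.valueOf(shardCount))
                .divide(hashSpace)
                .intValue();
    }

    // Distinct shard indexes reached by the keys "key-0" .. "key-(keyCount-1)".
    static Set<Integer> shardsHit(int keyCount, int shardCount) {
        Set<Integer> hit = new TreeSet<>();
        for (int i = 0; i < keyCount; i++) {
            hit.add(shardIndex("key-" + i, shardCount));
        }
        return hit;
    }

    public static void main(String[] args) {
        System.out.println("1 key   -> shards " + shardsHit(1, 10));
        System.out.println("30 keys -> shards " + shardsHit(30, 10));
    }
}
```

A single partition key always lands on a single shard, so only one spout task has anything to read. With more keys than shards, most shards tend to receive data, although a few can still stay empty by chance.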

Could you please tell me whether my understanding of partition keys and shards is correct?

Thanks Sunoyon

gauravgh commented 10 years ago

Hi Sunoyon,

Yes, your approach above of picking partition keys such that lots of them map to the same shard is correct. Regarding your earlier question about reaching 10,000 records per second, you'll want to provision the appropriate number of shards. Please see http://docs.aws.amazon.com/kinesis/latest/dev/service-sizes-and-limits.html for information about current Amazon Kinesis sizes and limits.

Please see http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html for information on how to request a limit increase in case you'd like to use more shards than the current default limit.
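As a rough sizing aid, the sketch below estimates a shard count from a target record rate. The per-shard limits hard-coded here (1,000 records/sec and 1 MB/sec for writes, 2 MB/sec for reads, with 1 MB taken as 1024 * 1024 bytes) are assumptions that should be checked against the limits page linked above, and the average record size is a hypothetical input that does not appear in this thread.

```java
// Hypothetical helper: estimates shards needed for a target throughput.
class ShardSizingSketch {

    // Assumed per-shard limits; verify against the current Kinesis limits page.
    static final long WRITE_RECORDS_PER_SEC = 1_000L;
    static final long WRITE_BYTES_PER_SEC = 1024L * 1024L;
    static final long READ_BYTES_PER_SEC = 2L * 1024L * 1024L;

    // Integer division rounded up, for non-negative a and positive b.
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // The shard count is driven by whichever limit is hit first.
    static long shardsNeeded(long recordsPerSec, long avgRecordBytes) {
        long bytesPerSec = recordsPerSec * avgRecordBytes;
        long byWriteCount = ceilDiv(recordsPerSec, WRITE_RECORDS_PER_SEC);
        long byWriteBytes = ceilDiv(bytesPerSec, WRITE_BYTES_PER_SEC);
        long byReadBytes = ceilDiv(bytesPerSec, READ_BYTES_PER_SEC);
        return Math.max(byWriteCount, Math.max(byWriteBytes, byReadBytes));
    }

    public static void main(String[] args) {
        // 10,000 records/sec with an assumed average record size of 100 bytes.
        System.out.println(shardsNeeded(10_000, 100));
        // The same rate with an assumed average record size of 2048 bytes.
        System.out.println(shardsNeeded(10_000, 2048));
    }
}
```

Under these assumptions, small records are bounded by the per-shard record count, while larger records are bounded by the per-shard write bandwidth.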

Sincerely, Gaurav

Sunoyon commented 10 years ago

Hi Gaurav, thanks for your help. I am currently using a stream with 10 shards. I always get data from 10 spouts (even when I tried with 100 spouts). Your link says, "Each shard can support up to 5 read transactions per second up to a maximum total of 2 MB of data read per second." My questions are:

  1. How can I know whether all shards have data?
  2. If I can use 10 spouts, does that mean only one spout can connect per shard?
  3. How can I use more than 10 spouts?

Thanks Sunoyon

gauravgh commented 10 years ago

Hi Sunoyon,

  1. How can I know whether all shards have data? On the ingestion side, we return the shardId as part of the PutRecord response. You can use it to monitor whether you are putting data into all shards. On the processing side, i.e. in the spout, you can monitor the checkpoints in ZooKeeper to see whether checkpoints for all shards are being updated. You can also monitor the spout logs.
  2. If I can use 10 spouts, does that mean only one spout can connect per shard? The spout will read data in sequence from each shard. If you have 10 spout tasks, the 10 shards will be split among them.
  3. How can I use more than 10 spouts? You can use more shards. Note that Kinesis returns a batch of records for each GetRecords request. Can you tell us more about your use case? That'll help us better answer your question.
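The way shards are split among spout tasks can be pictured with a small sketch. This is an illustrative round-robin assignment, not the spout's actual assignment code; the class name, method names, and shard IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting shards among spout tasks.
class ShardAssignmentSketch {

    // Round-robin: shard i goes to task (i mod taskCount); each shard has one reader.
    static List<List<String>> assign(List<String> shardIds, int taskCount) {
        List<List<String>> perTask = new ArrayList<>();
        for (int t = 0; t < taskCount; t++) {
            perTask.add(new ArrayList<>());
        }
        for (int i = 0; i < shardIds.size(); i++) {
            perTask.get(i % taskCount).add(shardIds.get(i));
        }
        return perTask;
    }

    // Number of tasks that were given at least one shard.
    static int activeTasks(List<List<String>> perTask) {
        int active = 0;
        for (List<String> shards : perTask) {
            if (!shards.isEmpty()) {
                active++;
            }
        }
        return active;
    }

    public static void main(String[] args) {
        List<String> shardIds = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            shardIds.add("shard-" + i);
        }
        System.out.println("tasks with work (100 tasks): " + activeTasks(assign(shardIds, 100)));
        System.out.println("tasks with work (3 tasks):   " + activeTasks(assign(shardIds, 3)));
    }
}
```

When each shard is read by only one task, tasks beyond the shard count have nothing to read, which matches the observation that only 10 of 100 spouts receive data from a 10-shard stream.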

I'd recommend starting a thread on our forum (https://forums.aws.amazon.com/forum.jspa?forumID=169). We're better able to answer application design questions via the forums.

Thank you, Gaurav