awslabs / amazon-kinesis-producer

Amazon Kinesis Producer Library
Apache License 2.0
398 stars 331 forks source link

Stream socket continuously resets (EOF) in low traffic environment #97

Open dandesousa opened 7 years ago

dandesousa commented 7 years ago

We've noticed a lot of noise in production lately and a special case of some issue that might have presented with some others (https://github.com/awslabs/amazon-kinesis-producer/issues/17).

It appears as though there are is some odd behavior in the KPL when dealing with steady, low traffic streams, the KPL constantly emitting following error:

[2017-03-29 15:31:12.719017] [0x00007fa00f7d7700] [error] [io_service_socket.h:229] Error during socket read: End of file; 0 bytes read so far (kinesis.us-east-1.amazonaws.com:443)

There are no other errors in the log, just the above over and over.

For clarity around this behavior this I ran the KPL with aggregation disabled, such that one record sent to the KPL = one record in CloudWatch metrics. The stream has only one shard, and all records go to that shard.

I start by pushing a single record (~1k in size) every 10 seconds, producing approximately 6 records per minute. When doing so, the KPL handles it find, we see it in the metrics and there are no errors.

screen shot 2017-03-29 at 11 26 54 am

Then, I send 20 records each minute. When doing so, the KPL steadily throws socket end of file errors from the kinesis endpoint around every 10-15 seconds or so. This continues as long as the traffic remains steady in that range.

screen shot 2017-03-29 at 11 27 08 am

Then, if I increase the traffic to 60 rpm, the errors disappear entirely.

screen shot 2017-03-29 at 11 27 18 am

Finally I can taper down the requests and the behavior presents itself again (full timeline back to 20 rpm):

screen shot 2017-03-29 at 11 27 24 am

This behavior is present if we aggregate records or not, is reproducible across multiple different streams, regardless of the number of shards.

No obvious data seems to be getting lost here. But it disrupts our ability to detect true errors from the KPL and generates a ton of noise.

I’ve iterated on every relevant looking KPL setting to try and get this to go away to no avail (tweaked timeouts, record aggregation, etc).

KPL version 0.10.2 was used to do the test.

Any idea what is going on here?

Kinesis settings below:

# You can load a properties file with
# KinesisProducerConfiguration.fromPropertiesFile(String)
#
# Any fields not found in the properties file will take on default values.
#
# The values loaded are checked against any constraints that each respective
# field may have. If there are invalid values an IllegalArgumentException will
# be thrown.

# Enable aggregation. With aggregation, multiple user records are packed into
# a single KinesisRecord. If disabled, each user record is sent in its own
# KinesisRecord.
#
# If your records are small, enabling aggregation will allow you to put many
# more records than you would otherwise be able to for a shard before getting
# throttled.
#
# Default: true
AggregationEnabled = false

# Maximum number of items to pack into an aggregated record.
#
# There should be normally no need to adjust this. If you want to limit the
# time records spend buffering, look into record_max_buffered_time instead.
#
# Default: 4294967295
# Minimum: 1
# Maximum (inclusive): 9223372036854775807
AggregationMaxCount = 4294967295

# Maximum number of bytes to pack into an aggregated Kinesis record.
#
# There should be normally no need to adjust this. If you want to limit the
# time records spend buffering, look into record_max_buffered_time instead.
#
# If a record has more data by itself than this limit, it will bypass the
# aggregator. Note the backend enforces a limit of 50KB on record size. If
# you set this beyond 50KB, oversize records will be rejected at the backend.
#
# Default: 51200
# Minimum: 64
# Maximum (inclusive): 1048576
AggregationMaxSize = 51200

# Maximum number of items to pack into an PutRecords request.
#
# There should be normally no need to adjust this. If you want to limit the
# time records spend buffering, look into record_max_buffered_time instead.
#
# Default: 500
# Minimum: 1
# Maximum (inclusive): 500
CollectionMaxCount = 500

# Maximum amount of data to send with a PutRecords request.
#
# There should be normally no need to adjust this. If you want to limit the
# time records spend buffering, look into record_max_buffered_time instead.
#
# Records larger than the limit will still be sent, but will not be grouped
# with others.
#
# Default: 5242880
# Minimum: 52224
# Maximum (inclusive): 9223372036854775807
CollectionMaxSize = 5242880

# Timeout (milliseconds) for establishing TLS connections.
#
# Default: 6000
# Minimum: 100
# Maximum (inclusive): 300000
ConnectTimeout = 6000

# Use a custom Kinesis and CloudWatch endpoint.
#
# Mostly for testing use. Note this does not accept protocols or paths, only
# host names or ip addresses. There is no way to disable TLS. The KPL always
# connects with TLS.
#
# Expected pattern: ^([A-Za-z0-9-\\.]+)?$
# CustomEndpoint =

# If true, throttled puts are not retried. The records that got throttled
# will be failed immediately upon receiving the throttling error. This is
# useful if you want to react immediately to any throttling without waiting
# for the KPL to retry. For example, you can use a different hash key to send
# the throttled record to a backup shard.
#
# If false, the KPL will automatically retry throttled puts. The KPL performs
# backoff for shards that it has received throttling errors from, and will
# avoid flooding them with retries. Note that records may fail from
# expiration (see record_ttl) if they get delayed for too long because of
# throttling.
#
# Default: false
FailIfThrottled = false

# Minimum level of logs. Messages below the specified level will not be
# logged. Logs for the native KPL daemon show up on stderr.
#
# Default: info
# Expected pattern: info|warning|error
LogLevel = info

# Maximum number of connections to open to the backend. HTTP requests are
# sent in parallel over multiple connections.
#
# Setting this too high may impact latency and consume additional resources
# without increasing throughput.
#
# Default: 4
# Minimum: 1
# Maximum (inclusive): 128
MaxConnections = 16

# Controls the granularity of metrics that are uploaded to CloudWatch.
# Greater granularity produces more metrics.
#
# When "shard" is selected, metrics are emitted with the stream name and
# shard id as dimensions. On top of this, the same metric is also emitted
# with only the stream name dimension, and lastly, without the stream name.
# This means for a particular metric, 2 streams with 2 shards (each) will
# produce 7 CloudWatch metrics, one for each shard, one for each stream, and
# one overall, all describing the same statistics, but at different levels of
# granularity.
#
# When "stream" is selected, per shard metrics are not uploaded; when
# "global" is selected, only the total aggregate for all streams and all
# shards are uploaded.
#
# Consider reducing the granularity if you're not interested in shard-level
# metrics, or if you have a large number of shards.
#
# If you only have 1 stream, select "global"; the global data will be
# equivalent to that for the stream.
#
# Refer to the metrics documentation for details about each metric.
#
# Default: shard
# Expected pattern: global|stream|shard
MetricsGranularity = stream

# Controls the number of metrics that are uploaded to CloudWatch.
#
# "none" disables all metrics.
#
# "summary" enables the following metrics: UserRecordsPut, KinesisRecordsPut,
# ErrorsByCode, AllErrors, BufferingTime.
#
# "detailed" enables all remaining metrics.
#
# Refer to the metrics documentation for details about each metric.
#
# Default: detailed
# Expected pattern: none|summary|detailed
MetricsLevel = summary

# The namespace to upload metrics under.
#
# If you have multiple applications running the KPL under the same AWS
# account, you should use a different namespace for each application.
#
# If you are also using the KCL, you may wish to use the application name you
# have configured for the KCL as the the namespace here. This way both your
# KPL and KCL metrics show up under the same namespace.
#
# Default: KinesisProducerLibrary
# Expected pattern: (?!AWS/).{1,255}
MetricsNamespace = KinesisProducerLibrary

# Delay (in milliseconds) between each metrics upload.
#
# For testing only. There is no benefit in setting this lower or higher in
# production.
#
# Default: 60000
# Minimum: 1
# Maximum (inclusive): 60000
MetricsUploadDelay = 60000

# Minimum number of connections to keep open to the backend.
#
# There should be no need to increase this in general.
#
# Default: 1
# Minimum: 1
# Maximum (inclusive): 16
MinConnections = 1

# Server port to connect to. Only useful with custom_endpoint.
#
# Default: 443
# Minimum: 1
# Maximum (inclusive): 65535
Port = 443

# Limits the maximum allowed put rate for a shard, as a percentage of the
# backend limits.
#
# The rate limit prevents the producer from sending data too fast to a shard.
# Such a limit is useful for reducing bandwidth and CPU cycle wastage from
# sending requests that we know are going to fail from throttling.
#
# Kinesis enforces limits on both the number of records and number of bytes
# per second. This setting applies to both.
#
# The default value of 150% is chosen to allow a single producer instance to
# completely saturate the allowance for a shard. This is an aggressive
# setting. If you prefer to reduce throttling errors rather than completely
# saturate the shard, consider reducing this setting.
#
# Default: 150
# Minimum: 1
# Maximum (inclusive): 9223372036854775807
RateLimit = 150

# Maximum amount of itme (milliseconds) a record may spend being buffered
# before it gets sent. Records may be sent sooner than this depending on the
# other buffering limits.
#
# This setting provides coarse ordering among records - any two records will
# be reordered by no more than twice this amount (assuming no failures and
# retries and equal network latency).
#
# The library makes a best effort to enforce this time, but cannot guarantee
# that it will be precisely met. In general, if the CPU is not overloaded,
# the library will meet this deadline to within 10ms.
#
# Failures and retries can additionally increase the amount of time records
# spend in the KPL. If your application cannot tolerate late records, use the
# record_ttl setting to drop records that do not get transmitted in time.
#
# Setting this too low can negatively impact throughput.
#
# Default: 100
# Maximum (inclusive): 9223372036854775807
RecordMaxBufferedTime = 5000

# Set a time-to-live on records (milliseconds). Records that do not get
# successfully put within the limit are failed.
#
# This setting is useful if your application cannot or does not wish to
# tolerate late records. Records will still incur network latency after they
# leave the KPL, so take that into consideration when choosing a value for
# this setting.
#
# If you do not wish to lose records and prefer to retry indefinitely, set
# record_ttl to a large value like INT_MAX. This has the potential to cause
# head-of-line blocking if network issues or throttling occur. You can
# respond to such situations by using the metrics reporting functions of the
# KPL. You may also set fail_if_throttled to true to prevent automatic
# retries in case of throttling.
#
# Default: 30000
# Minimum: 100
# Maximum (inclusive): 9223372036854775807
RecordTtl = 30000

# Which region to send records to.
#
# If you do not specify the region and are running in EC2, the library will
# use the region the instance is in.
#
# The region is also used to sign requests.
#
# Expected pattern: ^([a-z]+-[a-z]+-[0-9])?$
Region = us-east-1

# The maximum total time (milliseconds) elapsed between when we begin a HTTP
# request and receiving all of the response. If it goes over, the request
# will be timed-out.
#
# Note that a timed-out request may actually succeed at the backend. Retrying
# then leads to duplicates. Setting the timeout too low will therefore
# increase the probability of duplicates.
#
# Default: 6000
# Minimum: 100
# Maximum (inclusive): 600000
RequestTimeout = 6000

# Verify the endpoint's certificate. Do not disable unless using
# custom_endpoint for testing. Never disable this in production.
#
# Default: true
VerifyCertificate = true
pfifer commented 7 years ago

Low write rates make it more likely that the sockets the KPL is holding will be reset, which it only finds out when it goes to make request on the socket. Are you able to try the newer 0.12.x version of the KPL? It uses the C++ AWS SDK, which uses curl. curl may handle the connection reset's better than the 0.10.x socket management.