GoogleCloudPlatform / fluent-plugin-google-cloud

Plugin for Fluentd that sends logs to the Google Cloud Platform's log ingestion API.
Apache License 2.0

RESOURCE_EXHAUSTED: Maximum length exceeded #493

Open pieterjanpintens opened 2 years ago

pieterjanpintens commented 2 years ago

We are seeing errors like this in our Fluentd logs:

2022-09-18 21:50:00.944377087 +0000 fluent.warn: {"error":"3:Data decompression failed with decompression status: RESOURCE_EXHAUSTED: Maximum length exceeded: 10485760; at byte 755395; at uncompressed byte 10485760. debug_error_string:{\"created\":\"@1663537800.943751927\",\"description\":\"Error received from peer ipv4:216.239.34.174:443\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":905,\"grpc_message\":\"Data decompression failed with decompression status: RESOURCE_EXHAUSTED: Maximum length exceeded: 10485760; at byte 755395; at uncompressed byte 10485760\",\"grpc_status\":3}","error_code":"3","message":"Dropping 4805 log message(s) error=\"3:Data decompression failed with decompression status: RESOURCE_EXHAUSTED: Maximum length exceeded: 10485760; at byte 755395; at uncompressed byte 10485760. debug_error_string:{\\\"created\\\":\\\"@1663537800.943751927\\\",\\\"description\\\":\\\"Error received from peer ipv4:216.239.34.174:443\\\",\\\"file\\\":\\\"src/core/lib/surface/call.cc\\\",\\\"file_line\\\":905,\\\"grpc_message\\\":\\\"Data decompression failed with decompression status: RESOURCE_EXHAUSTED: Maximum length exceeded: 10485760; at byte 755395; at uncompressed byte 10485760\\\",\\\"grpc_status\\\":3}\" error_code=\"3\""}

Our setup is a batch-like system that processes big log files from S3. Our config is shown below. We tried setting buffer_chunk_limit low, but it does not help.

<match **>
    @type google_cloud
    @log_level debug
    # prevents errors in the logs; it will fail anyway
    use_metadata_service false
    label_map {
      "environment": "environment",
      "project": "project",
      "branch": "branch",
      "function": "function",
      "program": "program",
      "stream": "log"
    }
    # Set the chunk limit conservatively to avoid exceeding the recommended
    # chunk size of 10MB per write request. The API request size can be a few
    # times bigger than the raw log size.
    buffer_chunk_limit 512KB
    # Flush logs every 5 seconds, even if the buffer is not full.
    flush_interval 5s
    # Enforce some limit on the number of retries.
    disable_retry_limit false
    # After 3 retries, a given chunk will be discarded.
    retry_limit 3
    # Wait 10 seconds before the first retry. The wait interval will be doubled on
    # each following retry (20s, 40s...) until it hits the retry limit.
    retry_wait 10
    # Never wait longer than 5 minutes between retries. If the wait interval
    # reaches this limit, the exponentiation stops.
    # Given the default config, this limit should never be reached, but if
    # retry_limit and retry_wait are customized, this limit might take effect.
    max_retry_wait 300
    # Use multiple threads for processing.
    num_threads 8
    # Use the gRPC transport.
    use_grpc true
    # Try to limit the size of the uploaded data
    grpc_compression_algorithm gzip
    # If a request is a mix of valid log entries and invalid ones, ingest the
    # valid ones and drop the invalid ones instead of dropping everything.
    partial_success true
    <buffer>
      @type memory
      timekey 60
      timekey_wait 10
      overflow_action block
    </buffer>
</match>

Looking further down the line, it seems that you can specify a channel option on the gRPC channel: GRPC_ARG_MAX_SEND_MESSAGE_LENGTH. Reading about it, I wonder if setting this option would solve the problem? It is currently not exposed in the fluentd config. By default it is set to -1 (unlimited). I am not sure if gRPC would split the message or if it would just turn the server error into a client error...
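For reference, a minimal Ruby sketch of what passing that channel argument could look like, assuming a stub is built directly from the generated gRPC classes (the require path, stub class, and endpoint here are illustrative and may not match how the plugin actually wires up its client). Note that gRPC does not split oversized messages; a client-side cap would only turn the server-side rejection into a local RESOURCE_EXHAUSTED error raised before the request is sent.

require 'grpc'
require 'google/logging/v2/logging_services_pb'

# 'grpc.max_send_message_length' is the string form of
# GRPC_ARG_MAX_SEND_MESSAGE_LENGTH. The client-side default is -1
# (unlimited), which is why oversized requests are currently only
# rejected by the server.
channel_args = {
  'grpc.max_send_message_length' => 10 * 1024 * 1024  # 10 MiB, illustrative
}

stub = Google::Logging::V2::LoggingServiceV2::Stub.new(
  'logging.googleapis.com:443',
  GRPC::Core::ChannelCredentials.new,
  channel_args: channel_args
)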

We are looking for guidance on how we should proceed.

pieterjanpintens commented 2 years ago

Looking at the code, it seems that log entries are bundled per tag before being sent out. Would it make sense to set a limit on the number of entries in each send operation and split the entries over multiple send operations when needed? I think this would allow limiting the outgoing message size. A rough sketch of the idea is shown below.
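To make the suggestion concrete, here is a rough Ruby sketch (not the plugin's actual code) of splitting a per-tag bundle into size-bounded batches so that no single WriteLogEntries request can exceed the 10 MiB limit. The variables entries, log_name, resource, and stub are assumed to exist in context, and the 9 MiB threshold is illustrative.

MAX_REQUEST_BYTES = 9 * 1024 * 1024  # stay safely under the 10 MiB hard limit

# Yields consecutive slices of `entries` whose serialized size stays under
# `limit`. A single entry larger than the limit is still yielded on its own
# and would be rejected by the API.
def each_size_bounded_batch(entries, limit = MAX_REQUEST_BYTES)
  batch = []
  batch_bytes = 0
  entries.each do |entry|
    entry_bytes = Google::Logging::V2::LogEntry.encode(entry).bytesize
    if !batch.empty? && batch_bytes + entry_bytes > limit
      yield batch
      batch = []
      batch_bytes = 0
    end
    batch << entry
    batch_bytes += entry_bytes
  end
  yield batch unless batch.empty?
end

each_size_bounded_batch(entries) do |batch|
  request = Google::Logging::V2::WriteLogEntriesRequest.new(
    log_name: log_name,
    resource: resource,
    entries: batch,
    partial_success: true
  )
  stub.write_log_entries(request)
end

Splitting by serialized byte size rather than by a fixed entry count would also cover the case where a few very large entries blow past the limit even though the entry count is small.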