fluent / fluent-plugin-opensearch

OpenSearch Plugin for Fluentd
Apache License 2.0

nil:nilclass error connecting to opensearch #42

Open LeonD9 opened 2 years ago

LeonD9 commented 2 years ago

Steps to replicate

OpenSearch output and buffer config:

    <label @opensearch>
      <match **>
        @type opensearch
        @id logs
        include_tag_key true
        time_key time
        scheme "#{ENV['FLUENT_OPENSEARCH_SCHEME'] || 'https'}"
        ssl_verify "#{ENV['FLUENT_OPENSEARCH_SSL_VERIFY'] || 'true'}"
        ssl_version "#{ENV['FLUENT_OPENSEARCH_SSL_VERSION'] || 'TLSv1_2'}"
        reload_connections "#{ENV['FLUENT_OPENSEARCH_RELOAD_CONNECTIONS'] || 'false'}"
        reconnect_on_error "#{ENV['FLUENT_OPENSEARCH_RECONNECT_ON_ERROR'] || 'true'}"
        reload_on_failure "#{ENV['FLUENT_OPENSEARCH_RELOAD_ON_FAILURE'] || 'true'}"
        log_os_400_reason "#{ENV['FLUENT_OPENSEARCH_LOG_ES_400_REASON'] || 'true'}"
        with_transporter_log false
        logstash_prefix 'k8s'
        logstash_dateformat "#{ENV['FLUENT_OPENSEARCH_LOGSTASH_DATEFORMAT'] || '%Y.%m.%d'}"
        logstash_format "#{ENV['FLUENT_OPENSEARCH_LOGSTASH_FORMAT'] || 'true'}"
        index_name 'k8s'
        target_index_key "#{ENV['FLUENT_OPENSEARCH_TARGET_INDEX_KEY'] || use_nil}"
        type_name 'fluentd'
        include_timestamp "#{ENV['FLUENT_OPENSEARCH_INCLUDE_TIMESTAMP'] || 'false'}"
        template_name "#{ENV['FLUENT_OPENSEARCH_TEMPLATE_NAME'] || use_nil}"
        template_file "#{ENV['FLUENT_OPENSEARCH_TEMPLATE_FILE'] || use_nil}"
        template_overwrite "#{ENV['FLUENT_OPENSEARCH_TEMPLATE_OVERWRITE'] || use_default}"
        request_timeout "#{ENV['FLUENT_OPENSEARCH_REQUEST_TIMEOUT'] || '20s'}"
        suppress_type_name "#{ENV['FLUENT_OPENSEARCH_SUPPRESS_TYPE_NAME'] || 'true'}"
        <endpoint>
          url "#{ENV['FLUENT_MAIN_OPENSEARCH_HOST']}"
          region "#{ENV['OPENSEARCH_REGION']}"
          assume_role_arn "#{ENV['ASSUMED_IAM_ROLE']}"
        </endpoint>
        <buffer>
          @type file
          flush_thread_count '16'
          flush_interval '10s'
          chunk_limit_size '10MB'
          chunk_limit_records '3000'
          total_limit_size '65GB'
          retry_max_interval '30'
          flush_mode 'interval'
          overflow_action 'drop_oldest_chunk'
        </buffer>
      </match>
    </label>

I receive the following error when trying to send logs to OpenSearch:

  2022-03-28 14:28:44 +0000 [warn]: #0 [opensearch] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-28 14:29:00 +0000 chunk="5db481db7caf87bc9550c195959bd79b" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"****\", :port=>443, :scheme=>\"https\"}): undefined method `call' for nil:NilClass"

What could be the reason for this error?

Expected Behavior or What you need to ask

Access to OpenSearch is open and logs do arrive from time to time, but I keep receiving this error most of the time. What could be the reason for it?

Using Fluentd and OpenSearch plugin versions

ryn9 commented 2 years ago

Please try the following and report back your results (see the sketch below):

- get rid of the 'type_name' entry
- either get rid of 'include_tag_key true', or add a 'tag_key' entry
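
For illustration only, a minimal sketch of what that change could look like — the tag_key value here is just a placeholder, and the rest of your settings stay as they are:

    <match **>
      @type opensearch
      # type_name removed entirely
      include_tag_key true
      # if include_tag_key is kept, name the field explicitly;
      # the alternative is to drop include_tag_key altogether
      tag_key 'fluentd_tag'
      # ... rest of the output settings unchanged ...
    </match>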

LeonD9 commented 2 years ago

@ryn9 Removing type_name and include_tag_key didn't help; I still get the following error:


2022-03-28 14:28:44 +0000 [warn]: #0 [opensearch] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-28 14:29:00 +0000 chunk="5db481db7caf87bc9550c195959bd79b" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"****\", :port=>443, :scheme=>\"https\"}): undefined method `call' for nil:NilClass"
LeonD9 commented 2 years ago

It seems to be an issue with the latest fluent-plugin-opensearch version; reverting to 1.0.2 fixed my issue.

ryn9 commented 2 years ago

@cosmo0920 - perhaps a bug in 1.0.3?

ryn9 commented 2 years ago

@cosmo0920 - I hit the same bug when using 1.0.3

cosmo0920 commented 2 years ago

Sure, I'll take a look.

cosmo0920 commented 2 years ago

Could you paste the log lines around the error? Ruby is a scripting language; we need to know where the error occurred and how it was caused.

ryn9 commented 2 years ago

@cosmo0920

Here is a 1.0.3 error:

2022-03-30 20:33:16 +0000 [debug]: #0 [out_opensearch] Created new chunk chunk_id="5db75748cfce7588a2607967c0bca145" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="test", variables=nil, seq=0>
2022-03-30 20:33:20 +0000 [debug]: #0 [out_opensearch] taking back chunk for errors. chunk="5db75748cfce7588a2607967c0bca145"
2022-03-30 20:33:20 +0000 [warn]: #0 [out_opensearch] failed to flush the buffer. retry_times=0 next_retry_time=2022-03-30 20:33:50 +0000 chunk="5db75748cfce7588a2607967c0bca145" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"REDACTED\", :port=>443, :scheme=>\"https\", :user=>\"REDACTED\", :password=>\"obfuscated\"}): undefined method `call' for nil:NilClass"
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:1119:in `rescue in send_bulk'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:1081:in `send_bulk'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:892:in `block in write'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:891:in `each'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:891:in `write'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1179:in `try_flush'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1495:in `flush_thread_run'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
  2022-03-30 20:33:20 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2022-03-30 20:33:50 +0000 [debug]: #0 [out_opensearch] taking back chunk for errors. chunk="5db75748cfce7588a2607967c0bca145"
2022-03-30 20:33:50 +0000 [warn]: #0 [out_opensearch] failed to flush the buffer. retry_times=1 next_retry_time=2022-03-30 20:34:24 +0000 chunk="5db75748cfce7588a2607967c0bca145" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"REDACTED\", :port=>443, :scheme=>\"https\", :user=>\"REDACTED\", :password=>\"obfuscated\"}): undefined method `call' for nil:NilClass"
  2022-03-30 20:33:50 +0000 [warn]: #0 suppressed same stacktrace

Compared to the same config and the same test messages being sent through 1.0.2:

2022-03-30 20:39:57 +0000 [debug]: #0 [out_opensearch] Created new chunk chunk_id="5db758c75713c0982038aed5cf73d4cd" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="test", variables=nil, seq=0>

It then flushed successfully without any logged messages.

LeonD9 commented 2 years ago

@cosmo0920 @ryn9 Did you also see this warning message with the 1.0.2 gem?

2022-04-04 07:15:53 +0000 [warn]: #0 emit transaction failed: error_class=NoMethodError error="undefined method `+' for nil:NilClass" location="/fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/output.rb:1341:in `submit_flush_once'" tag="kubernetes.var.log.containers.kube-metrics-adapter-6f75fbbbc4-s5smw_kube-system_kube-metrics-adapter-ee0fb2300dec03ae655ae7ad6e094e98dcebb26b61ac9e0213647a9cb015f763.log"
2022-04-04 07:15:53 +0000 [warn]: #0 suppressed same stacktrace

Any idea what could cause this error?

cosmo0920 commented 2 years ago

@LeonD9 Could you paste more context around the error? The Ruby backtrace is also important for digging into the cause.

ryn9 commented 2 years ago

@cosmo0920 - I ran this again with tracing, and it looks like this occurs in the bulk error condition. If I send a payload with no errors, it appears to work okay. When I send a payload with an error, it fails like above.

Here is a slightly modified trace:

2022-04-05 15:49:21 +0000 [trace]: #0 [out_opensearch] writing events into buffer instance=3000 metadata_size=1
2022-04-05 15:49:21 +0000 [debug]: #0 [out_opensearch] Created new chunk chunk_id="5dbea303f4f77964612f171a1b631b04" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="test", variables=nil, seq=0>
2022-04-05 15:49:21 +0000 [trace]: #0 [out_opensearch] chunk /fluentd/root_dir/worker0/out_opensearch/buffer/buffer.b5dbea303f4f77964612f171a1b631b04.log size_added: 206 new_size: 206
2022-04-05 15:49:22 +0000 [trace]: #2 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:22 +0000 [trace]: #1 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] enqueueing chunk instance=3000 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="test", variables=nil, seq=0>
2022-04-05 15:49:22 +0000 [trace]: #3 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] dequeueing a chunk instance=3000
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] chunk dequeued instance=3000 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="test", variables=nil, seq=0>
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] trying flush for a chunk chunk="5dbea303f4f77964612f171a1b631b04"
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] adding write count instance=3440
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] executing sync write chunk="5dbea303f4f77964612f171a1b631b04"
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] bulk request: {"index":{"_index":"demo-field1-field2-2022.13","_type":"_doc"}}
{"foo.bar":"test","message":"test","@timestamp":"2022-03-31T13:52:51.315315315+00:00"}

2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] bulk response: {"took"=>7, "errors"=>true, "items"=>[{"index"=>{"_index"=>"demo-field1-field2-2022.13", "_type"=>"_doc", "_id"=>"0Xxq-n8Br-TZ0qY9lARI", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."}}}]}
2022-04-05 15:49:22 +0000 [debug]: #0 [out_opensearch] taking back chunk for errors. chunk="5dbea303f4f77964612f171a1b631b04"
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] taking back a chunk instance=3000 chunk_id="5dbea303f4f77964612f171a1b631b04"
2022-04-05 15:49:22 +0000 [trace]: #0 [out_opensearch] chunk taken back instance=3000 chunk_id="5dbea303f4f77964612f171a1b631b04" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="test", variables=nil, seq=0>
2022-04-05 15:49:22 +0000 [warn]: #0 [out_opensearch] failed to flush the buffer. retry_times=0 next_retry_time=2022-04-05 15:49:27 +0000 chunk="5dbea303f4f77964612f171a1b631b04" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"REDACTED\", :port=>443, :scheme=>\"https\", :user=>\"REDACTED\", :password=>\"obfuscated\"}): undefined method `call' for nil:NilClass"
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:1119:in `rescue in send_bulk'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:1081:in `send_bulk'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:892:in `block in write'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:891:in `each'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:891:in `write'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1179:in `try_flush'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1495:in `flush_thread_run'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
  2022-04-05 15:49:22 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.14.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2022-04-05 15:49:23 +0000 [trace]: #2 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:23 +0000 [trace]: #1 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:23 +0000 [trace]: #0 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:23 +0000 [trace]: #3 [out_opensearch] enqueueing all chunks in buffer instance=3000
2022-04-05 15:49:24 +0000 [trace]: #2 [out_opensearch] enqueueing all chunks in buffer instance=3000
LeonD9 commented 2 years ago

@cosmo0920 It is the same as this issue: https://github.com/fluent/fluentd/issues/1973. There are no further logs beyond the following in debug/trace mode:

2022-04-16 19:25:26 +0000 [trace]: #0 [main-logs] writing events into buffer instance=458700 metadata_size=1
2022-04-16 19:25:26 +0000 [debug]: #0 [main-logs] Created new chunk chunk_id="5dcd50bbcfeaf9b7d6fc14f983017cbb" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="kubernetes.var.log.containers.test.log", variables=nil, seq=0>
2022-04-16 19:25:26 +0000 [trace]: #0 [main-logs] chunk /var/log/fluentd/worker0/main-logs/buffer/buffer.b5dcd50bbcfeaf9b7d6fc14f983017cbb.log size_added: 3499 new_size: 3499
2022-04-16 19:25:26 +0000 [warn]: #0 emit transaction failed: error_class=NoMethodError error="undefined method `+' for nil:NilClass" location="/fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/output.rb:1341:in `submit_flush_once'" tag="kubernetes.var.log.containers.test.log"
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/output.rb:1341:in `submit_flush_once'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/output.rb:853:in `emit_buffered'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/output.rb:830:in `emit_events'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/event_router.rb:115:in `emit_stream'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/out_relabel.rb:29:in `process'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/output.rb:839:in `emit_sync'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/event_router.rb:115:in `emit_stream'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/in_forward.rb:338:in `on_message'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/in_forward.rb:226:in `block in handle_connection'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/in_forward.rb:263:in `block (3 levels) in read_messages'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/in_forward.rb:262:in `feed_each'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/in_forward.rb:262:in `block (2 levels) in read_messages'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin/in_forward.rb:271:in `block in read_messages'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin_helper/server.rb:613:in `on_read_without_connection'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/cool.io-1.7.1/lib/cool.io/io.rb:123:in `on_readable'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/cool.io-1.7.1/lib/cool.io/io.rb:186:in `on_readable'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run_once'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2022-04-16 19:25:26 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.14.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2022-04-16 19:25:26 +0000 [error]: #0 unexpected error on reading data host="10.1.7.177" port=42756 error_class=NoMethodError error="undefined method `get' for nil:NilClass"
ryn9 commented 2 years ago

@cosmo0920 Is there anything more we can provide in order to get this resolved?

cosmo0920 commented 2 years ago

Ordinary records will not cause this issue. Could you provide a sample record that triggers this error? Reproducible steps would be even better, but at minimum a sample invalid record is needed to resolve this issue.

ghost commented 2 years ago

I've encountered this issue too. Interestingly, I never observed the problem before adding filters to retag records with JSON in the message and re-parse them. My fluentd.conf now looks like this:

      <label @FLUENT_LOG>
        <match fluent.**>
          @type null
          @id ignore_fluent_logs
        </match>
      </label>

      <source>
        @type tail
        @id in_tail_container_logs
        path "/var/log/containers/*.log"
        pos_file "/var/log/fluentd-containers.log.pos"
        tag "kubernetes.*"
        exclude_path []
        read_from_head true
        <parse>
          @type "cri"
          time_format "%Y-%m-%dT%H:%M:%S.%N%z"
        </parse>
      </source>

      <filter kubernetes.**>
        @type kubernetes_metadata
        @id filter_kube_metadata
        kubernetes_url "https://172.30.0.1:443/api"
        verify_ssl true
        ca_file ""
        skip_labels false
        skip_container_metadata false
        skip_master_url false
        skip_namespace_metadata false
        watch true
      </filter>

      <match kubernetes.**>
        @type rewrite_tag_filter
        <rule>
          key     message
          pattern /^\{.*\}$/
          tag     application.json
        </rule>
        <rule>
          key     kubernetes.container_name
          pattern /^known-container$/
          tag     application.freetext
        </rule>
        <rule>
          key     message
          pattern /.*$/
          tag     unidentified
        </rule>
      </match>

      <filter application.json>
        @type parser
        key_name "message"
        reserve_data true
        remove_key_name_field true
        <parse>
          @type json
        </parse> 
      </filter>

      <match application.json>
        @type opensearch
        @id out_os_app_json
        include_tag_key true
        host "opensearch-cluster-master"
        port 9200
        path ""
        scheme https
        ssl_verify false
        ssl_version TLSv1_2
        user "admin"
        password "xxxxxxx"
        index_name "fluentd_application"
        request_timeout 30s # defaults to 5s
        suppress_type_name true
      </match>
      <match application.**>
        @type opensearch
        @id out_os_app
        include_tag_key true
        host "opensearch-cluster-master"
        port 9200
        path ""
        scheme https
        ssl_verify false
        ssl_version TLSv1_2
        user "admin"
        password "xxxxxxx"
        index_name "fluentd_application"
        request_timeout 30s # defaults to 5s
        suppress_type_name true
      </match>      
      # # Catchall...
      <match **>
        @type stdout
        @id unidentified_log
      </match>

Error output from fluentd:

2022-05-19 09:14:41 +0000 [warn]: #0 [out_os_app_json] failed to flush the buffer. retry_times=0 next_retry_time=2022-05-19 09:14:43 +0000 chunk="5df59ca32e546139145bdc1c15cf2986" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"opensearch-cluster-master\", :port=>9200, :scheme=>\"https\", :user=>\"admin\", :password=>\"obfuscated\", :path=>\"\"}): undefined method `call' for nil:NilClass"
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:1119:in `rescue in send_bulk'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:1081:in `send_bulk'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:892:in `block in write'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:891:in `each'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-opensearch-1.0.3/lib/fluent/plugin/out_opensearch.rb:891:in `write'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1179:in `try_flush'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1500:in `flush_thread_run'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
  2022-05-19 09:14:41 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

I can't tell what record(s) are causing the problem, but when I switch the out_os_app_json output to stdout, a sample record looks like this:

{
    "stream"=>"stdout", 
    "logtag"=>"F", 
    "message"=>"{\"remote\":\"127.0.0.1:52400\",\"client\":\"conn216652\",\"doc\":{\"driver\":{\"name\":\"mongo-go-driver\",\"version\":\"v1.7.0\"},\"os\":{\"type\":\"linux\",\"architecture\":\"amd64\"},\"platform\":\"go1.17.5\",\"mongos\":{\"host\":\"shared-mongodb-mongos-5d65d4667c-n2fwv:27017\",\"client\":\"127.0.0.1:52400\",\"version\":\"4.4.10-11\"}}}}", 
    "time"=>"2022-05-19T09:00:57.120674031+00:00", 
    "docker"=>{"container_id"=>"1e31fb4a258736fc6eaeb35d5b5ed45c12c21963f64a97e917e87bd79734384b"}, 
    "kubernetes"=>{"container_name"=>"fluentd", 
        "namespace_name"=>"xxxxxxxx", 
        "pod_name"=>"fluentd-2n9kd", 
        "container_image"=>"docker.io/fluent/fluentd-kubernetes-daemonset:v1.14.6-debian-opensearch-1.0", 
        "container_image_id"=>"docker.io/fluent/fluentd-kubernetes-daemonset@sha256:054d75bef7ca91b6434d2bc86d387b7de919c408dac7e0ee88322cc5926d8798", 
        "pod_id"=>"3519ab44-1b54-4346-97f6-fdb0254d41cd", 
        "pod_ip"=>"10.129.1.7", 
        "host"=>"compute-0", 
        "labels"=>{"controller-revision-hash"=>"68458f67f6", 
            "k8s-app"=>"fluentd-logging", 
            "pod-template-generation"=>"1", 
            "version"=>"v1"},
        "master_url"=>"https://172.30.0.1:443/api", 
        "namespace_id"=>"ec79e7e2-526a-4213-9f43-fcb8ef8cafc7", 
        "namespace_labels"=>{"kubernetes_io/metadata_name"=>"xxxxxxxx"}
    }
}

NOTE that, while I can't tell if this is the offending record, I've selected it because I noticed the JSON in the message hash contains a spurious }. This invalid JSON log appears to have been generated by the fluentd container itself, presumably triggered by some condition in the original log, but I have yet to find the root cause.

Whether this invalid JSON is related to this issue or not, I don't know.

ghost commented 2 years ago

Running with trace logging, I've found an illegal_state_exception: Can't get text on a START_OBJECT at 1:1320:

2022-05-19 14:49:29 +0000 [trace]: #0 [out_os_app_json] bulk response: {"took"=>925, "errors"=>true, "items"=>[
...
...
{"index"=>{"_index"=>"dc4h-goal-1-application", "_id"=>"yKfL3IAB4pmfZooFjGdO", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [attr.command] of type [text] in document with id 'yKfL3IAB4pmfZooFjGdO'. Preview of field's value: '{ordered=true, $replData=1, $configServerState={opTime={t=1, ts={$timestamp={t=1652971704, i=3}}}}, $db=config, $clusterTime={clusterTime={$timestamp={t=1652971704, i=4}}, signature={keyId=7099418590524211201, hash={$binary={base64=ww8hldh66Ok8b+I4w+Gt7OUgUnA=, subType=0}}}}, maxTimeMS=30000, update=mongos, bypassDocumentValidation=false, writeConcern={wtimeout=60000, w=majority}, updates=[{q={_id=shared-mongodb-mongos-6bd459598f-9bwvp:27017}, u={$set={mongoVersion=4.4.10-11, waiting=true, ping={$date=2022-05-19T14:48:25.183Z}, _id=shared-mongodb-mongos-6bd459598f-9bwvp:27017, up=9474, advisoryHostFQDNs=[]}}, upsert=true, multi=false}]}'", "caused_by"=>{"type"=>"illegal_state_exception", "reason"=>"Can't get text on a START_OBJECT at 1:1320"}}}}
...
...
2022-05-19 14:49:29 +0000 [debug]: #0 [out_os_app_json] taking back chunk for errors. chunk="5df5e77819c10486f117a8ce77af0598"
2022-05-19 14:49:29 +0000 [trace]: #0 [out_os_app_json] taking back a chunk instance=50740 chunk_id="5df5e77819c10486f117a8ce77af0598"
2022-05-19 14:49:29 +0000 [trace]: #0 [out_os_app_json] chunk taken back instance=50740 chunk_id="5df5e77819c10486f117a8ce77af0598" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="application.json", variables=nil, seq=0>
2022-05-19 14:49:30 +0000 [warn]: #0 [out_os_app_json] failed to flush the buffer. retry_times=0 next_retry_time=2022-05-19 14:49:31 +0000 chunk="5df5e77819c10486f117a8ce77af0598" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"opensearch-cluster-master.dc4h-goal-1.svc\", :port=>9200, :scheme=>\"https\", :user=>\"admin\", :password=>\"obfuscated\", :path=>\"\"}): undefined method `call' for nil:NilClass"

The related record for the bulk insert appears to be:

{"stream":"stdout","logtag":"F","time":"2022-05-19T14:48:25.332505099+00:00","docker":{"container_id":"2699a109e1c84b45c3636b03460ba99ec70ea88b5066ca17c427ccd0c97e29b8"},"kubernetes":{"container_name":"mongod","namespace_name":"redacted","pod_name":"shared-mongodb-cfg-0","container_image":"docker.io/percona/percona-server-mongodb:4.4.10-11","container_image_id":"docker.io/percona/percona-server-mongodb@sha256:ea73a506fa02604660e3ef7d452d142a89587bb5daca15d3cc1b539a9b1000c5","pod_id":"a05dc0e0-d49b-41e6-a2cf-47e1d5366c1e","pod_ip":"10.129.1.26","host":"compute-0","labels":{"controller-revision-hash":"shared-mongodb-cfg-67b7fc579","app_kubernetes_io/component":"cfg","app_kubernetes_io/instance":"shared-mongodb","app_kubernetes_io/managed-by":"percona-server-mongodb-operator","app_kubernetes_io/name":"percona-server-mongodb","app_kubernetes_io/part-of":"percona-server-mongodb","app_kubernetes_io/replset":"cfg","statefulset_kubernetes_io/pod-name":"shared-mongodb-cfg-0"},"master_url":"https://172.30.0.1:443/api","namespace_id":"d83f45cc-dc13-4846-88f2-e9991388c94d","namespace_labels":{"kubernetes_io/metadata_name":"redacted"}},"t":{"$date":"2022-05-19T14:48:25.332+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn11951","msg":"Slow query","attr":{"type":"command","ns":"config.$cmd","command":{"update":"mongos","bypassDocumentValidation":false,"ordered":true,"updates":[{"q":{"_id":"shared-mongodb-mongos-6bd459598f-9bwvp:27017"},"u":{"$set":{"_id":"shared-mongodb-mongos-6bd459598f-9bwvp:27017","ping":{"$date":"2022-05-19T14:48:25.183Z"},"up":9474,"waiting":true,"mongoVersion":"4.4.10-11","advisoryHostFQDNs":[]}},"multi":false,"upsert":true}],"writeConcern":{"w":"majority","wtimeout":60000},"maxTimeMS":30000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1652971704,"i":4}},"signature":{"hash":{"$binary":{"base64":"ww8hldh66Ok8b+I4w+Gt7OUgUnA=","subType":"0"}},"keyId":7099418590524211201}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1652971704,"i":3}},"t":1}},"$db":"config"},"numYields":0,"reslen":619,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":3},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"storage":{},"protocol":"op_msg","durationMillis":103},"tag":"application.json"}

The JSON looks valid as far as I can tell, so my previous comment may be a red herring. However, adding replace_invalid_sequence true to the json <filter> did seem to stop the errors occurring on the opensearch output plugin... UPDATE: not true. replace_invalid_sequence true does not work around the problem.

hth

cosmo0920 commented 2 years ago

Thanks! Candidate records that reproduce the issue are very helpful for us. I'll dig into the root cause of this issue.

sdwerwed commented 2 years ago

@cosmo0920 I can reproduce it. I think it happens when the bulk request gets a 400 error from OpenSearch, for example because of a datatype mismatch (https://github.com/fluent/fluent-plugin-opensearch/issues/53). I started getting this error after upgrading to the latest version. Before, the error was very clear when log_os_400_reason true was set; now it fails to flush the chunk and gives undefined method `call' for nil:NilClass.
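
For reference, this is the setting I mean — a sketch only, not a fix; on 1.0.2 it surfaced the reason for each 400 returned by the bulk API, while on the newer version the flush just fails with the nil:NilClass error:

    <match **>
      @type opensearch
      # on 1.0.2 this logged the 400 reason (e.g. a mapper_parsing_exception);
      # on 1.0.3+ the same condition ends in the undefined method `call' error
      log_os_400_reason true
      # ... rest of the output config ...
    </match>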

I have downgraded to 1.0.2 with fluent-gem install 'fluent-plugin-opensearch' -v 1.0.2 and it works as expected.

thoro commented 2 years ago

Same issue with 1.0.4. It also seems not to flush the items from the buffer, which leads to a buffer overrun and then a total system halt (i.e. no logs at all).

Downgrading to 1.0.2 solved it, but frankly that's not really a solution.
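
As a stopgap while staying on an affected version, one could at least bound retries so that failing chunks are eventually dropped instead of backing up the whole pipeline — a sketch with placeholder values, not a fix for the underlying bug:

    <buffer>
      @type file
      total_limit_size 1GB               # cap the on-disk buffer
      overflow_action drop_oldest_chunk  # drop instead of blocking when the buffer is full
      retry_timeout 1h                   # stop retrying a failing chunk after an hour
      retry_max_interval 30s
    </buffer>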

cosmo0920 commented 2 years ago

I'm not sure why the block becomes nil. I have created a patch that uses an if clause instead of the block operation. How about this patch?