Got "invalid byte sequence in UTF-8" error when using the concat plugin

chikinchoi commented 4 years ago

Problem

Hi Team,

I am using fluent-plugin-concat to rejoin log lines that Docker splits because of its 16KB line limit. However, I recently started seeing this error: "dump an error event: error_class=ArgumentError error="invalid byte sequence in UTF-8" location="/usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-concat-2.4.0/lib/fluent/plugin/filter_concat.rb:291:in `match'"". I added replace_invalid_sequence, but it did not help. Please advise. Thank you!
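For readers unfamiliar with this error, here is a minimal Ruby sketch of how such an ArgumentError can arise. The log fragment is hypothetical and is not taken from the reporter's data; it only assumes that the value of the log key contains a byte sequence that is not valid UTF-8 when the concat filter runs its regexp match.

```ruby
# Hypothetical log fragment: "\xC3" starts a two-byte UTF-8 character,
# but its continuation byte is missing, so the string is invalid UTF-8.
line = "{\"msg\": \"caf\xC3"

puts line.encoding         # UTF-8 (assuming a UTF-8 source file, Ruby's default)
puts line.valid_encoding?  # false

error = nil
begin
  # A regexp match against the broken string raises, even for a trivial pattern.
  line.match(/\}/)
rescue ArgumentError => e
  error = e
end

puts error.class           # ArgumentError
```

The point of the sketch is that the failure comes from Ruby's regexp engine refusing a broken string, so it can surface in any plugin that matches a pattern against unsanitized input.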

Steps to replicate

I cannot reproduce the error on demand because so many logs are sent to this Fluentd instance. Below is my filter config:

<filter **firelens**>
  @type concat
  key log
  multiline_start_regexp '^\{\\"@timestamp'
  multiline_end_regexp '/\}/'
  separator ""
  flush_interval 1
  timeout_label @NORMAL
</filter>

<label @NORMAL>
  <match **>
    @type null
  </match>
</label>

Your environment

fluentd version 1.11.1, fluent-plugin-concat version 2.4.0

cosmo0920 commented 4 years ago

However, I found an error "dump an error event: error_class=ArgumentError error="invalid byte sequence in UTF-8" location="/usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-concat-2.4.0/lib/fluent/plugin/filter_concat.rb:291:in `match'" recently. I have added the replace_invalid_sequence but no luck. Please advise. Thank you!!

This parameter belongs in the parser filter plugin configuration, not in the concat filter plugin configuration.

Setting replace_invalid_sequence to true should handle invalid byte sequences in UTF-8 or other encodings: https://docs.fluentd.org/filter/parser#replace_invalid_sequence

chikinchoi commented 4 years ago

Hi @cosmo0920 ,

I understand that replace_invalid_sequence should be added to the parser filter plugin, and I see that there are several parser plugins, e.g. "json", "csv", and "multiline". However, I don't need to parse the data into another format in the concat filter. How can I use replace_invalid_sequence with the concat filter? Thank you.

<filter **firelens**>
  @type concat
  key log
  multiline_start_regexp '^\{\\"@timestamp'
  multiline_end_regexp '/\}/'
  separator ""
  flush_interval 1
  timeout_label @NORMAL
</filter>

chikinchoi commented 4 years ago

Hi @cosmo0920 ,

I think the two fixes are mutually exclusive in this case. I considered the solution below to fix both the "docker has split over multiple lines due to its 16KB line limit" issue and the "invalid byte sequence in UTF-8" issue.

According to [1], an event passes through the filters in the order they are defined in the configuration, from top to bottom. Therefore, if I place the concat filter first, it triggers the "invalid byte sequence in UTF-8" issue, because replace_invalid_sequence is only applied later by the parser filter. If I place the parser filter first, it triggers the "docker has split over multiple lines due to its 16KB line limit" issue, because the log field in some events is incomplete after being split across multiple lines. Could you please add a replace_invalid_sequence parameter to the concat plugin, or suggest another way to resolve this conflict? Thank you very much!

<filter **firelens**>
  @type concat
  key log
  multiline_start_regexp '^\{\\"@timestamp'
  multiline_end_regexp '/\}/'
  separator ""
  flush_interval 1
  timeout_label @NORMAL
</filter>

<filter **firelens**>
  @type parser
  key_name log
  reserve_data true
  replace_invalid_sequence true
  emit_invalid_record_to_error false
  <parse>
    @type json
  </parse>
</filter>

[1] https://docs.fluentd.org/filter

cosmo0920 commented 4 years ago

Could you please add a new feature which is to add a new parameter replace_invalid_sequence into the concat plugin or suggest another solution to fix this mutual exclusion? Thank you very much!

We won't add replace_invalid_sequence to the concat filter plugin. In the Fluentd world, one plugin should have one function; monolithic plugins do not fit Fluentd's design concept.

Instead, how about using fluent-plugin-string-scrub to scrub invalid byte sequences?
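To illustrate why scrubbing first removes the ordering conflict, here is a rough Ruby sketch of the three stages (scrub, then concatenate, then parse). It is only a model of the idea and not how Fluentd or these plugins are implemented; the chunks and the simplified START_RE and END_RE patterns are hypothetical and differ from the regexps in the thread's config.

```ruby
require "json"

# Hypothetical, simplified patterns; the thread's own regexps differ.
START_RE = /\A\{"@timestamp/
END_RE   = /\}\z/

# Hypothetical split chunks; the second one carries a stray invalid byte.
chunks = [
  "{\"@timestamp\":\"2020-07-01T00:00:00Z\",",
  "\"msg\":\"caf\xC3 ok\"}"
]

# Stage 1 (the scrub filter's role): make every chunk valid UTF-8.
clean = chunks.map { |c| c.scrub("?") }

# Stage 2 (the concat filter's role): regexp matching is now safe, then join.
buffer = []
joined = nil
clean.each do |c|
  buffer = [] if START_RE.match?(c)
  buffer << c
  if END_RE.match?(c)
    joined = buffer.join("")
    buffer = []
  end
end

# Stage 3 (the parser filter's role): parse the joined line as JSON.
record = JSON.parse(joined)
puts record["msg"]   # caf? ok
```

In this model each stage still does one job, which matches the one-plugin-one-function argument above: the scrub step only sanitizes, the concat step only joins, and the parse step only parses.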

chikinchoi commented 4 years ago

Hi @cosmo0920 ,

Thank you for your suggestion. I added the string_scrub filter with the config below, and the invalid byte sequence issue is gone.

<filter **>
  @type string_scrub
  replace_char ?
</filter>

However, I don't really understand the string_scrub plugin. What does replace_char ? do? Could you share some example input and the resulting output after the filter is applied? Thank you very much!

cosmo0920 commented 4 years ago

replace_char is used with Ruby's String#scrub! (https://ruby-doc.org/core-2.4.0/String.html#method-i-scrub-21), which replaces invalid bytes with the given string. Since the invalid byte sequence issue is solved, I'm closing this.
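As a small illustration of the input/output question, the following Ruby sketch calls String#scrub directly with "?" as the replacement. The input strings are hypothetical, and the sketch shows Ruby's method only; it does not reproduce the plugin's internals.

```ruby
# Hypothetical input: "\xC3" is a lone lead byte, so the string is invalid UTF-8.
broken = "caf\xC3 ok"
puts broken.valid_encoding?   # false

# scrub returns a copy with each invalid byte sequence replaced by "?".
fixed = broken.scrub("?")
puts fixed                    # caf? ok
puts fixed.valid_encoding?    # true

# Strings that are already valid come back unchanged.
untouched = "plain ascii".scrub("?")
puts untouched                # plain ascii

# scrub! does the same replacement in place.
in_place = broken.dup
in_place.scrub!("?")
puts in_place                 # caf? ok
```

With replace_char ?, a record value like the first string would therefore come out with a literal question mark where the invalid byte used to be.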

fluent-plugins-nursery / fluent-plugin-concat