fluent / fluent-plugin-s3

Amazon S3 input and output plugin for Fluentd
https://docs.fluentd.org/output/s3

Logs missing after s3 upload #262

Closed taraspos closed 5 years ago

taraspos commented 5 years ago

Hello, we are using Fluentd to tail files with JSON events (a new file is created every 10 minutes), group the events by some id, and upload them to S3. During one spike in event generation, we noticed that a large number of these events were missing from the resulting files on S3. We basically grepped for the events by some keywords in the origin files and in the files downloaded from S3, and the counts differed by something like 200k against 40k. However, this seems to happen only during the spikes, because under normal conditions the number of events looks fine (though I still need to verify this).

<source>
  @type tail
  limit_recently_modified 6h
  path "#{ENV["EVENTS_PATH_PATTERN"]}" 
  pos_file "#{ENV["ANALYTICS_SHARED_VOLUME_PATH"]}/fluentd/events.pos"
  tag events.*

  <parse>
    @type json
    # use the current time, as `time_key`
    time_key ""
    keep_time_key
  </parse>
</source>

<match events.**>
  # rewrite tag to the format "app_id.<app_id>.event" for future s3 upload
  @type rewrite_tag_filter
  <rule>
    key     app_id
    pattern ^(.+)$
    tag     app_id.$1.event
  </rule>
</match>

# Upload analytics events to s3
<match app_id.**>
  @type s3
  @id events_s3

  s3_bucket "#{ENV["EVENTS_BUCKET"]}"
  s3_region "#{ENV["EVENTS_REGION"]}"

  # use second part (app_id) of the tag "app_id.<app_id>.event"
  path %Y/%m/%d/%H/${tag[1]}

  # so result s3 file will look like:
  # s3://<bucket>/<year>/<month>/<day>/<hour>/<app_id>/<time_slice/date>_<hostname>_<index>.log.gz
  s3_object_key_format %{path}/%{time_slice}_%{hostname}_%{index}.log.%{file_extension}

  <buffer tag,time>
    @type file
    path "#{ENV["ANALYTICS_SHARED_VOLUME_PATH"]}/fluentd/buffer/events/"
    timekey            "10m"
    timekey_wait       "1m"
    flush_mode         interval
    flush_interval     "10m"
    flush_thread_count "10"
    timekey_use_utc
  </buffer>
  <format>
    @type json
  </format>
</match>

The number of resulting files on S3 is correct: 6 files per folder, and 1 folder every hour, every day. The app_id key is present in all event files. There are no errors or warnings in the logs, except a very small number of slow_flush_log_threshold warnings. buffer_queue_length looks good as well.

No more ideas on how to debug this further.

taraspos commented 5 years ago

I see the following line in the documentation:

in_tail follows tail -F command behaviour by default, so in_tail reads only newer logs. If you want to read existing lines for batch use case, set read_from_head true.

Is it possible that, during an event generation spike, a new file is created with tons of lines, but Fluentd picks up new files only once per minute, so it skips all the lines generated before it started reading the file?

Does this mean that setting read_from_head to true could resolve my issue?

Also, I see the following note in the documentation:

When this is true, in_tail tries to read a file during start up phase. If target file is large, it takes long time and starting other plugins isn't executed until reading file is finished.

Is it true only for new files? Will it try to read the complete file even if it already exists in the .pos file?

repeatedly commented 5 years ago

so it skips all the lines generated before it started reading the file?

If you use * in the in_tail path parameter, it may happen. Setting read_from_head true is important.
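
Based on the source block posted above, a minimal sketch with that setting added (only read_from_head is new, everything else is unchanged):

<source>
  @type tail
  # read newly discovered files from the beginning instead of only tailing lines appended afterwards
  read_from_head true
  limit_recently_modified 6h
  path "#{ENV["EVENTS_PATH_PATTERN"]}"
  pos_file "#{ENV["ANALYTICS_SHARED_VOLUME_PATH"]}/fluentd/events.pos"
  tag events.*

  <parse>
    @type json
    time_key ""
    keep_time_key
  </parse>
</source>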

Is it true only for new files? Will it try to read the complete file even it is already exist in the .pos file?

If the file exists in pos_file, reading starts at the position recorded in pos_file. So the impact should be small.

One other tip: if buffer files are flushed concurrently during the spike, adding %{uuid_flush} or %{hex_random} to s3_object_key_format may help avoid objects being overwritten. If uploading is sequential, this is not needed.
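
For example, something like this, just adding %{uuid_flush} to the key format from the config above:

  # each flush gets a unique component, so concurrent flushes cannot collide on the same key
  s3_object_key_format %{path}/%{time_slice}_%{hostname}_%{uuid_flush}_%{index}.log.%{file_extension}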

taraspos commented 5 years ago

Thanks. I will try read_from_head true and see if it helps.

We have versioning enabled on the bucket, so I checked that files are not being overwritten in this case, but thanks for the tip!

taraspos commented 5 years ago

read_from_head true helped!