fluent / fluent-plugin-s3

Amazon S3 input and output plugin for Fluentd
https://docs.fluentd.org/output/s3

Plugin unable to read file size of 4GB. Is there any upper limit? #271

Closed amitdhawan closed 3 years ago

amitdhawan commented 5 years ago

Check the CONTRIBUTING guideline first; here is the list to help us investigate the problem.

fluentd or td-agent version: td-agent 1.3.3

Environment information:

Operating system: cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Kernel version: uname -r
4.4.0-1077-aws

Your configuration:

Your problem explanation. If you have error logs, include them.

Below is my config file:

<source>
  @type s3
  s3_bucket testtemplatedata
  s3_region us-east-1
  @log_level error
  <assume_role_credentials>
    role_arn arn:aws:iam::720991919452:role/Base-Role_webfontgenerator-preprod
    role_session_name fluentdSession
  </assume_role_credentials>
  <sqs>
    queue_name fluentd_queue
  </sqs>
</source>

#<system>
#  process_name fluendpoc
#  <log>
#    format json
#    time_format %Y-%m-%d
#  </log>
#</system>
<filter input.s3>
  @type grep
  <regexp>
    key message
    pattern /1.css/
  </regexp>
</filter>

<filter input.s3>
  @type parser
  key_name message
  remove_key_name_field true
  <parse>
    @type regexp
    expression /^(?<timestamp>[^ ]*) [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* (?<url>[^ ]*) [^ ]* [^ ]* [^ ]* [^ ]* "(?<platform>[^\"]*)" [^ ]* [^ ]* (?<ref>[^ ]*)/
    time_format %d/%b/%Y:%H:%M:%S %z
  </parse>
</filter>

<match input.s3>
  #<store>
  #  @type file
  #  path /var/log/td-agent/s3.log
  #</store>
  #<store>
  @type kinesis_streams
  stream_name PageViewTrackingStream
  region us-east-1
  <assume_role_credentials>
    role_arn arn:aws:iam::720991919452:role/Base-Role_webfontgenerator-preprod
    role_session_name kinesisSession
  </assume_role_credentials>
  #</store>
</match>

I'm processing log files uploaded to S3 and pushing them to a Kinesis stream. To check out Fluentd's capabilities I'm currently running td-agent on an AWS EC2 t2.micro instance. For log files containing 100 records or so, I get the output logs in Kinesis. But when I upload a log file of around 175MB in gz format, fluentd seems to behave unexpectedly and keeps showing me the trace log below:

2019-04-11 18:22:25 +0000 [trace]: #0 fluent/log.rb:281:trace: enqueueing all chunks in buffer instance=69999571134220

It is not able to read a file (gz format) of around 760MB, which is around 4GB when unzipped.

Is there an upper limit on file size here?
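
The repeating "enqueueing all chunks" trace suggests the output buffer is where the pressure builds. One way to put an explicit ceiling on buffer memory is a <buffer> section on the match block. A minimal sketch, not from the original config; the path and size values are illustrative and assume the standard fluentd v1 buffer parameters:

<match input.s3>
  @type kinesis_streams
  stream_name PageViewTrackingStream
  region us-east-1
  <buffer>
    @type file                              # spill chunks to disk instead of holding them in RAM
    path /var/log/td-agent/buffer/kinesis   # hypothetical buffer directory
    chunk_limit_size 8m                     # cap each chunk
    total_limit_size 512m                   # cap the whole buffer rather than letting it grow unbounded
    flush_thread_count 2                    # drain chunks concurrently
  </buffer>
</match>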

amitdhawan commented 5 years ago

Here is the output of the top command; it shows that ruby is consuming all the resident memory and CPU usage is at 100%.

[Screenshot: top output showing ruby at ~100% CPU with high resident memory]

@repeatedly any help?

amitdhawan commented 5 years ago

This is the same as this issue https://github.com/fluent/fluentd/issues/2379

repeatedly commented 5 years ago

I think there is no limit for the target file. But fluentd focuses on streaming data with low latency, so fluentd is not optimized for something like 4GB of archived data. Embulk or another batch/bulk loader is a better fit for such cases. The resource consumption seems normal: decompressing 760MB -> 4GB, parsing 4GB in the input, and formatting 4GB in the output (I'm not familiar with the kinesis output, so this point is just an assumption) all need CPU power.
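
For reference, a bulk load of the same objects with Embulk is driven by a one-shot config file rather than a daemon. A minimal sketch, assuming the embulk-input-s3 plugin and instance-profile credentials; the key prefix, parser, and output are placeholders to be matched to the actual log format and destination:

in:
  type: s3
  bucket: testtemplatedata
  path_prefix: logs/                # hypothetical key prefix
  endpoint: s3.us-east-1.amazonaws.com
  auth_method: instance             # use the EC2 instance profile
  decoders:
    - {type: gzip}                  # inflate .gz objects while reading
  parser:
    type: none                      # placeholder; swap in a parser plugin matching the access-log format
out:
  type: stdout                      # placeholder; a kinesis output plugin would go here

Running `embulk run s3_load.yml` then loads everything under the prefix in one batch pass.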

amitdhawan commented 5 years ago

@repeatedly So do you mean to say I can process large files in parallel from S3 using Embulk?

And is fluentd meant for streaming real-time logs rather than processing large files in one go?

amitdhawan commented 5 years ago

@repeatedly I did a small PoC on Embulk and got the impression that it is a bulk importer of data from S3 triggered from the command line; it doesn't fit my requirement of importing data when a file is uploaded to S3, which fluentd does.

Let me know if you think otherwise and I can use Embulk in my scenario.
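
That gap can be bridged by wiring the bucket's event notifications to a queue and running Embulk once per message. A rough Ruby sketch, assuming the aws-sdk-sqs gem; the queue name, config file, and env-var handoff are all hypothetical:

require "json"
require "aws-sdk-sqs"

sqs = Aws::SQS::Client.new(region: "us-east-1")
# Hypothetical queue receiving the bucket's s3:ObjectCreated:* notifications.
queue_url = sqs.get_queue_url(queue_name: "embulk_trigger_queue").queue_url

loop do
  resp = sqs.receive_message(queue_url: queue_url, wait_time_seconds: 20)
  resp.messages.each do |msg|
    record = JSON.parse(msg.body)["Records"]&.first
    next unless record
    key = record.dig("s3", "object", "key")
    # One-shot Embulk run for the newly uploaded object; the liquid template
    # would read the key via {{ env.S3_KEY }} (hypothetical config file).
    system({ "S3_KEY" => key }, "embulk", "run", "s3_to_kinesis.yml.liquid")
    sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
  end
end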

amitdhawan commented 5 years ago

@repeatedly I can now see memory going full throttle even when I consume a 200MB gz file from S3. The ruby command is taking full CPU power and memory. Do I need to do some optimization in terms of Ruby or fluentd to make all this work?


repeatedly commented 5 years ago

@okkez Do you have any insight into this? With a 200MB gz file, fluentd will take 200MB + the uncompressed content (maybe 400+MB) + more (the new events created from the file) in memory. So I assume fluentd temporarily consumes about 1GB in this case. CPU usage seems to depend on the file content and the kinesis output implementation.
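
The rough arithmetic: ~200MB of compressed body, plus 400+MB of inflated text, plus the parsed event objects built from it, all resident at once, lands near 1GB. Whether the whole inflated payload is held at once is the crux; a minimal Ruby illustration of the difference (not the plugin's actual code):

require "zlib"

# Slurp: the entire inflated payload (400+MB here) becomes one Ruby String.
whole = Zlib::GzipReader.open("access.log.gz") { |gz| gz.read }

# Stream: one line at a time; resident memory stays near the zlib window size.
Zlib::GzipReader.open("access.log.gz") do |gz|
  gz.each_line do |line|
    # parse and forward the line here
  end
end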

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

github-actions[bot] commented 3 years ago

This issue was automatically closed because it had been stale for 30 days.