Open SoamA opened 1 year ago
@SoamA Sorry I'm not completely understanding this... can you explain it with an example s3_key_prefix that has $INDEX and the full S3 key names and tag values. And explain the file fragment part some more, thanks!
Is the issue that for each value of $TAG, you want sequential indexes?
I should note this bug: https://github.com/aws/aws-for-fluent-bit/issues/653
Yes, here's the fluent bit config:
```
s3_key_format /spark-event-logs/mycluster/eventlog_v2_$TAG[1]/events_$INDEX_$TAG[1]_$UUID
```
Note that we're relying on the $INDEX feature to embed a numerically increasing sequence in the filenames. Here's what the Fluent Bit upload log looks like:
```
2023-06-07T18:03:44.102855211Z stderr F [2023/06/07 18:03:44] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/events_137_spark-3cc4886822bd405c80b6a16718547ad4_ys1Iws3p
2023-06-07T18:10:46.119749737Z stderr F [2023/06/07 18:10:46] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/events_138_spark-3cc4886822bd405c80b6a16718547ad4_69gehrN4
2023-06-07T18:14:04.134622779Z stderr F [2023/06/07 18:14:04] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/events_139_spark-3cc4886822bd405c80b6a16718547ad4_q2UV8ypa
2023-06-07T18:15:02.838638712Z stderr F [2023/06/07 18:15:02] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-efd980675cd84f99814cd5ce20c9f17b/events_140_spark-efd980675cd84f99814cd5ce20c9f17b_Xdzd0OL5
2023-06-07T18:15:19.086118196Z stderr F [2023/06/07 18:15:19] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/events_141_spark-3cc4886822bd405c80b6a16718547ad4_f8W0jfZO
2023-06-07T18:23:02.915445149Z stderr F [2023/06/07 18:23:02] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/events_142_spark-3cc4886822bd405c80b6a16718547ad4_9cgENQzg
2023-06-07T18:32:02.955507888Z stderr F [2023/06/07 18:32:02] [ info] [output:s3:s3.3] Successfully uploaded object /spark-event-logs/adhoc/eventlog_v2_spark-efd980675cd84f99814cd5ce20c9f17b/events_143_spark-efd980675cd84f99814cd5ce20c9f17b_5RdoPafS
```
In the target S3 bucket, this produces the following:
```
s3://mybucket/spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/:
events_137_spark-3cc4886822bd405c80b6a16718547ad4_ys1Iws3p
events_138_spark-3cc4886822bd405c80b6a16718547ad4_69gehrN4
events_139_spark-3cc4886822bd405c80b6a16718547ad4_q2UV8ypa
events_141_spark-3cc4886822bd405c80b6a16718547ad4_f8W0jfZO
events_142_spark-3cc4886822bd405c80b6a16718547ad4_9cgENQzg
```

and

```
s3://mybucket/spark-event-logs/adhoc/eventlog_v2_spark-efd980675cd84f99814cd5ce20c9f17b/:
events_140_spark-efd980675cd84f99814cd5ce20c9f17b_Xdzd0OL5
events_143_spark-efd980675cd84f99814cd5ce20c9f17b_5RdoPafS
```
This is not desirable: in the first prefix there's a jump from 139 to 141, and in the second a jump from 140 to 143. What we really want is:
```
s3://mybucket/spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/:
events_001_spark-3cc4886822bd405c80b6a16718547ad4_ys1Iws3p
events_002_spark-3cc4886822bd405c80b6a16718547ad4_69gehrN4
events_003_spark-3cc4886822bd405c80b6a16718547ad4_q2UV8ypa
events_004_spark-3cc4886822bd405c80b6a16718547ad4_f8W0jfZO
events_005_spark-3cc4886822bd405c80b6a16718547ad4_9cgENQzg
```

and

```
s3://mybucket/spark-event-logs/adhoc/eventlog_v2_spark-efd980675cd84f99814cd5ce20c9f17b/:
events_001_spark-efd980675cd84f99814cd5ce20c9f17b_Xdzd0OL5
events_002_spark-efd980675cd84f99814cd5ce20c9f17b_5RdoPafS
```
i.e. each tag's uploads get their own $INDEX counter, as opposed to a single counter shared amongst all uploads on the output. The sequence doesn't even have to start at 001; any starting number works as long as it increases sequentially within a tag. So
```
s3://mybucket/spark-event-logs/adhoc/eventlog_v2_spark-3cc4886822bd405c80b6a16718547ad4/:
events_137_spark-3cc4886822bd405c80b6a16718547ad4_ys1Iws3p
events_138_spark-3cc4886822bd405c80b6a16718547ad4_69gehrN4
events_139_spark-3cc4886822bd405c80b6a16718547ad4_q2UV8ypa
events_140_spark-3cc4886822bd405c80b6a16718547ad4_f8W0jfZO
events_141_spark-3cc4886822bd405c80b6a16718547ad4_9cgENQzg
```

and

```
s3://mybucket/spark-event-logs/adhoc/eventlog_v2_spark-efd980675cd84f99814cd5ce20c9f17b/:
events_145_spark-efd980675cd84f99814cd5ce20c9f17b_Xdzd0OL5
events_146_spark-efd980675cd84f99814cd5ce20c9f17b_5RdoPafS
```
would also work. Let me know if that helps in clarifying the problem.
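In other words, the ask is a counter keyed by tag rather than one counter per S3 output. A minimal Python sketch of the difference (the tag names here are made up for illustration; this is not Fluent Bit's actual implementation):

```python
from collections import defaultdict

# Current behavior: one $INDEX counter shared by every tag on the S3 output.
shared_counter = 0

def next_index_shared(tag):
    global shared_counter
    shared_counter += 1
    return shared_counter

# Requested behavior: an independent counter per tag.
per_tag_counters = defaultdict(int)

def next_index_per_tag(tag):
    per_tag_counters[tag] += 1
    return per_tag_counters[tag]

uploads = ["tag-a", "tag-a", "tag-b", "tag-a", "tag-b"]
print([next_index_shared(t) for t in uploads])   # [1, 2, 3, 4, 5]
print([next_index_per_tag(t) for t in uploads])  # [1, 2, 1, 3, 2]
```

With the shared counter, tag-b's uploads see the "jumps" described above; with per-tag counters, each tag's sequence is gapless.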
I think I get it. Are you running on k8s?
You have multiple tags processed by a single S3 output, and the $INDEX numbers should be sequential within a tag/stream of logs. Currently it just increments up in time within the S3 output.
I'll have to take this as a feature request, which I probably won't be able to prioritize soon sorry. @SoamA you can help by submitting a feature request via AWS Support.
For a short term workaround, I wonder if there's some way you could have multiple S3 outputs, one for each tag, so each one has its own $INDEX. Is that possible? How many tags do you have?
Could you do some sort of metadata rewrite_tag scheme to change the tags to be a small set of meaningful values? (I can help with this if you explain your architecture more).
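If the set of tags were small and known in advance, the multiple-outputs workaround might look roughly like this (the tag values, bucket, and key prefix are placeholders, not from the actual setup); each [OUTPUT] instance maintains its own $INDEX state:

```
[OUTPUT]
    Name            s3
    Match           sel.spark-app-one*
    bucket          mybucket
    s3_key_format   /spark-event-logs/mycluster/eventlog_v2_$TAG[1]/events_$INDEX_$TAG[1]_$UUID

[OUTPUT]
    Name            s3
    Match           sel.spark-app-two*
    bucket          mybucket
    s3_key_format   /spark-event-logs/mycluster/eventlog_v2_$TAG[1]/events_$INDEX_$TAG[1]_$UUID
```

This only helps when the tag values are predictable ahead of time, which turns out not to be the case here since the tags embed randomly generated Spark app IDs.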
Hey @PettitWesley,
> I think I get it. Are you running on k8s?

Yes, we're on EKS.

> You have multiple tags processed by a single S3 output, and the $INDEX numbers should be sequential within a tag/stream of logs. Currently it just increments up in time within the S3 output.
>
> I'll have to take this as a feature request, which I probably won't be able to prioritize soon sorry. @SoamA you can help by submitting a feature request via AWS Support.

Yes, will do. Stay tuned!

> For a short term workaround, I wonder if there's some way you could have multiple S3 outputs, one for each tag, so each one has its own $INDEX. Is that possible? How many tags do you have?
>
> Could you do some sort of metadata rewrite_tag scheme to change the tags to be a small set of meaningful values? (I can help with this if you explain your architecture more).
Here's the relevant INPUT part of the fluent-bit conf:
```
[INPUT]
    Name               tail
    Tag                sel.<spark_internal_app_id>
    Path               /var/log/containers/eventlogs/*\.inprogress
    DB                 /var/log/sel_spark.db
    multiline.parser   docker, cri
    Mem_Buf_Limit      10MB
    Skip_Long_Lines    On
    Refresh_Interval   10
    Tag_Regex          (?<spark_internal_app_id>spark-[a-z0-9]+)
    Buffer_Chunk_Size  1MB
    Buffer_Max_Size    5MB
```
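For reference, the Tag_Regex above can be sanity-checked outside Fluent Bit with a quick Python sketch (the sample filename is taken from the directory listing below):

```python
import re

# Mirrors the Tag_Regex from the [INPUT] block above; the named capture
# group becomes $TAG[1] in the S3 output's s3_key_format.
TAG_REGEX = re.compile(r"(?P<spark_internal_app_id>spark-[a-z0-9]+)")

filename = "spark-3cc4886822bd405c80b6a16718547ad4.inprogress"
m = TAG_REGEX.search(filename)
print(m.group("spark_internal_app_id"))
# -> spark-3cc4886822bd405c80b6a16718547ad4
```

Note the match stops at the `.` before `inprogress`, so the `.inprogress` suffix never leaks into the tag.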
Spark driver processes on an EKS host are configured to output their event logs to /var/log/containers/eventlogs/.
Each log is a single file. They look like:
```
$ ls -al /var/log/containers/eventlogs/
total 106464
drwxrwxrwx 2 root  root     4096 Jun 8 21:18 .
drwxr-xr-x 3 root  root     8192 Jun 8 21:14 ..
-rw-rw---- 1 spark root 41258305 Jun 7 18:30 spark-3cc4886822bd405c80b6a16718547ad4
-rw-rw---- 1 spark root 26285083 Jun 8 21:18 spark-5d6e74c026e44ae594311dd03d2da5bc
-rw-rw---- 1 spark root 41316254 Jun 7 23:16 spark-c1b075c4bf3b491d85e8d2159b141731
-rw-rw---- 1 spark root   135360 Jun 7 18:08 spark-efd980675cd84f99814cd5ce20c9f17b
```
While the Spark driver process is actively generating a log, the file has an .inprogress suffix; once the job has completed running, the suffix is removed. So in Fluent Bit, TAG[1] matches the alphanumeric string in the event log file name (e.g. c1b075c4bf3b491d85e8d2159b141731, efd980675cd84f99814cd5ce20c9f17b from the directory listed above). Because this string is randomly generated by Spark, I don't think we could meaningfully limit the tags to anything smaller, sadly, but I'm open to suggestions!
Submitted feature request in AWS support ticket https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=12990900451.
Fluent Bit Version Info
Cluster Details
Application Details
Steps to reproduce issue
Use the S3 plugin output with an s3_key_format that includes $INDEX. Use Fluent Bit to upload two files simultaneously that match this pattern.
Related Issues