elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

drop_fields processor is surprisingly slow #30815

Open henrikno opened 2 years ago

henrikno commented 2 years ago

Hi, we're parsing some JSON-formatted, high-volume logs and noticed that we could only process up to 8000 documents per second per filebeat. I tried to narrow down what the bottleneck was. By just reading the file and parsing it as JSON, I could get up to 20k documents per second. Then I added back our list of processors, which rename some fields, drop some unnecessary fields, etc. I found that the slowest processor seems to be drop_fields: dropping the first 16 fields alone takes it from 20k/s to 12k/s, and the other, conditional one makes it drop to 8k/s.

Sample filebeat config:

filebeat:
  inputs:
    - type: log
      paths:
        - /var/log/proxy.requests*.log*
      tags: [ "jsonformat", "proxy-requests" ]
      close_timeout: 2h
      json:
        keys_under_root: true
        overwrite_keys: true
# note our config uses filebeat.autodiscover but it doesn't seem to impact performance.

processors:
  # from 20k/s to 13k/s
  - drop_fields:
      fields:
        - offset
        - cloud.image.id
        - cloud.account.id
        - docker.container.command
        - docker.container.labels.co.elastic.vcs-branch
        - docker.container.labels.co.elastic.ci.worker
        - docker.container.labels.author
        - docker.container.labels.org.label-schema.build-date
        - docker.container.labels.org.label-schema.license
        - docker.container.labels.org.label-schema.name
        - docker.container.labels.org.label-schema.vendor
        - docker.container.labels.org.label-schema.schema-version
        - docker.container.labels.org.label-schema.url
        - docker.container.labels.org.label-schema.usage
        - docker.container.labels.org.label-schema.vcs-url
        - docker.container.labels.org.opencontainers.image

# from 13k/s to 8k/s
  - drop_fields:
      when:
        and:
          - contains:
              container.name: "proxy"
          - not.equals:
              docker.container.name: "rollup"
      fields:
        - host
        - beat.hostname
        - docker.container.labels
        - docker.container.name
        - docker.container.id
        - meta.cloud.availability_zone
        - meta.cloud.instance_id
        - meta.cloud.provider
        - meta.cloud.region
        - metadata_architecture
        - metadata_availabilityZone
        - metadata_billingProducts
        - metadata_devpayProductCodes
        - metadata_imageId
        - metadata_instanceId
        - metadata_instanceType
        - metadata_kernelId
        - metadata_marketplaceProductCodes
        - metadata_pendingTime
        - metadata_ramdiskId
        - metadata_region
        - metadata_version
        - availability_zones
        - capacity
        - instance_count
        - instance_capacity

output:
  file:
    path: "/dev/"
    filename: "null"

I was watching the number of events per second while commenting out each processor, and each time the performance increased:

[screenshot: events per second increasing as each processor was removed]

Turning on debug logging, I see some failures to remove fields that don't exist (some of them are optional, and some have been renamed in later versions), but it shouldn't be that expensive to check whether a field exists?

2022-03-14T21:08:20.235Z        DEBUG   [processors]    processing/processors.go:128    Fail to apply processor global{drop_fields={"Fields":["offset","cloud.image.id","cloud.account.id","docker.container.command","docker.container.labels.co.elastic.vcs-branch","docker.container.labels.co.elastic.ci.worker","docker.container.labels.author","docker.container.labels.org.label-schema.build-date","docker.container.labels.org.label-schema.license","docker.container.labels.org.label-schema.name","docker.container.labels.org.label-schema.vendor","docker.container.labels.org.label-schema.schema-version","docker.container.labels.org.label-schema.url","docker.container.labels.org.label-schema.usage","docker.container.labels.org.label-schema.vcs-url","docker.container.labels.org.opencontainers.image"],"IgnoreMissing":false}, drop_fields={"Fields":["host","beat.hostname","docker.container.labels","docker.container.name","docker.container.id","meta.cloud.availability_zone","meta.cloud.instance_id","meta.cloud.provider","meta.cloud.region","metadata_architecture","metadata_availabilityZone","metadata_billingProducts","metadata_devpayProductCodes","metadata_imageId","metadata_instanceId","metadata_instanceType","metadata_kernelId","metadata_marketplaceProductCodes","metadata_pendingTime","metadata_ramdiskId","metadata_region","metadata_version","availability_zones","capacity","instance_count","instance_capacity"],"IgnoreMissing":false}, condition=contains: map[] and !equals: map[docker.container.name:0x1e39120]}: failed to drop field [offset]: key not found; failed to drop field [cloud.image.id]: key not found; failed to drop field [cloud.account.id]: key not found; failed to drop field [docker.container.command]: key not found; failed to drop field [docker.container.labels.co.elastic.vcs-branch]: key not found; failed to drop field [docker.container.labels.co.elastic.ci.worker]: key not found; failed to drop field [docker.container.labels.author]: key not found; failed to drop field [docker.container.labels.org.label-schema.build-date]: key not found; failed to drop field [docker.container.labels.org.label-schema.license]: key not found; failed to drop field [docker.container.labels.org.label-schema.name]: key not found; failed to drop field [docker.container.labels.org.label-schema.vendor]: key not found; failed to drop field [docker.container.labels.org.label-schema.schema-version]: key not found; failed to drop field [docker.container.labels.org.label-schema.url]: key not found; failed to drop field [docker.container.labels.org.label-schema.usage]: key not found; failed to drop field [docker.container.labels.org.label-schema.vcs-url]: key not found; failed to drop field [docker.container.labels.org.opencontainers.image]: key not found

(Tested on 7.12, but also tried 8.1.0 with no improvement in performance)

henrikno commented 2 years ago

This hacky workaround seems to help some (from 8k/s to 13k/s):

  - rename:
      fields:
        - from: metadata_architecture
          to: drop.metadata_architecture
        - from: metadata_availabilityZone
          to: drop.metadata_availabilityZone
          # etc
  - drop_fields:
      fields:
        - drop

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 2 years ago

Can you start filebeat with profiling enabled and capture a CPU profile (ideally more than one) with the slow processors running? That will tell us where filebeat is spending its time.

You can start the profiling server with the --httpprof command line option (command documentation) or the configuration file (http endpoint documentation). You need to set at least http.enabled and http.pprof.enabled to true if using the configuration file.

Once you've started filebeat with profiling enabled, you can capture 30-second CPU profiles with curl $HOST:$PORT/debug/pprof/profile -o profile.prof or equivalent.
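For reference, the configuration-file route looks roughly like the following. This is a sketch, not an exact config: only the two enabled flags are strictly required, and the host/port lines just spell out the endpoint's defaults.

http.enabled: true
http.pprof.enabled: true
# optional, these are the defaults:
http.host: localhost
http.port: 5066

With that in place, the curl command above (against localhost:5066) should produce a profile that go tool pprof can open.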

henrikno commented 2 years ago

Attached: 3 CPU profiles plus a heap profile (profile.zip). These are from filebeat 8.1.0.

cmacknz commented 2 years ago

Awesome, thanks! All three CPU profiles look similar, which is good. Here's a flamegraph from the first profile:

[flamegraph screenshot from the first CPU profile]

I see a bunch of time being spent formatting error messages in the dropFields processors (errors.Wrapf), which lines up with what you've observed above.

cmacknz commented 2 years ago

This is the code path contributing to a large part of the slowness here:

https://github.com/elastic/beats/blob/4a71e5e7eb8faa711785aee434a2aadd248439e0/libbeat/processors/actions/drop_fields.go#L68-L81

We are building up a slice of errors, so the more errors there are, the more expensive it becomes: we both incur calls to errors.Wrapf and have to grow the slice as errors are appended (its backing array is reallocated, roughly doubling in capacity, each time it fills up).

henrikno commented 2 years ago

Ha, just noticed it now supports ignore_missing, which was added in 7.4. We probably added the processor before it was an option. That should make it a lot cheaper.
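For reference, that should just be a one-line addition per drop_fields block in the config above, something like:

  - drop_fields:
      ignore_missing: true
      fields:
        - offset
        - cloud.image.id
        # ... same list as above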

cmacknz commented 2 years ago

Yes, I was just about to suggest that. If it is still slow, try to capture another profile if it isn't too much work; the problem is much easier to find that way. If you can't, it will likely just be the next slowest path in the current set of profiles.

henrikno commented 2 years ago

Went from 10k docs/s to 20k/s! For reference, I also ran without any processors, which reaches 30k/s. profile2.zip

I suppose Go/the processors don't have something along the lines of if log.isDebugEnabled { do_error_reporting } to avoid constructing the errors if they're never logged?

cmacknz commented 2 years ago

I suppose Go/the processors don't have something along the lines of if log.isDebugEnabled { do_error_reporting } to avoid constructing the errors if they're never logged?

No, processor error handling can definitely be improved.

Interestingly, in the profile2.zip you attached, the profile-no-processors profile still has processors running, though a different set than the profile-ignore-missing profile.

Here is the processing flame graph of the "no processors" case:

[flamegraph: no-processors profile]

Here is the processing flame graph of the "ignore missing" processors case:

[flamegraph: processors-ignore-missing profile]

The profiles are largely the same outside of the processing call stack, so I've omitted those branches.

cmacknz commented 2 years ago

The rename and drop_fields processors now spend most of their time in https://github.com/elastic/beats/blob/4bc4d19c28851306086e7f3abbe45b389d740c37/libbeat/common/mapstr.go#L436-L448

I also see the add dynamic fields metadata processor call stack spending the majority of its time in https://github.com/elastic/beats/blob/4bc4d19c28851306086e7f3abbe45b389d740c37/libbeat/common/mapstr.go#L150-L152

The improvements to make here aren't as obvious.

zez3 commented 1 year ago

Hmm, @henrikno what does your beat.stats.libbeat.output.events.dropped show?

@cmacknz perhaps the Agent dashboard should include the dropped events as well

botelastic[bot] commented 11 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

zez3 commented 11 months ago

ahh :(

cmacknz commented 11 months ago

We have reproduced something similar to this in our internal benchmarks, no fix yet unfortunately.