cloudfoundry-community / firehose-to-syslog

Send firehose events from Cloud Foundry to syslog.
MIT License
44 stars 58 forks source link

v5.0.0+ too slow logging with PCF/PAS 2.5 (no problem in v4.1.1) #216

Open daichi703n opened 5 years ago

daichi703n commented 5 years ago

firehose-to-syslog/v5.0.0+ with PCF/PAS 2.5 is too slow logging.

On my env, publish_per_sec is about under 10. Switch to firehose-to-syslog/v4.1.1, publish_per_sec is over 1,000.

Which update cause this issue? How to fix this with tuning parameters?

Environment

I'm using firehose-to-syslog as App on PCF with this manifest.

applications:
- name: firehose-to-syslog
  buildpack: binary_buildpack
  command: ./firehose-to-syslog_linux_amd64
  health-check-type: http
  health-check-http-endpoint: /
  route: firehose-to-syslog.sys.${DOMAIN}
  memory: 2G
  instances: 3
  env:
    GOPACKAGENAME: github.com/cloudfoundry-community/firehose-to-syslog
    API_ENDPOINT: https://api.sys.${DOMAIN}
    DEBUG: false
    DOPPLER_ENDPOINT: wss://doppler.sys.${DOMAIN}:443
    EVENTS: LogMessage,Error
    FIREHOSE_CLIENT_ID: firehose-to-syslog
    FIREHOSE_CLIENT_SECRET: ${FIREHOSE_CLIENT_SECRET}
    FIREHOSE_SUBSCRIPTION_ID: firehose-a
    LOG_EVENT_TOTALS: true
    LOG_EVENT_TOTALS_TIME: 10s
    SKIP_SSL_VALIDATION: true
    SYSLOG_ENDPOINT: ${SYSLOG_ENDPOINT}
    SYSLOG_PROTOCOL: tcp+tls
    CERT_PEM: cert.pem
    ENABLE_STATS_SERVER: true
    CF_PULL_TIME: 0

Result of tests

Version matrix

Matrix of this issue occurrence is as below;

firehose-to-syslog PAS OK/NG(slow)
v5.0.0 2.4.2 OK
v5.0.0 2.4.4 NG
v5.0.0 2.5.2 NG
v5.1.0 2.5.2 NG
v4.1.1 2.5.2 OK

Logs

I checked the performance at https://firehose-to-syslog.sys.${DOMAIN}/stats/app .

NG case:

{
consume: 0,
consume_per_sec: 0,
consume_fail: 0,
consume_http_start_stop: 0,
consume_value_metric: 0,
consume_counter_event: 0,
consume_log_message: 3793,
consume_error: 0,
consume_container_metric: 0,
consume_unknown: 0,
ignored: 0,
forwarded: 0,
publish: 3792,
publish_per_sec: 8,
publish_fail: 0,
slow_consumer_alert: 0,
subinupt_buffer: 2848,
instance_id: 0
}

publish_per_sec is about 10, therefore subinupt_buffer keep increasing.

OK case:

{
consume: 0,
consume_per_sec: 0,
consume_fail: 0,
consume_http_start_stop: 0,
consume_value_metric: 0,
consume_counter_event: 0,
consume_log_message: 34394,
consume_error: 0,
consume_container_metric: 0,
consume_unknown: 0,
ignored: 0,
forwarded: 0,
publish: 34394,
publish_per_sec: 381,
publish_fail: 0,
slow_consumer_alert: 0,
subinupt_buffer: 0,
instance_id: 0
}

publish_per_sec is enough large to logging at realtime.

CF Loggregator Components

I also checked CF loggregator component in PAS2.5.2 works properly. Each test, I got logs at realtime.

CF Components Version

https://docs.pivotal.io/pivotalcf/2-5/pcf-release-notes/runtime-rn.html

Component PAS2.4.2 PAS2.4.4 PAS2.5.2
ubuntu-xenial stemcell 170.15 170.30 250.29
backup-and-restore-sdk 1.12.0 1.12.0 1.12.0
binary-offline-buildpack 1.0.28 1.0.30 1.0.31
bosh-dns-aliases 0.0.3 0.0.3 0.0.3
bosh-system-metrics-forwarder 0.0.16 0.0.16 0.0.18
bpm 0.13.0 1.0.3 1.0.4
capi 1.71.4 1.71.4 1.76.6
cf-autoscaling 216 216 218
cf-backup-and-restore 0.0.11 0.0.11 0.0.11
cf-cli 1.9.0 1.9.0 1.13.0
cf-networking 2.18.2 2.18.2 2.20.0
cf-smoke-tests 40.0.40 40.0.40 40.0.40
cf-syslog-drain 8.0 8.1 8.2
cflinuxfs2 1.259.0 1.267.0 -
cflinuxfs3 0.50.0 0.62.0 0.80.0
consul-drain 0.0.3 0.0.3 0.0.3
consul 198 198 198
credhub 2.1.2 2.1.2 2.1.5
diego 2.22.1 2.22.1 2.27.6
dotnet-core-offline-buildpack 2.2.3 2.2.5 2.2.7
garden-runc 1.16.7 1.18.0 1.19.0
go-offline-buildpack 1.8.30 1.8.33 1.8.35
haproxy 9.3.0 9.3.0 9.4.1
istio - - 1.0.1
java-offline-buildpack 4.16.1 4.16.1 4.18
log-cache 2.0.1 2.0.1 2.1.1
loggregator-agent 2.2 2.3 2.3
loggregator 103.3 103.4 103.4
mapfs 1.1.2 1.1.4 1.1.4
metric-registrar 1.0.4 1.0.4 1.0.4
mysql-monitoring 9.1.0 9.1.0 9.2.0
nats 26 26 26
nfs-volume 1.7.6 1.7.7 1.7.8
nginx-offline-buildpack - - 1.0.9
nodejs-offline-buildpack 1.6.38 1.6.43 1.6.45
notifications-ui 36 36 36
notifications 54 54 55
php-offline-buildpack 4.3.67 4.3.70 4.3.72
push-apps-manager-release 667.0.3 667.0.6 668.0.9
push-usage-service-release 668.0.10 668.0.11 669.0.9
pxc 0.14.0 0.14.2 0.14.2
python-offline-buildpack 1.6.25 1.6.28 1.6.29
r-offline-buildpack - - 1.0.7
routing 0.182.0 0.182.3 0.187.1
ruby-offline-buildpack 1.7.29 1.7.31 1.7.36
silk 2.18.1 2.18.1 2.20.0
smb-volume 0.2.6 0.2.6 1.0.0
staticfile-offline-buildpack 1.4.37 1.4.39 1.4.40
statsd-injector 1.5.0 1.5.0 1.6.0
syslog 11.4.0 11.4.0 11.4.0
uaa 66.0 66.0 71.0
shinji62 commented 5 years ago

@daichi703n Thanks for the report. I guess will be better with the v2 implementation.

Can you test the develop branch please ?

lrstanley commented 5 years ago

Believe I'm noticing the same thing. We've got dual datacenter deploys, one 2.3, and one 2.5 (mid upgrades for everything). Will test out the older version and validate. If I can confirm, will send more details.

daichi703n commented 5 years ago

@shinji62 I tested new one, but publish_per_sec is still about under 10... Do I need some properties?

sba30 commented 4 years ago

@daichi703n Did you get any further with this issue?

daichi703n commented 4 years ago

@sba30 No, I don't get any further.

I re-tested with PAS2.6.3, but the result is same as before. A little things that I figured out, is the throughput depends on EVENT types as following.

EVENT Throughput
LogMessage under 10
ContainerMetric under 10
CounterEvent over 100
sba30 commented 4 years ago

@daichi703n can you try increasing this value to say 600s

cloudfoundry.firehose_cc_pull_interval: description: "full app metadata update interval; defaults to 60s" default: "60s"

and setting this true

cloudfoundry.ignore_missing_apps: description: "Enable throttling on cache lookup for missing apps" default: false

and see if that makes a difference? https://github.com/cloudfoundry-community/logsearch-for-cloudfoundry/blob/develop/jobs/ingestor_cloudfoundry-firehose/spec

daichi703n commented 4 years ago

@sba30

Thanks!!

cloudfoundry.ignore_missing_apps: NO improvements cloudfoundry.firehose_cc_pull_interval: Improved

I had deployed firehose-to-syslog app to CF with CF_PULL_TIME: 0 (env var same with cloudfoundry.firehose_cc_pull_interval).

applications:
- name: firehose-to-syslog
  buildpack: binary_buildpack
  command: ./firehose-to-syslog_linux_amd64
  ...
  env:
    GOPACKAGENAME: github.com/cloudfoundry-community/firehose-to-syslog
    API_ENDPOINT: https://api.sys.${DOMAIN}
    ...
    CF_PULL_TIME: 0 # -> 600s

After set CF_PULL_TIME: 600s (larger than 0) , performance improved!!

Maybe cachingClient (changed in 5.x) causes low performance because it is using tickerTime (cc-pull-time).

@lrstanley

Do your environment has same setting?