fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

could not pack/validate JSON response #1679

Open acarsercan opened 4 years ago

acarsercan commented 4 years ago

Bug Report

Describe the bug: I'm running fluent-bit inside Kubernetes to ship pod log entries to Elasticsearch.

I can see the majority of the logs coming in; however, I've noticed the following errors from the fluent-bit pods running on each Kubernetes node.

How can I identify which pod is causing the error? How can I resolve the issue below?

logs/fluent-bit-zjbxx[fluent-bit]: [2019/10/23 07:16:53] [error] [out_es] could not pack/validate JSON response 
logs/fluent-bit-nqdwb[fluent-bit]: [2019/10/23 07:17:00] [error] [out_es] could not pack/validate JSON response

Environment: Kubernetes 1.13, fluent-bit Helm chart, fluent-bit 1.2.2

ConfigMap from Kubernetes:

data:
  fluent-bit-filter.conf: |-
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_Tag_Prefix     kube.var.log.containers.
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Merge_Log_Trim On
  fluent-bit-input.conf: |-
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Parser           docker
        Tag              kube.*
        Refresh_Interval 5
        Mem_Buf_Limit    5MB
        Skip_Long_Lines  On
  fluent-bit-output.conf: |-
    [OUTPUT]
        Name  es
        Match *
        Host  XXXX
        Port  9200
        Logstash_Format On
        Retry_Limit False
        Type  flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix kubernetes_cluster
        HTTP_User admin
        HTTP_Passwd XXXX
        tls on
        tls.verify off
        tls.debug 1
        tls.ca_file /secure/es-tls-ca.crt
  fluent-bit-service.conf: |-
    [SERVICE]
        Flush        1
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf
        Parsers_File parsers_custom.conf
  fluent-bit.conf: |-
    @INCLUDE fluent-bit-service.conf
    @INCLUDE fluent-bit-input.conf
    @INCLUDE fluent-bit-filter.conf
    @INCLUDE fluent-bit-output.conf
kind: ConfigMap
metadata:
  annotations:
    flux.weave.works/antecedent: logs:helmrelease/fluent-bit
  creationTimestamp: "2019-05-30T08:05:58Z"
  labels:
    app: fluent-bit
    chart: fluent-bit-2.7.3
    heritage: Tiller
    release: fluent-bit
  name: fluent-bit-config
  namespace: logs
  resourceVersion: "99683776"
  selfLink: /api/v1/namespaces/logs/configmaps/fluent-bit-config
  uid: c1c81cd4-82b1-11e9-bace-06a20d1eb048
edsiper commented 4 years ago

Would you please upgrade to Fluent Bit v1.3.2 and let us know how it behaves?

Note: just in case, always read the upgrade notes: https://docs.fluentbit.io/manual/installation/upgrade_notes

acarsercan commented 4 years ago

Thanks for the reply. I've upgraded since my query, but unfortunately it has made no change. Is there any way for me to track which pod/app is causing [out_es] could not pack/validate JSON response?

I've enabled debug logging in fluent-bit, but unfortunately I'm failing to make much sense of the output:

[2019/11/01 12:13:26] [debug] [out_es Elasticsearch response
{"took":3257,"errors":false,"items":[{"index":{"_index":"kubernetes_cluster-2019.10.31","_type":"_doc","_id":"2AriJm4BBmTb-erSaYRK","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":31403479,"_primary_term":1,"status":201}},{"index":{"_index":"kubernetes_cluster-2019.10.31","_type":"_doc","_id":"2QriJm4BBmTb-erSaYRK","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":31403481,"_primary_term":1,"status":201}},{"index":{"_index":"kubernetes_cluster-2019.10.31","_type":"_doc","_id":"2griJm4BBmTb-erSaYRK","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":31403483,"_primary_term":1,"status":201}},{"index":{"_index":"kubernetes_cluster-2019.10.31","_type":"_doc","_id":"2wriJm4BBmTb-erSaYRK","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":31403485,"_primary_term":1,"status":201}},{"index
[2019/11/01 12:13:26] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:26] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:26] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:26] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:26] [debug] [in_tail] file=/var/log/containers/calico-node-rqljw_kube-system_calico-node-495f261132e303ef2258badecb7448130df05c6898963b51862dd360f5ade638.log read=32575 lines=116
[2019/11/01 12:13:26] [debug] [in_tail] file=/var/log/containers/concourse-web-8d955d58d-xhnmm_concourse_concourse-web-3768034ad7be3f792006afa2e9e3b8eb772d065b632d7fe08d280d3b37cbea46.log read=32445 lines=104
[2019/11/01 12:13:27] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:27] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:27] [debug] [in_tail] file=/var/log/containers/nginx-ingress-internal-controller-574d88b97d-4ntgg_nginx-ingress-internal_nginx-ingress-controller-9e22c1fe05fcbf8213b7e5b3c18642a7e61fb5813ecc96f8bb4eb0baa7cbb859.log read=32606 lines=86
[2019/11/01 12:13:27] [debug] [in_tail] file=/var/log/containers/production-dynamo-latest-schedule-service-6c55cd95dc-47wf2_production-dynamo_dynamo-latest-schedule-service-8aeb20c3bd60fd2d8f0caef778e500e40b3d8b00510494b06e1900c5af4ca83c.log read=6696 lines=18
[2019/11/01 12:13:27] [debug] [filter_kube] could not merge JSON log as requested
[2019/11/01 12:13:27] [debug] [filter_kube] could not merge JSON log as requested
edsiper commented 4 years ago

So the main requirement is: "how do I identify which Pod is not logging in JSON?"

That makes sense; I will improve the message for debugging purposes.
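
In the meantime, one way to narrow it down is to keep the merged fields under a dedicated key with Merge_Log_Key. Records from pods that do log JSON get that key; records where the merge failed keep only the raw log field, and since the filter attaches kubernetes.pod_name to every record, filtering in Elasticsearch for documents without the key shows which pods emit non-JSON lines. A minimal sketch of the relevant filter excerpt (the key name log_processed is just an example):

[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    # Example key: parsed fields land under it only when the merge succeeds,
    # so documents missing it come from pods that are not logging JSON.
    Merge_Log_Key       log_processed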

shivshankarb commented 4 years ago

Hi @edsiper,

I've come across this error with Fluent Bit v1.3.7. Below is the log entry, and we can clearly see that the closing braces '}}' are missing from the response. Could this be related to it reaching the allowed character limit for the log document and stripping out the rest? Below is my log entry for your reference; all the log entries it is failing on have 937 characters.

[2020/02/28 08:08:53] [error] [out_es] could not pack/validate JSON response {"took":203, "errors":true, "items":[ {"index":{ "_index":"logs-2020.02.10", "_type":"_doc","_id":"xxxxxxxxxxxxxxxxxxx", "_version":1,"result":"created", "_shards":{"total":2,"successful":2,"failed":0}, "_seq_no":11379, "_primary_term":1, "status":201 } }, {"index":{ "_index":"logs-2020.02.10", "_type":"_doc", "_id":"xxxxxxxxxxxxxxxxxxx", "_version":1, "result":"created", "_shards":{"total":2,"successful":2,"failed":0}, "_seq_no":11455, "_primary_term":1, "status":201 } }, {"index":{ "_index":"logs-2020.02.10", "_type":"_doc", "_id":"xxxxxxxxxxxxxxxxxxx", "_version":1, "result":"created", "_shards":{"total":2,"successful":2,"failed":0}, "_seq_no":11380,"_primary_term":1,"status":201 } }, {"index":{ "_index":"logs-2020.02.10", "_type":"_doc", "_id":"xxxxxxxxxxxxxxxxxxx", "_version":1, "result":"created", "_shards":{"total":2,"successful":2,"failed":0}, "_seq_no":11456, "_primary_term":1, "status"
[2020/02/28 08:08:53] [ warn] [engine] failed to flush chunk '1-********************.flb', retry in 867 seconds: task_id=14, input=tail.0 > output=es.0
[2020/02/28 08:23:20] [error] [out_es] could not pack/validate JSON response {"took":97,"errors":true,"items":[{"index":{"_index":"logs-2020.02.10","_type":"_doc","_id":"xxxxxxxxxxxxxxxxxxx"

Regards, Shiv
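
For what it's worth, this kind of truncation is usually on the Fluent Bit side rather than in the document itself: the es output reads the bulk response from Elasticsearch into a fixed-size buffer, and when the response is larger than that buffer the JSON cannot be validated, which then leads to the flush retry seen above. A minimal sketch of raising the response buffer on the output (the 512KB value is just an example; per the docs, False removes the limit):

[OUTPUT]
    Name            es
    Match           *
    Host            XXXX
    Port            9200
    Logstash_Format On
    # Buffer used to read the bulk response back from Elasticsearch; if the
    # response does not fit, "could not pack/validate JSON response" is expected.
    Buffer_Size     512KB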

shivshankarb commented 4 years ago

Hi @edsiper, I observed that this error is followed by the warning below for a specific document only:

[2020/02/28 12:50:29] [ warn] [engine] failed to flush chunk '1-1582845332.575759250.flb', retry in 1863 seconds: task_id=14, input=tail.0 > output=es.0
. . .
Feb 28, 2020 @ 00:09:46.200 | [2020/02/28 00:09:46] [ warn] [engine] failed to flush chunk '1-1582845332.575759250.flb', retry in 364 seconds: task_id=14, input=tail.0 > output=es.0

Is there any way I can get rid of this specific document or instruct fluent-bit to ignore it?

Regards, Shiv
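
If the immediate goal is just to stop that one document from being retried forever, one option is to drop the offending records before they reach the output with a grep filter; the Match pattern, key, and regex below are placeholders for whatever identifies the bad record in your stream:

[FILTER]
    Name    grep
    Match   *
    # Placeholder key/regex: exclude any record whose "log" field matches it,
    # so the record is never sent to Elasticsearch.
    Exclude log PATTERN_OF_BAD_RECORD

Alternatively, a finite Retry_Limit on the es output makes Fluent Bit give up on a chunk after that many failed flushes instead of retrying indefinitely.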

andreabolandrina commented 4 years ago

Hi,

I'm getting something similar:

[2020/09/14 17:51:42] [error] [out_es] could not pack/validate JSON response {"took":146,"errors":true,"items":[{"index":{"_index":"k8s-qa-2020.09.14-00001","_type":"_doc","_id":"5t2-jXQBx8Yze5ALb7r5","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":841544,"_primary_term":1,"status":201}},{"index":{"_index":"k8s-qa-2020.09.14-00001","_type":"_doc","_id":"592-jXQBx8Yze5ALb7r5","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":841545,"_primary_term":1,"status":201}},{"index":{"_index":"k8s-qa-2020.09.14-00001","_type":"_doc","_id":"6N2-jXQBx8Yze5ALb7r5","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":841546,"_primary_term":1,"status":201}},{"index":{"_index":"k8s-qa-2020.09.14-00001","_type":"_doc","_id":"6d2-jXQBx8Yze5ALb7r5","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":841547,"_primary_term":1,"status":201}},{"index":{"_index":"k8s-q

Since I first saw those errors, I've upgraded the Helm chart from "2.8.14" to "2.8.16"; I'll see whether it makes any difference. In any case, the error pasted above showed up after the upgrade.

Also, logs like the one above seem to be getting truncated too. I'm not sure whether there is a parameter to avoid the truncation...

Regards, Andrea