SumoLogic / fluentd-kubernetes-sumologic

FluentD plugin to extract logs from Kubernetes clusters, enrich them, and ship them to Sumo Logic.
Apache License 2.0

Missing collector for scheduled (success|failure) events #68

Open krx252525 opened 6 years ago

krx252525 commented 6 years ago

Primary Concern

I'd like some help to understand whether or not I've missed something when following the README and the guides on help.sumologic.com ... kubernetes.

I seem to have most dashboards working, with the exception of scheduler-related panels such as Kubernetes - Overview -> Pods Scheduled By Namespace, which is driven by the following query:

_sourceCategory = *kube-scheduler*
| timeslice 1h
| parse "Successfully assigned * to *\"" as name2,node
| parse "reason: '*'" as reason
| parse "type: '*'" as normal
| parse "Name:\\\"*\\\"" as name
| parse "Namespace:\\\"*\\\"" as namespace
| parse "Kind:\\\"*\\\"" as kind
| count by _timeslice, namespace
| transpose row _timeslice column namespace
| fillmissing timeslice(1h) 

The problem is that the line this query relies on is not logged by the scheduler but emitted as an event. The only piece of the documented setup that I can see being able to push this to Sumo is the sumologic-k8s-api script, which noticeably lacks any calls to /api/v1/events as well as the role permission needed to call it.

I've tested a fix which would add these log lines and can submit it as a PR against sumologic-k8s-api but I feel like I've missed something obvious.
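To make the gap concrete, here is a minimal sketch of what the dashboard query is effectively trying to compute, done against the events API payload instead of scheduler logs. It assumes the standard Event fields (reason, message, involvedObject) and a "Scheduled" reason on the scheduler's events; the function name and the sample data are made up for illustration and are not part of sumologic-k8s-api.

```python
import re
from collections import Counter

# Matches the message the query looks for, e.g. "Successfully assigned web-1 to node-a".
ASSIGNED = re.compile(r"Successfully assigned (\S+) to (\S+)")


def scheduled_by_namespace(events):
    """Count successful scheduling events per namespace of the involved object."""
    counts = Counter()
    for event in events.get("items", []):
        # Assumption: the scheduler marks these events with reason "Scheduled".
        if event.get("reason") != "Scheduled":
            continue
        if not ASSIGNED.search(event.get("message", "")):
            continue
        namespace = event.get("involvedObject", {}).get("namespace", "unknown")
        counts[namespace] += 1
    return dict(counts)


# Hypothetical sample shaped like an /api/v1/events response; all values are invented.
sample = {
    "items": [
        {"reason": "Scheduled", "type": "Normal",
         "message": "Successfully assigned web-1 to node-a",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-1"}},
        {"reason": "Scheduled", "type": "Normal",
         "message": "Successfully assigned web-2 to node-b",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-2"}},
        {"reason": "Scheduled", "type": "Normal",
         "message": "Successfully assigned dns-1 to node-a",
         "involvedObject": {"kind": "Pod", "namespace": "kube-system", "name": "dns-1"}},
        {"reason": "Pulled", "type": "Normal",
         "message": "Container image already present on machine",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-1"}},
    ]
}

print(scheduled_by_namespace(sample))
```

The point of the sketch is only that the namespace and the "Successfully assigned" text are both available on the event object, so nothing needs to come from the scheduler's own log output.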

Secondary concern

Some of the panels are driven by queries that extract fields in a way that doesn't fill me with confidence that I've got things configured correctly. For example, Kubernetes - Controller Manager -> Event Severity Trend uses the following query:

_sourceCategory = *kube-controller-manager*
| parse "\"message\":\"*\"" as message
| parse "\"source\":\"*.*:*\"" as resource,resource_action,resource_code
| parse "\"severity\":\"*\"" as severity
| fields - resource_action, resource_code 
| timeslice 1h
| count _timeslice, severity 
| transpose row _timeslice column severity
| fillmissing timeslice(1h) 

Which matches this log line:

{
"timestamp": 1528785188171,
"severity": "I",
"pid": "1",
"source": "round_trippers.go:439", 
"message": "Response Status: 200 OK in 2 milliseconds"
}

Here resource_action and resource_code would match go and 439 respectively. Is this correct?
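As a rough check of that reading, the wildcard pattern can be approximated with a non-greedy regular expression. This is only a sketch: it assumes each `*` in the parse expression behaves like a lazy match and that the raw log line is compact JSON without spaces after the colons; it is not a statement about how Sumo Logic implements the operator.

```python
import json
import re

# Approximation of: parse "\"source\":\"*.*:*\"" as resource,resource_action,resource_code
# Each * is modelled as a non-greedy group (an assumption about the parse operator).
SOURCE_PATTERN = re.compile(r'"source":"(.*?)\.(.*?):(.*?)"')

# The log line from above, serialized compactly the way it would appear on one line.
record = {
    "timestamp": 1528785188171,
    "severity": "I",
    "pid": "1",
    "source": "round_trippers.go:439",
    "message": "Response Status: 200 OK in 2 milliseconds",
}
line = json.dumps(record, separators=(",", ":"))

match = SOURCE_PATTERN.search(line)
resource, resource_action, resource_code = match.groups()
print(resource, resource_action, resource_code)
```

Under those assumptions the file name lands in resource, the file extension in resource_action, and the line number in resource_code, which is why the field names look odd for this log format.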

frankreno commented 6 years ago

@keir-rex can you provide the following information?

  1. What version of k8s?
  2. Where is it running?
  3. Is it a managed service (GKE/EKS), or do you manage the cluster yourself (kops/kubeadm)?
  4. Can you share your YAML?

These logs did exist at some point. It is very possible they have been tweaked in a new release, or that the scheduler's underlying logging has changed, so this information will help me figure out what is going on.

krx252525 commented 6 years ago

@frankreno

  1. v1.9.6 (kubectl version output below)
  2. AWS
  3. kops
  4. Provided below

sumologic-k8s-api

I rebuilt your image to also hit /api/v1/events. You can see the diff here:

        log.info("getting data for events")
        events = requests.get(url="{}/api/v1/events".format(self.k8s_api_url)).json()
        for event in events["items"]:
            log.info("pushing to sumo")
            requests.post(url=self.collector_url,
                          data=json.dumps(event),
                          headers=self.headers)

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 10
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccount: sumologic-k8s-api
          restartPolicy: OnFailure
          containers:
          - name:  sumologic-k8s-api
            imagePullPolicy: Always
            image: frankreno/sumologic-k8s-api:events
            env:
            - name: SUMO_HTTP_URL
              value: <INSERT_URL_HERE>
            - name: K8S_API_URL
              value: http://127.0.0.1:8001
            - name: X-Sumo-Category
              value: k8s/api
            - name: X-Sumo-Name
              value: sumologic-k8s-api
          - name:  kubectl
            image: gcr.io/google_containers/kubectl:v1.0.7
            command: ["/kubectl"]
            args: ["proxy", "-p", "8001"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "events"]
  verbs: ["get", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
subjects:
- kind: ServiceAccount
  name: sumologic-k8s-api
  namespace: default
roleRef:
  kind: ClusterRole
  name: sumologic-k8s-api
  apiGroup: rbac.authorization.k8s.io

fluentd-kubernetes-sumologic

This is basically vanilla:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd

---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - list
  - watch

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  # This namespace setting will limit fluentd to watching/listing/getting pods in the default namespace. If you want it to be able to log your kube-system namespace as well, comment the line out.
  namespace: default

--- 
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-sumologic
  labels:
    app: fluentd-sumologic
    version: v1
spec:
  template:
    metadata:
      labels:
        name: fluentd-sumologic
    spec:
      serviceAccountName: fluentd
      volumes:
      - name: pos-files
        emptyDir: {}
      - name: host-logs
        hostPath:
          path: /var/log/
      - name: docker-logs
        hostPath:
          path: /var/lib/docker
      containers:
      - image: sumologic/fluentd-kubernetes-sumologic:latest
        name: fluentd
        imagePullPolicy: Always
        volumeMounts:
        - name: host-logs
          mountPath: /mnt/log/
          readOnly: true
        - name: host-logs
          mountPath: /var/log/
          readOnly: true
        - name: docker-logs
          mountPath: /var/lib/docker/
          readOnly: true
        - name: pos-files
          mountPath: /mnt/pos/
        env:
        - name: COLLECTOR_URL
          valueFrom:
            secretKeyRef:
              name: sumologic
              key: collector-url
      tolerations:
          #- operator: "Exists"
          - effect: "NoSchedule"
            key: "node-role.kubernetes.io/master"

kubectl version:


Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:12Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

frankreno commented 6 years ago

@keir-rex thanks for the info. This appears to be a change in 1.9.x: I have a 1.8 cluster and a 1.9 cluster, and the scheduler is not producing the same logs on both. I will try to track this down to the source and work on a remediation.

krx252525 commented 6 years ago

Cheers @frankreno let me know if there's anything I can help with

frankreno commented 6 years ago

@keir-rex still no response from the folks on the k8s scheduling team, so I do not yet have a good answer as to why this changed or how to remedy it. I found the code where the log used to be generated and see no changes that would account for this, which suggests the change is coming not from the scheduler but from somewhere else. Will keep you updated. Long term, we are working on a new metrics collection strategy for Kubernetes that does not use Heapster, which will allow us to collect from many more data sources and provide insight into this. Let's keep this issue open until we solve it one of those ways...

krx252525 commented 6 years ago

Sounds good @frankreno. I'll throw together something which does de-duping of events since we need that anyway.
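A minimal sketch of what that de-duping could look like, assuming each event carries metadata.uid and metadata.resourceVersion and that an unchanged pair means the event was already pushed. The function name and the in-memory set are illustrative only; since the collector runs as a CronJob, a real version would need to persist the seen keys between runs.

```python
def new_events(events, seen):
    """Return only events whose (uid, resourceVersion) pair has not been seen yet.

    `seen` is a set that the caller keeps between polls; it is updated in place.
    """
    fresh = []
    for event in events.get("items", []):
        meta = event.get("metadata", {})
        # A repeated occurrence keeps the uid but gets a new resourceVersion,
        # so updated events are forwarded again while exact repeats are skipped.
        key = (meta.get("uid"), meta.get("resourceVersion"))
        if key in seen:
            continue
        seen.add(key)
        fresh.append(event)
    return fresh


# Hypothetical polls; uids and resource versions are invented for illustration.
seen_keys = set()
first_poll = {"items": [
    {"metadata": {"uid": "a1", "resourceVersion": "100"}, "reason": "Scheduled"},
    {"metadata": {"uid": "b2", "resourceVersion": "101"}, "reason": "Pulled"},
]}
second_poll = {"items": [
    {"metadata": {"uid": "a1", "resourceVersion": "100"}, "reason": "Scheduled"},
    {"metadata": {"uid": "b2", "resourceVersion": "150"}, "reason": "Pulled"},
]}

print(len(new_events(first_poll, seen_keys)))
print(len(new_events(second_poll, seen_keys)))
```

Keying on the uid alone would drop updates to recurring events, which may or may not be what the dashboards want; that choice is left open here.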

Could you comment on the second query in my initial post?

Cheers

ankitgoelcmu commented 6 years ago

@keir-rex that's right. I see [218, 42, 205, 363 and 374] as code, 'event' as a resource, and 'go' as resource_action. I do have to revisit these, though, to make sure they follow proper naming conventions.