cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Distributor - Number of ingesters #1488

Closed Serrvosky closed 5 years ago

Serrvosky commented 5 years ago

Hello everyone,

Can anyone tell me how many ingesters are necessary?

This is my distributor deployment file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: distributor
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: distributor
    spec:
      containers:
      - name: distributor
        image: quay.io/cortexproject/cortex:master-6d684f65
        imagePullPolicy: IfNotPresent
        args:
        - -target=distributor
        - -log.level=debug
        - -server.http-listen-port=80
        - -consul.hostname=consul.default.svc.cluster.local:8500
        - -distributor.replication-factor=2
        ports:
        - containerPort: 80

and this is my ingesters deployment file:

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ingester
spec:
  replicas: 5

  # Ingesters are not ready for at least 1 min
  # after creation.  This has to be in sync with
  # the ring timeout value, as this will stop a
  # stampede of new ingesters if we should lose
  # some.
  minReadySeconds: 60

  # Having maxSurge 0 and maxUnavailable 1 means
  # the deployment will update one ingester at a time
  # as it will have to stop one (making one unavailable)
  # before it can start one (surge of zero)
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1

  template:
    metadata:
      labels:
        name: ingester
    spec:
      # Give ingesters 40 minutes grace to flush chunks and exit cleanly.
      # Service is available during this time, as long as we don't stop
      # too many ingesters at once.
      terminationGracePeriodSeconds: 2400

      containers:
      - name: ingester
        image: quay.io/cortexproject/cortex:master-6d684f65
        imagePullPolicy: IfNotPresent
        args:
        - -target=ingester
        - -ingester.join-after=30s
        - -ingester.claim-on-rollout=true
        - -consul.hostname=consul.default.svc.cluster.local:8500
        - -s3.url=s3://abc:123@s3.default.svc.cluster.local:4569
        - -dynamodb.original-table-name=cortex
        - -dynamodb.url=dynamodb://user:pass@dynamodb.default.svc.cluster.local:8000
        - -dynamodb.periodic-table.prefix=cortex_weekly_
        - -dynamodb.periodic-table.from=2019-06-01
        - -dynamodb.daily-buckets-from=2019-06-01
        - -dynamodb.base64-buckets-from=2019-06-01
        - -dynamodb.v4-schema-from=2019-06-01
        - -dynamodb.v5-schema-from=2019-06-01
        - -dynamodb.v6-schema-from=2019-06-01
        - -dynamodb.chunk-table.from=2019-06-01
        - -memcached.hostname=memcached.default.svc.cluster.local
        - -memcached.timeout=100ms
        - -memcached.service=memcached
        ports:
        - containerPort: 80
        #readinessProbe:
        #  httpGet:
        #    path: /ready
        #    port: 80
        #  initialDelaySeconds: 15
        #  timeoutSeconds: 1

As you can see, I spin up 5 ingesters first, then wait for everything to come up (pods, registration in Consul, health checks, etc.), and only then deploy the distributors.

However, I'm getting a lot of log lines like this:

level=warn ts=2019-07-02T10:07:48.719546323Z caller=logging.go:49 traceID=5909d18c1b0598b0 msg="POST /api/prom/push (500) 415.597µs Response: \"at least 4 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 4990; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.720214262Z caller=logging.go:49 traceID=42cb99b7e4d6403 msg="POST /api/prom/push (500) 404.171µs Response: \"at least 4 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 4306; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.720461758Z caller=logging.go:49 traceID=2ba6ed33a0136724 msg="POST /api/prom/push (500) 1.888748ms Response: \"at least 3 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5861; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.724230653Z caller=logging.go:49 traceID=5abb13d5575ad94b msg="POST /api/prom/push (500) 620.785µs Response: \"at least 3 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5443; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.808633897Z caller=logging.go:49 traceID=121b69b5928e939a msg="POST /api/prom/push (500) 1.934723ms Response: \"at least 3 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5956; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.811379216Z caller=logging.go:49 traceID=80038b1e98b8644 msg="POST /api/prom/push (500) 5.421684ms Response: \"at least 3 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5718; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.819939803Z caller=logging.go:49 traceID=1902ec7d178c424e msg="POST /api/prom/push (500) 521.814µs Response: \"at least 5 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5924; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "
level=warn ts=2019-07-02T10:07:48.824741619Z caller=logging.go:49 traceID=7b7a37191590c41d msg="POST /api/prom/push (500) 358.946µs Response: \"at least 3 live ingesters required, could only find 2\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5833; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Prometheus-Remote-Write-Version: 0.1.0; X-Scope-Orgid: 0; "

bboreham commented 5 years ago

The best way to troubleshoot this is to open the status page at /ring on one of your distributors in a browser (a plain HTTP request). It shows Cortex's internal view of which ingesters are active, unhealthy, etc. If an entry is outdated, press 'forget' on that line.
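If a browser is not handy, the same page can be fetched from a script. The following is a minimal Go sketch, not part of Cortex: `ringURL` and `fetchRing` are hypothetical helper names, and the base address passed on the command line is an assumption (for example, a port-forwarded distributor). It only retrieves the page; pressing 'forget' still has to be done from the page itself.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// ringURL builds the status page URL from a distributor base address.
// The /ring path is the one mentioned in the comment above.
func ringURL(base string) string {
	return strings.TrimRight(base, "/") + "/ring"
}

// fetchRing performs a plain HTTP GET on the ring status page and
// returns the response body as a string.
func fetchRing(base string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(ringURL(base))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s from %s", resp.Status, ringURL(base))
	}
	return string(body), nil
}

func main() {
	// The distributor address is supplied by the caller; nothing is
	// contacted unless an address is given.
	if len(os.Args) < 2 {
		fmt.Println("usage: ringcheck <distributor base URL>")
		return
	}
	page, err := fetchRing(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching ring page failed:", err)
		return
	}
	fmt.Println(page)
}
```

The output is the raw status page, so entries that are not active have to be picked out by reading it.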

bboreham commented 5 years ago

I think the basic question about sizing is now answered in docs/running.md
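
One possible reading of the numbers in the error messages above is a majority-quorum rule: a write needs n/2 + 1 live ingesters, where n is the larger of the configured replication factor and the number of ring entries picked for that write. This is an assumption, not something confirmed in this thread, and `minSuccess` and `checkQuorum` below are hypothetical names used only for illustration. Under that assumption, "at least 3/4/5 required" with `-distributor.replication-factor=2` would point to replica sets inflated by entries that are not active, which is what the /ring page would show.

```go
package main

import "fmt"

// minSuccess returns how many live ingesters a write would need under a
// majority-quorum rule. ASSUMPTION: the quorum is taken over the larger of
// the configured replication factor and the actual replica-set size.
func minSuccess(replicationFactor, replicaSetSize int) int {
	n := replicationFactor
	if replicaSetSize > n {
		n = replicaSetSize
	}
	return n/2 + 1
}

// checkQuorum returns an error worded like the one in the logs above when
// fewer than minSuccess ingesters in the replica set are live.
func checkQuorum(replicationFactor, replicaSetSize, live int) error {
	need := minSuccess(replicationFactor, replicaSetSize)
	if live < need {
		return fmt.Errorf("at least %d live ingesters required, could only find %d", need, live)
	}
	return nil
}

func main() {
	// Replication factor 2, as in the distributor args above.
	for _, size := range []int{2, 4, 6, 8} {
		fmt.Printf("replica set of %d needs %d live ingesters\n", size, minSuccess(2, size))
	}
	// Two live ingesters out of a replica set of four fails the check.
	fmt.Println(checkQuorum(2, 4, 2))
}
```

With a replication factor of 2 and a clean ring, this rule would require both replicas to be live for every write, so losing a single ingester out of a pair would already fail writes for the series it owns.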