erigontech / erigon


healthcheck logic improve #8754

Open BlinkyStitt opened 10 months ago

BlinkyStitt commented 10 months ago

Similar to https://github.com/ledgerwatch/erigon/issues/8752, the health check lies.

Here are logs showing my node is on step 7/15:

Nov 17 02:07:25 i-051c9b46a47292e6a erigon.sh[3687126]: [INFO] [11-17|02:07:25.659] [7/15 Execution] Executed blocks         number=49702498 blk/s=8.4 tx/s=494.0 Mgas/s=129.1 gasState=0.08 batch=70.7MB alloc=1.2GB sys=6.3GB

And here is a curl of eth_syncing saying it isn't syncing:

$ curl localhost:8545 -X POST --data '{"jsonrpc":"2.0","method":"eth_syncing","id":1}' -H "Content-Type: application/json"
{"jsonrpc":"2.0","id":1,"result":false}

And here is a curl of /health reporting every enabled check as HEALTHY, including synced:

$ curl --fail-with-body http://localhost:8545/health --header "X-ERIGON-HEALTHCHECK: max_seconds_behind60" --header "X-ERIGON-HEALTHCHECK: min_peer_count3" --header "X-ERIGON-HEALTHCHECK: synced"
{"check_block":"DISABLED","max_seconds_behind":"HEALTHY","min_peer_count":"HEALTHY","synced":"HEALTHY"}

When I captured these logs, the server was ~8 days behind, so max_seconds_behind60 should definitely be UNHEALTHY.
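Until the endpoint is fixed, the lag can be cross-checked without relying on /health by reading the timestamp of the latest block over JSON-RPC. The following is a minimal sketch, not Erigon's own logic: the helper names (`rpc_call`, `seconds_behind`, `is_stale`, `report`) are hypothetical, and it assumes the rpcdaemon is reachable at the same `localhost:8545` address used in the curl calls above.

```python
import json
import time
import urllib.request


def rpc_call(url, method, params=None):
    """POST a JSON-RPC 2.0 request and return its "result" field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params or [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)["result"]


def seconds_behind(block, now=None):
    """Age of a block in seconds, based on its hex-encoded "timestamp" field."""
    if now is None:
        now = time.time()
    # Clamp at zero so small clock skew never yields a negative lag.
    return max(0, int(now) - int(block["timestamp"], 16))


def is_stale(block, max_seconds_behind, now=None):
    """True when the block is older than the allowed lag."""
    return seconds_behind(block, now) > max_seconds_behind


def report(url="http://localhost:8545", max_seconds_behind=60):
    """Fetch the latest block and describe how far behind wall-clock time it is.

    The default URL and threshold are assumptions taken from the commands above.
    """
    latest = rpc_call(url, "eth_getBlockByNumber", ["latest", False])
    lag = seconds_behind(latest)
    return f"head is {lag}s old; stale: {lag > max_seconds_behind}"
```

Calling `print(report())` on the affected machine should show a lag of several days if the node is in the state described above, regardless of what /health returns.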

rarecrumb commented 10 months ago

Same here, running on Sepolia.

Erigon version: v2.55.0

StatefulSet manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/instance: ethereum
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: erigon
    argocd.argoproj.io/instance: ethereum
    helm.sh/chart: erigon-1.0.8
  name: erigon
  namespace: ethereum
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: ethereum
      app.kubernetes.io/name: erigon
  serviceName: erigon-headless
  template:
    metadata:
      annotations:
        checksum/secrets: 3b29556c4c07d2ac10020f254dab589e6e9c93c8618e7a311d0dcf28be2383e8
        prometheus.io/path: /debug/metrics/prometheus
        prometheus.io/port: "6061"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: ethereum
        app.kubernetes.io/name: erigon
    spec:
      containers:
      - command:
        - sh
        - -ac
        - |
          exec erigon --datadir=/data --nat=extip:$(POD_IP) --port=30303 --http=false --private.api.addr=127.0.0.1:9090 --authrpc.jwtsecret=/data/jwt.hex --authrpc.addr=0.0.0.0 --authrpc.port=8551 --authrpc.vhosts=* --metrics --metrics.addr=0.0.0.0 --metrics.port=6060 --chain=sepolia --internalcl --log.console.json=true --log.console.verbosity=info --log.dir.disable=true --maxpeers=200 --torrent.download.rate=1000mb --torrent.download.slots=100
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: xxx.dkr.ecr.us-east-1.amazonaws.com/erigon:v2.55.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 120
          successThreshold: 1
          tcpSocket:
            port: metrics
          timeoutSeconds: 1
        name: erigon
        ports:
        - containerPort: 30303
          name: p2p-tcp
          protocol: TCP
        - containerPort: 30303
          name: p2p-udp
          protocol: UDP
        - containerPort: 8551
          name: auth-rpc
          protocol: TCP
        - containerPort: 6060
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: metrics
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 10Gi
          requests:
            cpu: "1"
            memory: 8Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: storage
        - mountPath: /data/jwt.hex
          name: jwt
          readOnly: true
          subPath: jwt.hex
      - command:
        - sh
        - -ac
        - |
          while ! nc -z 127.0.0.1 9090; do sleep 1; done; exec rpcdaemon --datadir=/data --private.api.addr=127.0.0.1:9090 --txpool.api.addr=127.0.0.1:9090 --http.addr=0.0.0.0 --http.port=8545 --http.vhosts=* --metrics --metrics.addr=0.0.0.0 --metrics.port=6061 --http.api=eth,erigon,web3,net,debug,trace,txpool,db --ws
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: xxx.dkr.ecr.us-east-1.amazonaws.com/erigon:v2.55.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 120
          successThreshold: 1
          tcpSocket:
            port: http-rpc
          timeoutSeconds: 1
        name: erigon-rpcd
        ports:
        - containerPort: 8545
          name: http-rpc
          protocol: TCP
        - containerPort: 6061
          name: metrics-rpcd
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            httpHeaders:
            - name: Accept
              value: application/json
            - name: X-ERIGON-HEALTHCHECK
              value: min_peer_count2
            - name: X-ERIGON-HEALTHCHECK
              value: synced
            - name: X-ERIGON-HEALTHCHECK
              value: max_seconds_behind60
            path: /health
            port: http-rpc
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 10Gi
          requests:
            cpu: "1"
            memory: 8Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - chown
        - -R
        - 10001:10001
        - /data
        image: busybox:1.34.0
        imagePullPolicy: IfNotPresent
        name: init-chown-data
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: storage
      nodeSelector:
        group: ethereum
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        runAsUser: 10001
      serviceAccount: erigon
      serviceAccountName: erigon
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: group
        operator: Equal
        value: ethereum
      volumes:
      - name: jwt
        secret:
          defaultMode: 420
          secretName: erigon-jwt
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Ti
      storageClassName: fast-gp3
      volumeMode: Filesystem
    status:
      phase: Pending

Latest block call:

$ curl -s -X POST --header 'Content-Type: application/json' localhost:8545 --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest", false],"id":1}' | jq .result.number | xargs printf "%d\n"
3999999

As of this comment, Sepolia is at block 4822615.

Health check shows "HEALTHY":

$ curl -H "X-ERIGON-HEALTHCHECK: synced" -H "X-ERIGON-HEALTHCHECK: max_seconds_behind10" localhost:8545/health && echo
{"check_block":"DISABLED","max_seconds_behind":"HEALTHY","min_peer_count":"DISABLED","synced":"HEALTHY"}
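The mismatch above can be stated mechanically: take the local head, compare it with a chain height obtained from a source you trust, and list which sync-related checks still claim HEALTHY. This is a rough sketch under assumptions: the function names and the 10-block tolerance are hypothetical choices, and the reference height has to be supplied by the caller (for example from a block explorer), since nothing here fetches it.

```python
import json

# Checks from the /health response that are expected to fail on a lagging node.
SYNC_CHECKS = ("synced", "max_seconds_behind")


def blocks_behind(local_number_hex, reference_height):
    """Distance between the local head (hex string from JSON-RPC) and a reference height."""
    return max(0, reference_height - int(local_number_hex, 16))


def contradicted_checks(health_body, lag_blocks, tolerance_blocks=10):
    """Names of sync checks that report HEALTHY even though the node lags.

    tolerance_blocks is an arbitrary illustrative threshold, not an Erigon setting.
    """
    if lag_blocks <= tolerance_blocks:
        return []
    report = json.loads(health_body)
    return [name for name in SYNC_CHECKS if report.get(name) == "HEALTHY"]


# Values taken from this comment: local head 3999999 (0x3d08ff), Sepolia at 4822615.
health = (
    '{"check_block":"DISABLED","max_seconds_behind":"HEALTHY",'
    '"min_peer_count":"DISABLED","synced":"HEALTHY"}'
)
lag = blocks_behind("0x3d08ff", 4822615)
print(lag, contradicted_checks(health, lag))
```

With the numbers from this comment the node is 822616 blocks behind, and both `synced` and `max_seconds_behind` are flagged as contradicting that lag.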