apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

unexplained memory spikes in router #4495

Open cyberhck opened 8 months ago

cyberhck commented 8 months ago

Describe the bug I was running the router with 256Mi of memory, and it kept getting OOM-killed by Kubernetes even though the limits seemed high enough, which left us really confused. We then set the memory request and limit both to 512Mi for a guaranteed QoS class from k8s, but it still got OOM-killed. From the GKE logs we found it had allocated 2986612kB, which is about 2.98GB, and we know it was a spike rather than gradual growth because it happened too fast for Prometheus to record the value or for the HPA to trigger.
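For reference, here is a minimal sketch of the kind of `resources` stanza that yields Kubernetes' Guaranteed QoS class (requests and limits must be identical for every container in the pod); the memory figure mirrors the 512Mi we set, while the CPU value is an assumption for illustration:

```yaml
# Sketch only: a pod is classed as Guaranteed when every container's
# requests equal its limits for both CPU and memory.
resources:
  requests:
    cpu: 100m      # assumed value; the issue only discusses memory
    memory: 512Mi
  limits:
    cpu: 100m
    memory: 512Mi
```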

So we gave it 4GB of memory as a test, and it has been performing well: it idles at about 80-90MB, but we have seen some very unexpected spikes out of nowhere. As you can see in the screenshot, it's a very sudden spike that reached over 500MB.

[screenshot: memory usage graph showing a sudden spike past 500MB]

Even ignoring that one instance, we see a bunch of other spikes happening periodically. While we no longer see spikes as big as 2GB, we're curious why this is happening.

[screenshot: memory usage graph with periodic smaller spikes]

To Reproduce Steps to reproduce the behavior:

  1. Deploy the router on k8s
  2. Monitor memory usage (see the alert-rule sketch after this list)
  3. Observe spikes
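One hypothetical way to make step 2 catch more of these spikes is a Prometheus alerting rule built on `max_over_time`; this is only a sketch (the `container="router"` matcher, the 5m window, and the 250MB threshold are assumptions, not values from this issue), and a spike that rises and falls between two scrapes can still be missed entirely:

```yaml
# Hypothetical alerting rule; label matchers, window, and threshold are assumptions.
groups:
  - name: router-memory
    rules:
      - alert: RouterMemorySpike
        # max_over_time keeps the highest sample in the window, so a brief
        # spike survives until the next rule evaluation instead of being
        # averaged away. It cannot see spikes shorter than the scrape interval.
        expr: max_over_time(container_memory_working_set_bytes{container="router"}[5m]) > 250e6
        labels:
          severity: warning
        annotations:
          summary: "Apollo Router working set exceeded 250MB in the last 5m"
```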

Expected behavior Memory usage should stay close to the idle baseline, without large sudden spikes.


Additional context Version: 1.28.1, running on GKE (Kubernetes).

Geal commented 8 months ago

Is it correlated with anything? Are there any possible correlations to look at?

cyberhck commented 8 months ago

Nothing that explains a 900 MB spike lasting only a few milliseconds.

cyberhck commented 8 months ago

At 1:10 in the morning (our time, low traffic, no schema updates), we saw another spike that reached 2400MB.

[screenshot: memory usage graph showing the spike to 2400MB]

cc @Geal

Geal commented 8 months ago

Can you share your configuration (removing anything sensitive, of course)? Do you use any native or Rhai plugins?

cyberhck commented 8 months ago

I'm going to upgrade to 1.37.0 soon and see if the spikes persist.

cyberhck commented 2 months ago

Hello @Geal, I've updated to Apollo Router 1.50.0 with the following configuration (Helm):

```yaml
release:
  name: apollo-router-v1-50-0
  namespace: honestcard
chart:
  name: router
  version: 1.50.0
values: |
  # Default values for router.
  # This is a YAML-formatted file.
  # Declare variables to be passed into your templates.

  # -- See https://www.apollographql.com/docs/router/configuration/overview/#yaml-config-file for yaml structure
  router:
    configuration:
      supergraph:
        listen: 0.0.0.0:8080
        path: /*
        experimental_log_on_broken_pipe: true
        introspection: false
      health_check:
        listen: 0.0.0.0:8081
      telemetry:
        exporters:
          logging:
            common:
              service_name: apollo-router
            stdout:
              tty_format: json
              enabled: true
              format: json
          metrics:
            common:
              service_name: apollo-router
              attributes:
                supergraph:
                  static:
                    - name: "component"
                      value: "supergraph"
                  request:
                    header:
                      - named: "apollographql-client-name"
                      - named: "apollographql-client-version"
                subgraph:
                  all:
                    static:
                      - name: "component"
                        value: "subgraph"
                    request:
                      header:
                        - named: "apollographql-client-name"
                        - named: "apollographql-client-version"
                        - named: "operation-name"
            prometheus:
              enabled: true
              listen: 0.0.0.0:8082
              path: "/metrics"
      apq:
        enabled: true
      csrf:
        required_headers:
          - "x-apollo-operation-name"
          - "apollo-require-preflight"
          - "apollographql-client-name"
          - "apollographql-client-version"
          - "user-agent"
          - "accept-language"
      limits:
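        # 800000000 bytes ≈ 800 MB maximum accepted request body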
        http_max_request_bytes: 800000000
      include_subgraph_errors:
        all: true
      subscription:
        enabled: false
      headers:
        all:
          request:
            - propagate:
                named: apollographql-client-name
            - propagate:
                named: apollographql-client-version
            - propagate:
                named: traceparent
            - propagate:
                named: user-agent
            - propagate:
                named: x-timestamp
            - propagate:
                named: x-is-signature-valid
            - propagate:
                named: x-user-id
            - propagate:
                named: x-raw-authorization-token
            - propagate:
                named: x-user-agent
            - propagate:
                named: x-origin-ip
            - propagate:
                named: x-scopes
            - propagate:
                named: accept-language
            - propagate:
                named: x-anonymous-id
            - remove:
                named: authorization
            - insert:
                name: "operation-name"
                path: ".operationName" # It's a JSON path query to fetch the operation name from request body
                default: "UNKNOWN" # If no operationName has been specified
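                # e.g. a body of {"operationName": "GetUser", ...} yields the header
                # `operation-name: GetUser` on subgraph requests; GetUser here is a
                # hypothetical operation name for illustration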
      traffic_shaping:
        router:
          timeout: 40s # router timeout should be >= than maximum timeout to any subgraph https://www.apollographql.com/docs/router/configuration/traffic-shaping/#timeouts
        subgraphs:
          alternative-payments-service:
            timeout: 40s # alternative-payments-service may return response within 32s (timeout to Alto) in some edge cases and this response must be propagated to clients
      plugins:
        experimental.expose_query_plan: true
      sandbox:
        enabled: false # false for qa and prod
      homepage:
        enabled: true # true for qa and prod
        graph_ref: Honest-API@prod
  containerPorts:
    # -- If you override the port in `router.configuration.server.listen` then make sure to match the listen port here
    http: 8080
    # -- For exposing the health check endpoint
    health: 8081
    # -- For exposing the metrics port when running a serviceMonitor for example
    metrics: 8082
  managedFederation:
    # -- If using managed federation, the graph API key to identify router to Studio
    # -- If using managed federation, use existing Secret which stores the graph API key instead of creating a new one.
    # If set along `managedFederation.apiKey`, a secret with the graph API key will be created using this parameter as name
    existingSecret: "public-apollo-graph-api-key"
    # -- If using managed federation, the variant of which graph to use
    graphRef: "Honest-API@prod"
  # This should not be specified in values.yaml. It's much simpler to use --set-file from helm command line.
  # e.g.: helm ... --set-file supergraphFile="location of your supergraph file"
  extraVolumes: [] # todo: for now, we won't use rhai scripts.
  image:
    repository: ghcr.io/apollographql/router
    pullPolicy: IfNotPresent
  serviceAccount:
    # Specifies whether a service account should be created
    create: true
    # Annotations to add to the service account
    annotations: {}
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  securityContext: {}
  service:
    type: ClusterIP
    port: 80
    annotations: {} # how to tell this service will be owned by spend?
  serviceMonitor:
    enabled: true
  ingress:
    enabled: false
  virtualservice:
    enabled: false
  serviceentry:
    enabled: false
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 15
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  # -- Sets the [pod disruption budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) for Deployment pods
  podDisruptionBudget:
    maxUnavailable: 2
  # -- Sets the [termination grace period](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-execution) for Deployment pods
  terminationGracePeriodSeconds: 30
  probes:
    # -- Configure readiness probe
    readiness:
      initialDelaySeconds: 3
    # -- Configure liveness probe
    liveness:
      initialDelaySeconds: 15
  resources:
    limits:
      cpu: 150m
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 200Mi
  extraEnvVars:
    - name: APOLLO_ROUTER_LOG
      value: info
```

I'm still seeing a bunch of spikes that trigger OOM kills. I know this isn't much to go on for debugging something like this; is there anything I can do to help?

cyberhck commented 1 month ago

I can try to get heaptrack profiles if we merge #5850.
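For reference, one hypothetical way to capture such a profile on Kubernetes would be to wrap the router entrypoint with heaptrack in the Deployment spec. This is only a sketch: it assumes heaptrack is installed in the image (it is not in the stock router image), and the `/dist/router` and `/app/configuration.yaml` paths are assumptions that must match the actual image layout:

```yaml
# Hypothetical sketch: run the router under heaptrack inside the pod.
# Assumes heaptrack exists in the image; binary and config paths are
# assumptions and must match the image's actual entrypoint and config.
containers:
  - name: router
    image: ghcr.io/apollographql/router:v1.50.0
    # heaptrack <app> [args...] writes a heaptrack.<app>.<pid> data file
    # into the working directory; copy it out with `kubectl cp` and open
    # it locally with `heaptrack --analyze` or heaptrack_gui.
    command: ["heaptrack"]
    args: ["/dist/router", "--config", "/app/configuration.yaml"]
```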