cyberhck opened this issue 10 months ago
Is it correlated with anything? Possible correlations to look at: traffic levels, schema updates.
Nothing that explains a 900 MB spike, and it only lasted a few milliseconds.
At 1:10 in the morning (our time, with low traffic and no schema updates), we saw another spike that reached 2400 MB.
cc @Geal
Can you share your configuration (removing anything sensitive, of course)? Do you use any native or Rhai plugins?
I'm going to upgrade to 1.37.0 soon and see if the spikes persist.
Hello @Geal, I've updated to Apollo Router 1.50.0 with the following configuration (Helm):
```yaml
release:
  name: apollo-router-v1-50-0
  namespace: honestcard
chart:
  name: router
  version: 1.50.0
values: |
  # Default values for router.
  # This is a YAML-formatted file.
  # Declare variables to be passed into your templates.
  # -- See https://www.apollographql.com/docs/router/configuration/overview/#yaml-config-file for yaml structure
  router:
    configuration:
      supergraph:
        listen: 0.0.0.0:8080
        path: /*
        experimental_log_on_broken_pipe: true
        introspection: false
      health_check:
        listen: 0.0.0.0:8081
      telemetry:
        exporters:
          logging:
            common:
              service_name: apollo-router
            stdout:
              tty_format: json
              enabled: true
              format: json
          metrics:
            common:
              service_name: apollo-router
              attributes:
                supergraph:
                  static:
                    - name: "component"
                      value: "supergraph"
                  request:
                    header:
                      - named: "apollographql-client-name"
                      - named: "apollographql-client-version"
                subgraph:
                  all:
                    static:
                      - name: "component"
                        value: "subgraph"
                    request:
                      header:
                        - named: "apollographql-client-name"
                        - named: "apollographql-client-version"
                        - named: "operation-name"
            prometheus:
              enabled: true
              listen: 0.0.0.0:8082
              path: "/metrics"
      apq:
        enabled: true
      csrf:
        required_headers:
          - "x-apollo-operation-name"
          - "apollo-require-preflight"
          - "apollographql-client-name"
          - "apollographql-client-version"
          - "user-agent"
          - "accept-language"
      limits:
        http_max_request_bytes: 800000000
      include_subgraph_errors:
        all: true
      subscription:
        enabled: false
      headers:
        all:
          request:
            - propagate:
                named: apollographql-client-name
            - propagate:
                named: apollographql-client-version
            - propagate:
                named: traceparent
            - propagate:
                named: user-agent
            - propagate:
                named: x-timestamp
            - propagate:
                named: x-is-signature-valid
            - propagate:
                named: x-user-id
            - propagate:
                named: x-raw-authorization-token
            - propagate:
                named: x-user-agent
            - propagate:
                named: x-origin-ip
            - propagate:
                named: x-scopes
            - propagate:
                named: accept-language
            - propagate:
                named: x-anonymous-id
            - remove:
                named: authorization
            - insert:
                name: "operation-name"
                path: ".operationName" # It's a JSON path query to fetch the operation name from request body
                default: "UNKNOWN" # If no operationName has been specified
      traffic_shaping:
        router:
          timeout: 40s # router timeout should be >= than maximum timeout to any subgraph https://www.apollographql.com/docs/router/configuration/traffic-shaping/#timeouts
        subgraphs:
          alternative-payments-service:
            timeout: 40s # alternative-payments-service may return response within 32s (timeout to Alto) in some edge cases and this response must be propagated to clients
      plugins:
        experimental.expose_query_plan: true
      sandbox:
        enabled: false # false for qa and prod
      homepage:
        enabled: true # true for qa and prod
        graph_ref: Honest-API@prod
  containerPorts:
    # -- If you override the port in `router.configuration.server.listen` then make sure to match the listen port here
    http: 8080
    # -- For exposing the health check endpoint
    health: 8081
    # -- For exposing the metrics port when running a serviceMonitor for example
    metrics: 8082
  managedFederation:
    # -- If using managed federation, the graph API key to identify router to Studio
    existingSecret: "public-apollo-graph-api-key"
    # -- If using managed federation, use existing Secret which stores the graph API key instead of creating a new one.
    # If set along `managedFederation.apiKey`, a secret with the graph API key will be created using this parameter as name
    # -- If using managed federation, the variant of which graph to use
    graphRef: "Honest-API@prod"
  # This should not be specified in values.yaml. It's much simpler to use --set-file from helm command line.
  # e.g.: helm ... --set-file supergraphFile="location of your supergraph file"
  extraVolumes: [] # todo: for now, we won't use rhai scripts.
  image:
    repository: ghcr.io/apollographql/router
    pullPolicy: IfNotPresent
  serviceAccount:
    # Specifies whether a service account should be created
    create: true
    # Annotations to add to the service account
    annotations: {}
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  securityContext: {}
  service:
    type: ClusterIP
    port: 80
    annotations: {} # how to tell this service will be owned by spend?
  serviceMonitor:
    enabled: true
  ingress:
    enabled: false
  virtualservice:
    enabled: false
  serviceentry:
    enabled: false
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 15
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  # -- Sets the [pod disruption budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) for Deployment pods
  podDisruptionBudget:
    maxUnavailable: 2
  # -- Sets the [termination grace period](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-execution) for Deployment pods
  terminationGracePeriodSeconds: 30
  probes:
    # -- Configure readiness probe
    readiness:
      initialDelaySeconds: 3
    # -- Configure liveness probe
    liveness:
      initialDelaySeconds: 15
  resources:
    limits:
      cpu: 150m
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 200Mi
  extraEnvVars:
    - name: APOLLO_ROUTER_LOG
      value: info
```
I'm still seeing a bunch of spikes that trigger OOM kills. I know this is not enough to debug something like this; is there anything I can do to help?
I can try and get heaptrack profiles if we merge #5850
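Until then, the only visibility I have is the cAdvisor memory metrics. Below is a minimal sketch of a Prometheus recording rule that keeps the peak working-set memory seen over a 5-minute window, so brief spikes don't get averaged away on dashboards. The rule name and the `container="router"` label are my own assumptions, and anything faster than the scrape interval will still be invisible:

```yaml
# Hypothetical recording rule (not part of the router chart): retain the
# 5-minute peak of the router container's working-set memory so short spikes
# remain visible even when dashboards downsample or average.
groups:
  - name: apollo-router-memory
    rules:
      - record: namespace_container:memory_working_set_bytes:max_over_time_5m
        expr: |
          max_over_time(
            container_memory_working_set_bytes{namespace="honestcard", container="router"}[5m]
          )
```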
**Describe the bug**
I was running the router with 256Mi of memory and it kept getting OOM-killed by Kubernetes, even though the limit looked high enough, which left us quite confused. We then set the memory request and limit both to 512Mi to get Guaranteed QoS from Kubernetes, but it still got OOM-killed. From the GKE logs we found that it was allocating 2986612kB, which is about 2.98GB, and we know it was a spike rather than a gradual increase because it happened too fast for Prometheus to record the value or for the HPA to react. As a test we gave it 4GB of memory, and it has been performing well since: it idles at about 80-90MB, but we still see some very unexpected spikes out of nowhere. As you can see in the screenshot, one very sudden spike reached over 500MB.

Even ignoring that one instance, we see a bunch of other spikes happening periodically. While we no longer see spikes as big as 2GB, we're curious why this is happening.
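To clarify what I mean by "Guaranteed QoS": a minimal sketch of the resources override we tested is below. Kubernetes only assigns the Guaranteed class when every container's requests equal its limits for both CPU and memory; the memory values match the 512Mi mentioned above, while the CPU values here are placeholders rather than our real settings:

```yaml
# Sketch of the resources block for the Guaranteed QoS experiment.
# requests == limits for both CPU and memory -> Guaranteed QoS class.
resources:
  requests:
    cpu: 500m      # placeholder value
    memory: 512Mi
  limits:
    cpu: 500m      # placeholder value
    memory: 512Mi
```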
**To Reproduce**
Steps to reproduce the behavior:
**Expected behavior**
Memory spikes this large should not occur.
**Additional context**
Version: 1.28.1, running on GKE (Kubernetes).