apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/
Other
807 stars 272 forks source link

We are unable to disable schema refresh functionality in Apollo Router #2674

Open marc-barry opened 1 year ago

marc-barry commented 1 year ago

Describe the bug

We have deployed Apollo Router on our Kubernetes cluster, serving a significant amount of traffic. We are using Apollo Router with managed federation and thus according to https://www.apollographql.com/docs/router/configuration/overview#environment-variables we have specified the following:

As part of our backend service deployment, we have a need to control when Apollo Router loads the new schema from managed federation. It is our experience that there is no way to disable the hot reload functionality in router even though we don't specify the https://www.apollographql.com/docs/router/configuration/overview#--hr----hot-reload argument and we even set APOLLO_ROUTER_HOT_RELOAD=false.

To Reproduce Steps to reproduce the behavior:

  1. Use https://github.com/apollographql/router/tree/dev/helm/chart/router to deploy the router to a Kubernetes cluster.
  2. Ensure that in the values.yaml you provide to the Helm chart install you specify args: [] so that the default args, which includes --hot-reload isn't provided to the deployment: https://github.com/apollographql/router/blob/dev/helm/chart/router/values.yaml#LL22C5-L22C17.
  3. Set APOLLO_ROUTER_HOT_RELOAD=false. Note, this is optional as we removed the command line argument from above and so this shouldn't be necessary but we do it anyhow as want to ensure it is disabled.
  4. Make sure to setup managed federation in values: https://github.com/apollographql/router/blob/dev/helm/chart/router/values.yaml#L24-L31
  5. Start Apollo Router and it will pull the supergraph and it will be able to route.
  6. Publish a subgraph schema update via https://www.apollographql.com/docs/federation/managed-federation/overview/.
  7. In the logs of the router you will see logs which state reloading and reload complete which are reported as an INFO log.

Expected behavior

The hot reload functionality should be disabled and it shouldn't hot reload when it isn't enabled and/or it is explicitly disabled. There appears to be no way to disable it and one is opted into regardless not specifying --hot-reload.

Output

You get info log messages which say reloading and reload complete. We see a significant increase in CPU usage and request latency during hot reloading at our scale. We can see this happen when the hot reload occurs. We accept that hot reload behaves this way but we don't want it to happen at all.

Desktop (please complete the following information):

Additional context

We have also set APOLLO_TELEMETRY_DISABLED=true to see if that has any impact. It didn't appear to do so.

Below is an example of our configuration file (with some small things redacted):

cors:
  allow_credentials: true
  origins:
  - http://localhost:4000
headers:
  all:
    request:
    - propagate:
        matching: .*
health_check:
  listen: 0.0.0.0:8088
include_subgraph_errors:
  all: true
rhai:
  main: main.rhai
  scripts: /dist/rhai
supergraph:
  listen: 0.0.0.0:80
  path: /graphql
telemetry:
  metrics:
    common:
      resources:
        service.name: apollo-router
    prometheus:
      enabled: true
      listen: 0.0.0.0:9090
      path: /metrics
  tracing:
    otlp:
      batch_processor:
        max_concurrent_exports: 1000
        max_export_batch_size: 10000
        max_export_timeout: 100s
        max_queue_size: 10000
        scheduled_delay: 100ms
      endpoint: chronocollector-traces.chronosphere.svc.cluster.local:4317
      protocol: grpc
    propagation:
      trace_context: true
    trace_config:
      service_name: apollo-router
traffic_shaping:
  all:
    compression: gzip
    deduplicate_query: false
    timeout: 95s
  deduplicate_variables: false
  router:
    timeout: 95s
marc-barry commented 1 year ago

The reloading message seems to come from https://github.com/apollographql/router/blob/2ba02921504c7a2ecf20348ab14cf6e771250a82/apollo-router/src/state_machine.rs#L159 and the reload complete comes from https://github.com/apollographql/router/blob/2ba02921504c7a2ecf20348ab14cf6e771250a82/apollo-router/src/state_machine.rs#L175.

garypen commented 1 year ago

I don't if it's part of the problem, but your step 3 should be:

APOLLO_ROUTER_HOT_RELOAD=false (not true)

marc-barry commented 1 year ago

I don't if it's part of the problem, but your step 3 should be:

APOLLO_ROUTER_HOT_RELOAD=false (not true)

Thank you. I made a mistake when typing out the bug report. I confirmed in the pod and the deployment that we have APOLLO_ROUTER_HOT_RELOAD=false.

marc-barry commented 1 year ago

Is it relatively easy for me to produce a build which just comments out the hot reload code (or bypasses it altogether)? I'm not a Rust developer and so I'd like some input on what I could do to produce a build that can't possibly have hot reload enabled.

bnjjj commented 1 year ago

Maybe a solution could be to download your schema with rover and then run the router with the supergraph schema as a file with -s to pass the supergraph schema file or via the env variable then it won't reload when your schema change on studio

marc-barry commented 1 year ago

Maybe a solution could be to download your schema with rover and then run the router with the supergraph schema as a file with -s to pass the supergraph schema file or via the env variable then it won't reload when your schema change on studio

That's not a bad idea. That would also allow us to lessen/eliminate our need for managed federation. We already use rover to produce the supergraph for development and so adapting our production deployment to use the same supergraph might indeed be a possibility.

garypen commented 1 year ago

Just to verify that things are working properly.

If you have access to a running pod, you can do something like this:

kubectl -n <namespace of your router> exec -it <name of your router pod> -- bash
apt update
apt install procps
cat /proc/1/cmdline 

You will see something which looks either like:

/dist/router--hot-reload--config/app/configuration.yaml

(if hot-reload is being explicitly enabled)

OR

/dist/router--config/app/configuration.yaml

(hot-reload is not being explicitly enabled)

(you can clean up your pod packaging later if you like with apt remove procps.)

If you could update the issue with your output that would help narrow down where any potential problem lies.

abernix commented 1 year ago

Thanks for opening this really well explained issue! Based on the description, I'd say this is currently working as designed. I do understand what you're trying to accomplish though — it makes sense — and in that regard, this suggestion from @bnjjj above is (currently) spot on and the currently-recommended approach for the behavior you're looking for:

a solution could be to download your schema with rover and then run the router with the supergraph schema as a file with -s to pass the supergraph schema file or via the env variable then it won't reload when your schema change on studio

Today, the Managed Federation feature (roughly described here) is designed — at its core — to deliver schemas to Routers while avoiding the need to manage orchestrating rolling them over — allowing them to stay online, and in theory benefiting from still-warm caches of reusable aspects of that graph, but mostly to avoid the extra configuration that some users try to avoid. Specifically, it ships those updated supergraphs whena GraphOS Launch has completed without errors. Of course, that is somewhat incompatible for your desired mode of operation here — which again I understand the value of — though it does work for a majority of our customer's workflows.

Put another way, "Hot reload" as a configuration option isn't meant to change the behavior of managed federation, but instead apply to the supergraph file and the configuration file for the Router. (In a few ways, it's great for local development.)

I know some folks here at Apollo have talked about increasing the atomicity/reproducible build aspect of this though — which might build on top of the existing launches feature — and I could expect more on that in the future. If you're interested, I could get you in touch with the right people!

marc-barry commented 1 year ago

If you could update the issue with your output that would help narrow down where any potential problem lies.

@garypen

root@apollo-router-7fd44dcc9b-z7rfj:/dist# cat /proc/1/cmdline
/dist/router--config/app/configuration.yaml

This is what all of our pods have for cmdline.

marc-barry commented 1 year ago

Thanks for opening this really well explained issue! Based on the description, I'd say this is currently working as designed. I do understand what you're trying to accomplish though — it makes sense — and in that regard, this suggestion from @bnjjj above is (currently) spot on and the currently-recommended approach for the behavior you're looking for:

@abernix Thank you for the clear explanation. I guess I don't really understand what happens when hot reload is enabled. I guess we are using managed federation without hot reload which means that we won't reap some of the benefits you mentioned in your explanation.

garypen commented 1 year ago

If you could update the issue with your output that would help narrow down where any potential problem lies.

@garypen

root@apollo-router-7fd44dcc9b-z7rfj:/dist# cat /proc/1/cmdline
/dist/router--config/app/configuration.yaml

This is what all of our pods have for cmdline.

@marc-barry Thanks for the information. Ok, --hot-reload isn't enabled, but you are getting the schema updates (which, perhaps confusingly, aren't controlled by the --hot-reload flag). The --hot-reload flag is watching for configuration changes.

I'm going to rename this issue to we are unable to disable schema refresh..., since it more accurately reflects the actual issue.

BrynCooke commented 1 year ago

I think we should consider adding the option to have the router download the schema once and only once during startup. Having the ability to prevent reloads is a dealbreaker for some, and it would be better if they were still able to use managed federation rather than jumping through hoops to achieve this.