Apicurio / apicurio-registry

An API/Schema registry - stores APIs and Schemas.
https://www.apicur.io/registry/
Apache License 2.0

Guidance on running multiple instances? #751

Closed: forsberg closed this 1 year ago

forsberg commented 4 years ago

I can't find any user forum, so sorry for misusing a GitHub issue. Please feel free to tell me where I should ask questions instead of creating GitHub issues.

I'm evaluating Apicurio Registry, primarily for use in a streaming system with Kafka and Avro schemas, but also for its ability to handle AsyncAPI, Protobuf, and JSON schemas.

As I already have a Kafka setup, I would like to use Kafka as the storage backend. This opens up some questions that I can't really find answers to in the documentation:

  1. I would like to run multiple instances of the Registry, primarily for resilience against failures in the underlying infrastructure (i.e. node failure in Kubernetes). Is any special configuration needed for this, or is it just a matter of starting multiple instances and letting a load balancer (Kubernetes Service) spread the load between them? What about the Kafka consumer group ID: should it be set, or left unconfigured?

  2. What are the main pros and cons of the Kafka vs. the Kafka Streams storage backend?

  3. When using the Kafka Streams storage backend, I assume the instances (when there are multiple) need to be able to talk to each other on the application server port?

  4. I also need to run the schema registry in multiple Kubernetes clusters, running in different datacenters. It's OK if only one of the datacenters has update capability; the rest may be read-only. What is the recommended approach for this? My main use case for now is to support one production cluster and one development cluster, so I'm pondering running an instance of the schema registry in the development cluster that talks to the production Kafka as its backend. Will this work? Will it work only with the Kafka backend, or can it also be made to work with the streams backend?

antonmry commented 4 years ago

Based on https://github.com/Apicurio/apicurio-registry/issues/723, it seems the Kafka storage is going to be deprecated in favor of Kafka Streams.

Regarding the read-only version, there is some info here: https://github.com/Apicurio/apicurio-registry/issues/637 but it would be great to have clearer documentation about it, since it's a typical scenario.

EricWittmann commented 4 years ago

Thanks for the question, @forsberg - you are not abusing GH issues at all! Sorry for the delay, but I'm still recovering from a hurricane that hit last week, so I am way behind. I'll do my best to answer your questions here. Additional insights could be provided by @alesj, @jsenko, and @carlesarnal.

  1. Yes - you can simply start up multiple pods and load balance between them - this is true for all the storages. For specifics around configuring the Kafka Streams storage, here are some helpful (hopefully) links:

That second link is the source code for our documentation - normally I would link you to the published documentation but we don't have that set up yet. Coming soon.

  2. As @antonmry correctly mentioned, the Kafka storage is deprecated and should not be used.

  3. Yes - the pods will need to communicate with each other when using Kafka Streams (see the sketch after this list). I forget the technical details about how precisely they do that. @alesj could provide better clarity on that (i.e. required ports, if any).

  4. We're working on a read-only mode but have not implemented it yet. If you want to share a single Kafka cluster across multiple Apicurio Registry installations, I believe you could do that by customizing the Kafka topic names (and perhaps the application ID) they use.
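
To make that concrete, the relevant knobs on the Kafka Streams side are the standard `application.id` and `application.server` settings. A minimal sketch; the broker address, application ID, and port below are illustrative placeholders, not Apicurio's actual defaults:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsStorageConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // All pods of one registry installation share the same application.id;
        // a second installation on the same Kafka cluster would use a
        // different one (and different topic names).
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "apicurio-registry");
        // Each pod advertises its own reachable host:port so that peers can
        // route lookups for keys they don't own to the right instance.
        props.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "pod-host:9000");
        System.out.println(props);
    }
}
```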

forsberg commented 4 years ago

Thanks for your answers, @EricWittmann. I have successfully configured the registry in a proof-of-concept setup in my development Kubernetes cluster, using a Helm chart I made myself. It authenticates to Kafka using SASL, and I can access the API and the UI via port-forwarding, but due to #513 I can't access the UI via my ingress.

Regarding 4, my main use case is that I want applications in my development Kubernetes cluster to be able to talk to a schema registry using (roughly) the same settings as they will use in the production Kubernetes cluster. I.e., no different authentication and preferably the same URL (something like http://schema-registry.apicurio, using a cluster-internal address).

So I see the following options:

A) Have a registry instance running in the development cluster, talking to the production Kafka (running in the production cluster). This could work, but I think there may be problems with the streams backend, at least when using the same application ID as in production, since I assume the instances in the production cluster would then try to talk to the Kafka Streams port of the instance in the development cluster, and that won't work.

I'm a bit new to how Kafka Streams applications work; perhaps I could have the instances in the production cluster and the instance in the development cluster consume the same global-ID and storage topics, but use different application IDs (see the sketch after this list)?

The development cluster instance should be read-only; updates will be made against the production instance, or else the global schema IDs will be... not so global.

B) Use MirrorMaker to mirror the global-ID and storage topics to the development Kafka in the development Kubernetes cluster. That ought to work; it seems #637 has some hints on doing this.

C) Configure an HTTP proxy in the development cluster, forwarding requests to the production cluster and adding the required authentication on the fly.

Option A may be simplest to implement, if it works.
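
For the consumer-group side of option A, here is a rough sketch with a plain Kafka consumer; the bootstrap address, group ID, and storage topic name are all made-up placeholders. The point is simply that a distinct group ID (or Kafka Streams application ID) reads the same topics independently of the production registry:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class IndependentGroupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "prod-kafka:9092");
        // A group.id distinct from the production registry's application.id:
        // this consumer sees the same records without affecting production.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "registry-dev-mirror");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("storage-topic")); // hypothetical name
            consumer.poll(Duration.ofSeconds(5)).forEach(record ->
                    System.out.println("offset " + record.offset()));
        }
    }
}
```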

EricWittmann commented 4 years ago

Some really interesting information here. Can you describe the workflow you envision a little better? Is the goal to only allow changes in the production registry and have the development instance be read-only? What is the workflow of your developers? When/how are schemas added to the registry?

forsberg commented 4 years ago

Well, I'm making this up as I'm building the system, so my vision is perhaps not 100% clear.

The setup I have is that we run one development environment and one production environment. Each environment has its own Kubernetes cluster and its own Kafka cluster, managed by the Strimzi operator.

The system handles IoT data. Right now the devices don't send Avro-formatted data, so I'm building a service that takes the incoming data and converts it into Avro datums, with the schema registered in the schema registry. I envision the devices sending Avro-formatted datums in the future, avoiding the conversion step.

The development cluster right now receives the full stream of data from the devices, and will continue to do so in the future, although we may filter a bit so we get 1/5th of the devices or something, to save on cost for the development cluster.

We have extensive CI/CD integration with GitLab, and at least the web components are using review apps, where every branch gets its own environment in the development cluster where changes can be tested. I will try to make this happen for stream-processing apps as well. This is quite easy to do with the Strimzi operator being responsible for creation/deletion of topics and users, and installation of the apps via Helm charts, so creating a topic with a unique name is easily done.

Since the production cluster and the development cluster share the same input data, and especially given the vision of having the IoT devices deliver Avro datums, keeping the global schema ID global (or at least "global within the organisation") becomes very important. I think having the schema IDs differ between the production and development clusters will lead to mistakes. This is why I envision having a read-only registry in the development cluster that is just a mirror of the data in the production cluster.

When it comes to adding schemas, the nirvana scenario is that only the CI/CD process adds schemas to the production subjects in the schema registry. In fact, programs that only produce should not have to contact the schema registry to acquire the schema ID for serialization; they should retrieve it at build time. In practice, it's probably more convenient to have them retrieve the schema ID at startup, but they should not need write permissions to do so, as the schema should already be registered as part of the CI/CD process. This also ensures schema validity before starting the application.
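
To illustrate that startup-time, read-only lookup, here is a minimal sketch; the registry URL and artifact path are hypothetical placeholders, not necessarily Apicurio's actual REST layout:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StartupSchemaLookupSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical registry URL and artifact path: resolving an
        // already-registered schema only needs an HTTP GET, i.e. no
        // write permissions on the registry.
        URI uri = URI.create(
                "http://schema-registry.apicurio/api/artifacts/my-topic-value");
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("schema: " + response.body());
    }
}
```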

I realize, however, that only being able to work on schemas during CI/CD will be annoying for developers, so developers should be able to add schemas to the production schema registry, but only to development subjects/topics. That's some input for the authentication/authorization discussion in #743.

So that's my line of thought right now. I will happily accept input on whatever best practices exist for this kind of workflow, and any problems you see with the above.

alesj commented 4 years ago

You have a k8s config example here:

alesj commented 4 years ago

The Streams storage distributes data/artifacts via Kafka Streams (doh :-)) -- it's all nicely hidden in the topology impl/config. But the (remote) lookup needs to be custom, as Kafka Streams only provides an in-memory/local impl. For this we use gRPC, so you need to expose APPLICATION_SERVER_HOST/PORT -- that's how it's done in the example above, since the gRPC server runs there.
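
For context on the split: Kafka Streams itself only supplies the metadata telling you which instance owns a given key; the cross-instance call on top of that (gRPC in Apicurio's case) is the custom part. A minimal sketch of the standard metadata lookup, assuming a hypothetical store name and an already-running KafkaStreams instance:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.StreamsMetadata;

public class RemoteLookupSketch {
    // Given a running KafkaStreams instance, find which peer (the host:port
    // it set as application.server) owns the partition holding this key.
    static String ownerOf(KafkaStreams streams, String key) {
        // "artifact-store" is a hypothetical store name for illustration.
        StreamsMetadata meta = streams.metadataForKey(
                "artifact-store", key, Serdes.String().serializer());
        // If meta points at a remote instance, the caller makes a remote
        // call (gRPC in Apicurio's case) to meta.host():meta.port().
        return meta.host() + ":" + meta.port();
    }
}
```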