Hi all devs! Between the lines I read two truths.
The first is that you have chosen the right way to split up what used to be a monolithic codebase; such a split will increase flexibility and manageability, which in turn allows you to intelligently identify bottlenecks and correct architectural problems.
The second is that Red Hat is continuing its marketing campaign to promote OpenShift and is using every possible method to do so. This worries me, because as an infrastructure operator I am afraid to imagine what changes I will need to make in the infrastructure I support to get a working product (CFME).
Containers allow developers to simplify the process of deploying code, but only for those who do not know how to package code correctly. For Red Hat this problem was solved long ago (by RPMs), and I would suggest not playing with this toy, discarding the excess, and concentrating on the code inside the workers rather than outside of them, or rather on making the current code in CFME finally work.
For such a key provider as RHOSP, I would suggest creating a distributed test site (~1000 tenants, networks, and VMs), because right now the code suffers from a devstack/packstack syndrome: it is tested on a small developer's laptop, and all the related problems follow from that. I cannot say that the OpenStack provider is in working order; the current code is not suitable for serious service.
For distributed deployment, Red Hat now has a good tool: Ansible. I would suggest simply moving away from the big binary appliance, splitting the cfme package into small RPMs (the same granularity as your pods), and describing the infrastructure (the one you specified above for OpenShift) as deployment code in an Ansible playbook. This would solve the same problems in a simpler way that is more acceptable to the end user, and at the same time it would push you to implement the intended refactoring of the architecture without using children's toys (such as containers).
@ITD27M01 Interesting feedback; I can see your concern about using Kubernetes/OpenShift. However, just to add my two cents: OpenShift and Kubernetes provide a lot of infrastructure that would otherwise require significant code in ManageIQ. Further, doing things the Kubernetes/OpenShift way solves many issues with LB/HA and gives us an easier means of introducing new or better technology when needed.
You are correct that it is more complex, but the nature of the problems is more complex too, and that complexity most likely can't be avoided.
@ohadlevy Thank you for your reply.
I'm not against new technologies if they improve the quality of the code. But so far I see that the current code has a lot of childish errors/bugs and typos. I do not know anything about the development infrastructure built for the ManageIQ project, but I can assume it is missing, because every new ManageIQ release brings me a lot of new bugs. I want to propose redirecting the engineers' attention toward improving the quality of the code, not by using new progressive tools, but by paying more attention to the development process.
In themselves, progressive tools make sense when there is nothing else and they are the only weighted choice. But right now the choice of containers looks more like a cargo cult, because the code is bad and does not work, and it will not work after the infrastructure is transferred into containers.
Overview
The ManageIQ team underwent a rearchitecture investigation during the summer of 2017.
The primary reason for beginning this investigation was an increasing number of customer escalations. In looking at these customer escalations we found that many of them were consistent in that they were problems of scalability and performance. The problems were almost always of the form "Inventory collections are delayed or hit timeout errors"; "Metrics collections can’t keep up"; "Too many queue messages bog down the system". Of greatest concern was that if these escalations keep increasing at the current pace, the situation will only get worse as we become distributed with other platforms.
The secondary reason for us doing the investigation was a need to embrace OpenShift/Kubernetes as a platform. In particular, we can manage OpenShift/Kubernetes, but we can’t run on it, which is confusing to users as they have a PaaS, but still need to deploy a virtual appliance.
In order to tackle these challenges, we set out to investigate what major changes would be required for our product to run on OpenShift/Kubernetes, and once there, what features of it we could leverage to tackle the scalability and performance problems. Additionally, since Kubernetes gives us the ability to play with numerous technologies, we would take this opportunity to try out new technologies and see what they can do for us, particularly in replacing the home-grown things we’ve built over the last 10 years.
Teams
The approach was to take a number of developers, break them down into teams, and those teams could deep-dive into their area of expertise, with daily standups and demos to keep all parties in sync.
Breakdown
Prior to beginning their efforts, the transition to running on OpenShift was broken down into 4 phases.
The teams investigated Phases 1-3 with varying degrees of attention. For ManageIQ Gaprindashvili, there are some important deliverables, so in working on Phase 2, they paid special attention to how the changes could be implemented in Phase 1 for reuse in Phase 2. They also thought ahead, in their designs for Phase 2, to how a "Bring Your Own Image" and microservice world might look in Phase 3.
OpenShift / Kubernetes
The team developed a proof of concept that ran various parts of our application as pods on OpenShift. While the ultimate goal is to run on Kubernetes, some OpenShift specific features were leveraged in the PoC. However upon deeper analysis, we believe we won't need those OpenShift specific features and can run fully on Kubernetes. Running on Kubernetes is preferable as it allows us the opportunity to promote ManageIQ to a much wider audience.
Orchestration
There will be a primary pod known as the ManageIQ Orchestrator. The purpose of this pod is similar to our current evmserverd process, and initially it will be the same code as the evmserverd process. The evmserverd process is aware of the state of workers, knows to spin the number of workers up or down, watches heartbeats for liveness and kills workers as appropriate, and can also watch for CPU and memory thresholds. These are all abilities of OpenShift, so marrying evmserverd with the OpenShift API will allow us to leverage OpenShift and let it do what it does best.
Our new ManageIQ Orchestrator communicates with the OpenShift API, dynamically deploying worker pods, and scaling them up or down based on user changes in the ManageIQ UI. Eventually, it would be preferable to autoscale the workers based on some metrics such as number of requests or queue depth, thereby removing that burden from the administrator, but for now we will leverage the existing code to manually set worker counts.
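As a rough illustration of this idea, here is a minimal Ruby sketch of how an orchestrator process might scale a worker deployment when the administrator changes the worker count. The use of the kubeclient gem, the in-cluster credentials, and the deployment and namespace names are all assumptions for the sake of the example, not the actual implementation.

```ruby
require 'kubeclient'

# Connect to the cluster's apps API group, where Deployments live. Inside a
# pod, the service account token and CA are mounted at well-known paths.
auth = { bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token' }
ssl  = { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }

apps_client = Kubeclient::Client.new(
  'https://kubernetes.default.svc/apis/apps', 'v1',
  auth_options: auth, ssl_options: ssl
)

# Hypothetical example: the administrator raised the generic worker count to 4
# in the ManageIQ UI, so the orchestrator patches the deployment's replica count.
def scale_worker(client, deployment_name, replicas, namespace: 'manageiq')
  client.patch_deployment(deployment_name, { spec: { replicas: replicas } }, namespace)
end

scale_worker(apps_client, 'generic-worker', 4)
```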
The Orchestrator will also be responsible for launching dependent services. Dependent services are components of the architecture that are shared by all components, such as the PostgreSQL database pod, the memcached pod, and others. These dependent service pods will have an OpenShift Service in front of them so they can be internally routed to.
Workers
There are 3 categories of workers: service workers, shared workers, and provider specific workers. These workers will run as separate OpenShift Deployments that are dynamically requested by the Orchestrator.
Service workers are workers that need to be routed to and thus need to be load balanced behind an OpenShift Service. These include the UI worker, API worker, and Websocket worker. Each one is a separate deployment that can be scaled independently. The user’s path to these workers starts at the external OpenShift route, which will accept incoming connections on port 443, handling the SSL negotiation. This traffic will then pass through the external auth container for external authentication, which is described in more detail below. Then, based on the incoming URL, the auth container will route to the appropriate Service for UI, API, or Websockets, and the Service will handle load balancing across the workers of that type. Additionally, the Orchestrator will deal with role-enabled service workers, such as the EmbeddedAnsible worker. This type of worker is only deployed if the corresponding role is also enabled.
Shared workers are workers that do not need to be routed to and thus don’t need a load balancing Service. They are the core workers of the ManageIQ platform. There are 2 types of shared workers: regular and provider-enabled. The regular workers include Generic, Priority, Reporting, and Schedule workers. These will work nearly the same as they do now, and can be scaled up/down as needed. Provider-enabled shared workers, known as "persisters", will only be deployed if a provider has been configured. These "persisters" will be described in more detail in the next part.
Provider specific workers come into play when a provider has been configured; when that happens the Orchestrator will start a number of "collector" workers, "helper" workers, and shared "persister" workers.
First, the Orchestrator will start a number of provider specific "collector" workers, handling inventory collection, metrics collection (if available), and event collection (if available). These "collector" workers will speak to the provider directly, collecting its information and placing it in a well-defined format in the new messaging system. Although the workers will start their lives as mostly the same Ruby code as they are now, they are decoupled from the ManageIQ application, and ultimately can be written in whatever language is best for that provider, and run in whatever image environment they need. This is what we are calling "Bring Your Own Image", described in more detail later.
Additionally, if needed, the Orchestrator will start a number of provider specific "helpers". These include things like the existing VimBrokerWorker for VMware, or perhaps a future native-operations microservice.
Finally, the Orchestrator will start a number of shared workers called "persisters". These persisters are responsible for watching the queues/topics for incoming data and persisting that data to the database. Since the data from the queue will be in a well-defined format, these persisters can be provider agnostic, and so they will be shared across all providers. The inventory persister can leverage a new stream-based, partial update refresh strategy that can update the database in a more real-time fashion. The old refresh strategies will still be available for providers that can’t take advantage of this new strategy.
Authentication
The external authentication mechanisms will be extracted into a dedicated image that will act as a middleman between the external facing route and our internal worker Services. One important goal of the external auth image is that it will be application independent allowing for its reuse with any other container-based product running on OpenShift.
Configuration of the external auth container will be done with two OpenShift ConfigMaps. One ConfigMap is for the external auth configuration itself, including support for IPA, Active Directory, LDAP, and SAML / Keycloak. This ConfigMap will be generated by a separate "helper" container that can be run by the customer directly with an interactive script or interface, helping them set up their configuration. The generated ConfigMap can then be fed into the external auth pod.
The second ConfigMap is the application specific config map that will allow the developers to inject their own httpd conf file detailing the RewriteRules and RewriteConds to where the traffic should be routed within the project. For example, in ManageIQ, we will inject our application’s rules to route to the UI, API, and Websockets Services based on the URL.
The external auth image will also expose a small microservice to facilitate DBUS queries. This allows the application to make queries back to the pod for detailed group information of a particular user, should that information be necessary.
The external auth image will not be concerned with SSL traffic and certificates as that will be handled by the OpenShift Route as mentioned previously.
Note that since the external auth image will be running systemd internally (a requirement for SSSD), it does require the anyuid privilege in OpenShift. On versions of OpenShift that do not have the oci-systemd-hook enabled, such as MiniShift, then an additional sysadmin privilege will be needed.
MiqServers and Zones
In the old appliance model, each appliance was 1-to-1 mapped to an MiqServer, and a Zone was defined as a set of MiqServers. Zones would be used for grouping the MiqServers for various purposes. Combined with the ability to map Providers into a Zone, these purposes can be summarized as follows:
In the new OpenShift model, we can think of all of our nodes as one giant expanse of compute, and thus like one giant MiqServer. The MiqServer concept is therefore no longer necessary, and if MiqServer goes away, the concept of grouping MiqServers into Zones goes away with it. By removing the "grouping" aspect of Zones, we can focus on the deeper underlying use of Zones, which is affinity to resources. OpenShift handles affinities using Labels and Selectors, and we can leverage these to implement Zones.
To achieve the same affinities as previously desired, the user can label their OpenShift nodes with "zone_<zone name>=true", and configure those nodes to have network connectivity to the external resource. These same zone names would be created in ManageIQ, and the providers would be mapped to the zones as is done now. The Orchestrator, aware that the provider is zone-restricted, would apply that zone selector when dynamically launching the workers, and those workers would be scheduled to run only on those nodes.
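To make the label-and-selector idea concrete, here is a hedged Ruby sketch of how the Orchestrator might attach a zone selector when it builds a worker Deployment definition. The label key format, resource names, and image name are illustrative only, not the finalized design.

```ruby
require 'kubeclient'

# Illustrative only: build a Deployment definition for a provider-specific
# collector and pin it to nodes labeled for the provider's zone.
def collector_deployment(provider_name, zone_name, image)
  Kubeclient::Resource.new(
    metadata: { name: "#{provider_name}-collector", namespace: 'manageiq' },
    spec: {
      replicas: 1,
      selector: { matchLabels: { app: "#{provider_name}-collector" } },
      template: {
        metadata: { labels: { app: "#{provider_name}-collector" } },
        spec: {
          # Only schedule onto nodes the operator labeled, e.g. with
          # "zone_dmz=true" for a zone named "dmz".
          nodeSelector: { "zone_#{zone_name}" => 'true' },
          containers: [{ name: 'collector', image: image }]
        }
      }
    }
  )
end

# apps_client.create_deployment(
#   collector_deployment('vmware', 'dmz', 'manageiq/vmware-collector:latest'))
```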
One potential downside to this approach is that shared workers, such as generic workers, cannot be run with specific selectors, as their replicas are automatically handled by OpenShift and they do not have an identity to which we could apply the selector. This means that we cannot have specific generic work items routed to a zone. However, in analyzing all of the callers of the MiqQueue, it was found that all of the usages of zone fell into one of three categories, all of which can be handled with some code changes.
Additionally, there is one huge advantage to implementing Zones in this fashion. In the current model, since a Zone doubles as a "grouping" of MiqServers, there is no way to have an MiqServer in two different zones. In the new model, the same OpenShift node can be labeled with multiple zone labels. This means that depending on the network topology, this could reduce a significant number of appliances/nodes that would otherwise be necessary by allowing the customer to "double up" where the networks are shared.
Settings
An interesting side effect of removing the MiqServer concept (and by extension, the "grouping" aspect of Zones), is that Zone-level and MiqServer-level settings are no longer needed. (Note that this section is conceptual, and has not been vetted as part of the rearchitecture efforts)
We will still need the layering aspect to handle productization and environment specific settings. These are driven through files and would be part of the base container image. Changes to the configuration are currently stored in the database in the settings_changes table, and this can continue in the short term; however, the original purpose of that table was for MiqServer-level changes, and with those removed, we will not need the settings_changes table in its present form, or perhaps at all.
One possible replacement for settings_changes is to just store the changes in a single file, and persist it somewhere. One option is to do this through a settings microservice, so that it can be the single source of truth for the "current" settings. Alternately, we could use a ConfigMap, but that might force redeployments and would be difficult to change via our UI.
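As a purely conceptual sketch of the layering described above (the file paths, keys, and merge behavior here are assumptions, not the vetted design), the base settings shipped in the image could be deep-merged with a single persisted changes document:

```ruby
require 'yaml'

# Conceptual sketch: layer the settings shipped in the image with a single
# persisted "changes" document (from a file, a settings microservice, or a
# ConfigMap mount). Paths and keys are illustrative.
BASE_SETTINGS_FILES = %w[config/settings.yml config/settings/production.yml].freeze
CHANGES_FILE        = '/persistent/settings_changes.yml'.freeze

def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_val, new_val|
    old_val.is_a?(Hash) && new_val.is_a?(Hash) ? deep_merge(old_val, new_val) : new_val
  end
end

def effective_settings
  base = BASE_SETTINGS_FILES.select { |f| File.exist?(f) }
                            .map { |f| YAML.safe_load(File.read(f)) || {} }
                            .reduce({}) { |acc, layer| deep_merge(acc, layer) }
  changes = File.exist?(CHANGES_FILE) ? (YAML.safe_load(File.read(CHANGES_FILE)) || {}) : {}
  deep_merge(base, changes)
end

# effective_settings.dig("workers", "generic", "count")
```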
Replication
Multi-Region replication is currently implemented using the pglogical extension within PostgreSQL. pglogical uses logical replication over the same 5432 port that regular database traffic uses. Thus, the only requirement to make replication work is to be able to expose that port outside of the OpenShift cluster. There are a number of ways to expose services externally [1], and the user can choose which one works best for them.
By default, we will use the Ingress IP Self-Service [2] method. This works via a configuration on the OpenShift Service for the PostgreSQL pod. When set, this tells OpenShift to open a random port in the 30k-60k range on all nodes in the cluster, and map that port to the internal Service’s port. When configuring replication in ManageIQ, all that needs to be done on the global region is to use the hostname of any node in the cluster, and that random port number.
Memcached and Sessions
In the appliance model we ship with memcached as an efficient memory-based session storage for the UI. However, since memcached runs on each appliance, sessions must be made sticky to a particular appliance, otherwise if traffic is routed to a different UI worker on a different appliance, the session will be lost and the user logged out. To complicate the matter further, if the customer puts the multiple UI appliances behind an external load balancer, there is no way for us to make the sessions sticky, and so they are forced to switch the session storage to database-backed storage.
In the new OpenShift model, we can ensure that there is one shared memcached instance for the entire region. This completely eliminates the need for sticky sessions, and simultaneously eliminates the need to maintain the code for database-backed session storage, greatly simplifying the entire problem.
Additionally, leveraging OpenShift allows us to experiment with new technologies. For example, the UI team has demonstrated that when running on OpenShift we can replace the backend storage for ActiveJob connections with Redis, which has full upstream Rails support, instead of using the current PostgreSQL backend, which has very little upstream Rails support. Then, once we have Redis as a dependent service, that could be used as a replacement for memcached.
Dual Deployment
Even with the ability to run natively in OpenShift, we still have customers that will be using VMware, RHV, etc., and want to deploy with the appliance model. From a development perspective, it is a costly overhead to maintain running on multiple platforms. Dual Deployment is the term for changing the way the appliance works such that each appliance is an OpenShift node inside. This way, we eliminate the overhead of ManageIQ running on multiple platforms, and can focus on running only on OpenShift. The customer, however, should not need to be aware that they are using OpenShift under the covers, nor should they need to become familiar with the various OpenShift terms and administration procedures.
Leveraging the openshift-ansible installer, we will install OpenShift inside a virtual image with a RHEL Atomic base. The image can be pre-populated with all of the images for the workers, or alternatively, we could have a smaller virtual appliance by not including the images, and they could be fetched on first use.
The first instantiation of the appliance would be the OpenShift master node. Subsequent virtual appliances would join the cluster as a regular node. In order to make this simpler for customers, we will need an interface similar to the existing appliance_console, but where the user can choose to create the first master node, or join the cluster. Additionally, the new console will need a way to allow setting zone labels on the node. One possible implementation could be to run Cockpit on the appliance, and implement this interface as a Cockpit module. (Note that the details in this paragraph are conceptual and have not been vetted as part of the rearchitecture efforts)
Service Providers
An important use case to consider is how service providers can provide a seamless experience to their customers without resorting to multiple regions. (Note that this section is conceptual, and has not been vetted as part of the rearchitecture efforts)
Presently, there are a number of constructs within ManageIQ that aid in supporting this, including tenants, appliance-specific branding, and appliance-specific external authorization. One example implementation in the current model is to set up a UI Zone for each tenant, apply appliance-specific branding and external auth configuration to each appliance in that Zone, create an external LoadBalancer in front of those UI appliances, and then set up the DNS entry for that sub-customer to point to that LoadBalancer. This requires a lot of extra appliances and external resources, and is a pain to maintain.
In the new OpenShift model, the MiqServers will no longer exist, but that is where appliance-specific branding is stored today. Instead, we will need a different way to store branding, and we can leverage the tenancy structure, which already has support for branding. With the addition of a mapping between incoming hostname and a tenant, we can provide a tenant-specific login screen, and once a user logs in, we know their tenant and can show the appropriate application-level branding. For example, if company-a.example.com is mapped to TenantA and company-b.example.com is mapped to TenantB, we would show the appropriate branding based on the hostname of the incoming request and/or the user’s tenant association.
If separate SSL certificates are required for different hostnames, that can be implemented by having separate OpenShift Routes. The DNS entries would then route to the correct Route, and the routes would map to the same external authentication pod.
If separate external authentication is needed, then we can spin up a separate external auth deployment for each tenant, but map the outgoing traffic to the same internal UI, API, and Websocket Services.
Leveraging OpenShift will allow us to have a much simpler Service Provider experience, potentially eliminating a lot of extra appliances and workers that would otherwise be duplicated.
Log Collection
For diagnostic purposes and bug triaging, log collection is of utmost importance. However, in the OpenShift model, where pods are coming and going, it would be very difficult to collect log files on demand, not to mention that log files for terminated pods are not retained after some time. This is where the Red Hat Common Logging initiative will help us.
Common Logging is implemented using the EFK stack (ElasticSearch, fluentd, Kibana). Fluentd is configured to ingest anything written to STDOUT by the various pods and write that information into ElasticSearch. If the output written to STDOUT is in JSON with specific keys, then that output can be parsed and put into ElasticSearch in a much more discoverable way. Kibana can then be used to look at the data stored in ElasticSearch so a user can visualize and search.
For the purposes of offline log inspection, however, we would need these logs exported. This can be done using a tool called elasticdump, which will export the Elastic database. That export can be transferred to us, where we can import it and inspect it with Kibana.
For the Phase 1 implementation, we have already changed our logs to write to STDOUT in a well-defined JSON format, so Common Logging can already be used if it is present. For Phase 2+ we would need to ensure that either OpenShift comes with Common Logging enabled, or we would need to launch those services ourselves.
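For illustration, a minimal Ruby logger that emits one JSON document per line to STDOUT might look like the following. The field names are placeholders, not the exact keys used by the Phase 1 work.

```ruby
require 'json'
require 'logger'
require 'time'

# Minimal structured logger: one JSON object per line on STDOUT, which fluentd
# can pick up and index into ElasticSearch. Field names are illustrative.
logger = Logger.new($stdout)
logger.progname  = 'generic-worker'
logger.formatter = proc do |severity, time, progname, msg|
  {
    '@timestamp' => time.utc.iso8601(3),
    'level'      => severity,
    'service'    => progname,
    'message'    => msg
  }.to_json + "\n"
end

logger.info('Dequeued 25 messages')
# {"@timestamp":"2017-09-01T12:00:00.000Z","level":"INFO","service":"generic-worker","message":"Dequeued 25 messages"}
```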
Database Backup and Restore
In the appliance model, database backup and restore are part of the responsibilities of the appliance, but in OpenShift we will not have access to the underlying storage of the PostgreSQL database from the workers nor from the new appliance_console. Instead of the application being responsible for backup and restore, we will transfer this responsibility to the customer via one of two methods. If the customer is using an underlying storage system that supports storage snapshots, then they can use those. Alternately, a dedicated container, run as an OpenShift Job, will connect to the database pod and perform a full binary backup of the entire database cluster, based on pg_basebackup. For the Phase 1 implementation, this has already been documented [3].
In the Dual Deployment appliance model, we will need to automate this via the new "appliance_console". (Note that this paragraph is conceptual, and has not been vetted as part of the rearchitecture efforts)
Upgrades
The ManageIQ Orchestrator is responsible for starting and stopping all of the workers and services, so an upgrade is a matter of just updating the Orchestrator. Once the Orchestrator is updated and redeployed it could bring down all running instances and relaunch them. Database migrations are always run by the Orchestrator on startup. (Note that this section is conceptual, and has not been vetted as part of the rearchitecture efforts)
Performance
The performance team focused on what could be done with the existing workers to lower the memory usage, and thus allow us to streamline our workers better. They did not focus on what kind of savings we would get just switching to the OpenShift model alone, but we are expecting some savings that will come from:
Bundler Groups
Bundler groups are a way of categorizing the various rubygems that comprise the application, and are defined in the Gemfile. When gems are categorized by feature, we can specify which groups to enable or disable when launching a worker, significantly reducing the amount of code loaded, the memory used, and the overall boot time of a worker. For the workers with the most savings, we found we could drop the worker baseline memory usage from ~144MB to ~100MB, or ~30.5%. Additionally, less memory means less time spent in GC in Ruby, making the Ruby processes more efficient.
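As a simplified, hypothetical example of this technique (the group and gem names below do not reflect the actual ManageIQ Gemfile), gems can be grouped by feature and only the needed groups loaded at worker boot:

```ruby
# Gemfile (hypothetical grouping; names are illustrative)
source 'https://rubygems.org'

gem 'rails'                       # always loaded (default group)

group :ui, optional: true do
  gem 'sassc-rails'
end

group :metrics, optional: true do
  gem 'some-metrics-client'       # placeholder gem name
end

# Worker boot (e.g. in the worker's entrypoint): load only the default group
# plus the feature groups this worker actually needs.
#
#   require 'bundler'
#   Bundler.require(:default, :metrics)
#
# A metrics collector started this way never loads the UI gems, reducing
# resident memory and boot time.
```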
Additional Potential Savings
Queue
The Queue team focused on our use of the MiqQueue from two perspectives. One was examining how we actually use the MiqQueue, and the other was investigating a number of messaging systems to see which one fulfilled as many of our use cases as possible. ActiveMQ Artemis was ultimately chosen as the new successor to the MiqQueue.
manageiq-messaging gem
The team built an abstraction layer gem over the ActiveMQ Artemis client libraries, named manageiq-messaging. In doing so they created a simple API, which will simplify the transition away from the MiqQueue.
The gem also creates abstractions for 3 different use cases: background jobs, queue messages, and topics (aka pub/sub).
Background Jobs are very similar to the current MiqQueue style, where the producer defines the exact class, method, and args. A worker subscribing to background jobs, such as the generic worker, does exactly what it’s told to do, so the logic for "how" to do the work is on the producer side.
Queue messages are slightly different than Background Jobs in that they represent a request for action, but without specifying how to accomplish the request. The "how" to do the work is implemented on the subscriber side instead.
For example, with background jobs, if someone clicks the start button on a VM in the UI, you would put the specific provider’s class name, and the "start" method, on the queue. An operations worker might pick up that work and run it exactly as described. However, with queue messages, we instead just put a "start" message with the provider id into a general queue or into a provider specific queue. The operations workers could then watch that queue filtering out messages it doesn’t care about and only handling the messages it does care about. The logic for "how" to handle the message is in the operation worker. This distinction is important as it can allow the platform (the producer side) to be more provider agnostic, leaving the details to provider-specific workers (the subscriber side).
The third use case is topics (aka pub/sub). Topics are similar to an event stream where a producer just emits events into the topic, and multiple subscribers can act upon some or all of the messages they see in that topic. Topics are very useful as the channel between provider-specific collectors and provider-agnostic persisters, as described previously and in more detail later.
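To illustrate the three use cases conceptually, the sketch below shows how a client wrapping the broker might be used. This is a hypothetical abstraction, not the actual manageiq-messaging API; the client class, method names, service names, and payload shapes are all assumptions.

```ruby
# Hypothetical client abstraction; method names are illustrative and do not
# document the real manageiq-messaging gem.
client = MessagingClient.connect(host: 'artemis', port: 61616)

# 1. Background job: the producer states exactly what to run ("how" lives here).
client.publish_background_job(service: 'generic',
                              class_name: 'MiqReport',
                              method_name: 'generate',
                              args: [42])

# 2. Queue message: a request for action; the subscriber decides "how".
client.publish_message(service: 'ems_operations',
                       message: 'start_vm',
                       payload: { ems_id: 1, vm_ref: 'vm-123' })

client.subscribe_messages(service: 'ems_operations') do |msg|
  # Filter out messages this worker doesn't handle, act on the rest.
  puts "starting #{msg[:payload][:vm_ref]}" if msg[:message] == 'start_vm'
end

# 3. Topic (pub/sub): every subscriber sees the stream and picks what it needs.
client.publish_topic(service: 'provider_events',
                     event: 'vm_powered_off',
                     payload: { ems_id: 1, vm_ref: 'vm-123' })

client.subscribe_topic(service: 'provider_events') do |event|
  puts "persisting #{event[:event]}"
end
```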
Ongoing Challenges
Providers
In general, the greatest performance bottleneck for nearly all aspects of providers is the usage of the MiqQueue. So, the focus of the providers team was primarily finding ways to change how collectors work so that they wouldn’t use the MiqQueue anymore, or if a messaging system was still needed, leverage features from that new system instead. Additionally, some aspects of the code organization lead to process bloat, which has been addressed in the new designs.
Collector / Persister split
A common pattern for the rearchitecture of providers is what we are calling the "Collector / Persister split". This refers to the separation of native-side collectors and platform-side persisters. In the current application, the collector/persister split only exists for events, with the MiqQueue as the intermediary. For inventory and metrics, collection and persistence happen in the same process. Being in the same process tends to bloat that process, because it must carry the code and data for both sides of the equation.
Bring Your Own Image
An important goal we are striving for is "Bring Your Own Image". Letting provider authors write code in the language they find best allows them to focus on the task at hand and to use the most optimized languages and technology. Typically, the provider’s client library that is best kept up-to-date is the one written in the native language, so allowing developers to write in that language allows them to stay as up-to-date as possible. Additionally, as ManageIQ grows, we want it to be the de-facto platform for management, and one important way to accomplish this is to ensure that we are aligned with the upstream communities of the providers we manage. "Bring Your Own Image" helps that effort by keeping the code consistent with the code of the provider, thus allowing any direct contributor to the upstream provider to be a contributor to the management plugin. For example, authors of a Go-native provider don’t have to learn Ruby to contribute to ManageIQ, but can write their provider plugin in Go, using their Go-native client library.
Events
Since events mostly follow the collector / persister split already, the focus was on eliminating the MiqQueue. The MiqQueue becomes a bottleneck because it has a difficult time handling a large stream of events. A heavily used provider, or a flood of events (aka an "event storm"), can bog down the MiqQueue itself, which in turn bogs down the entire application.
The proposal is to replace the MiqQueue directly with ActiveMQ Artemis topics. Collectors (which can be written in their own language) will capture the events and write the data directly to the ActiveMQ Artemis topic. From there, multiple subscribers can read from that topic, handling the events as appropriate. One subscriber will be the platform-side EventHandler (renamed to EventPersister), which will watch the topic for events it cares about and write them to the database for reporting purposes in timelines. It will ignore events it doesn’t care about, such as those that won’t ever appear on the timeline. A second platform-level subscriber, named the AutomateEventHandler, will watch the topic for events that the user has written automate handlers for, and react to only those events. Providers themselves may choose to write provider-level subscribers, such as an inventory collector driven off of events instead of polling. By subscribing to the events, they can use the data stored in the events to do a more efficient inventory collection.
Inventory
For inventory, some of the providers follow a collector/persister split at the source level, but the code still runs in the same process. This causes massive memory bloat, because typically the process collects a lot of data, keeps it all in memory, and can’t free that memory until the persister has completed. The persister needs to make a "copy" of that data for the purposes of writing to the database, and thus you end up with duplicate copies of the data in memory. From a scalability perspective this is a huge problem for very large providers, in particular the public images of a cloud or container provider.
The proposal to solve these problems is to enforce the collector / persister split. Collectors (which can be written in their own language) are responsible for collecting the data from the provider and placing it into an ActiveMQ Artemis queue in a well-defined format. On the platform side, persister processes will watch the queues and write that data to the database. This keeps the processes separate, keeping memory levels stable.
Another problem on the inventory side is the usage of the MiqQueue to communicate requests for updates to the refresher process. The inventory code, in order to prevent duplicate requests in the queue, and to "rollup" requests for the same provider, modifies existing queue items to "add" provider requests to them. As described earlier, the MiqQueue feature of being able to modify entries is a major cause of problems in the queue, and is also not available in a new messaging system.
To avoid the queue manipulations, collectors will be responsible for knowing what to collect and, more importantly, when to collect it. Every provider has drastically different mechanisms for knowing when to collect: some rely on events, some have callback mechanisms, and others have nothing except constant polling. For example, in the VMware provider, we have the WaitForUpdates method, which allows a callback with the exact inventory changes that could pretty much be written directly to the database, but we can’t use it. Instead, we use events, and part of event handling is to put a targeted or full refresh request on the queue. This makes it impossible to leverage the super-efficient mechanism already provided by VMware. Leveraging the WaitForUpdates method directly would also significantly reduce collector memory, because only a very small number of changes would be in memory at any given time.
A third problem on the inventory side is that most providers can only handle a full refresh, meaning that the entirety of the inventory must be queried for, compared to the database, and the changes written. For the initial refresh, and refreshes where you want to "fix" the data such as on a reboot, this is acceptable, but for regular ongoing updates it is not. We do have the concept of "targeted refresh", but it is complicated for a provider author and only supports a small set of objects, namely Vms + Hosts on Infra and Cloud providers only.
Going back to our VMware example, even if we used WaitForUpdates directly, we couldn’t update the database because the only refresh strategies we have are full refresh or targeted refresh. So, the proposal to solve this is two-fold. First, a new graph refresh strategy is to be implemented (development for this had already started prior to the rearchitecture). The graph refresh is a more advanced refresh strategy which can understand partial updates to the inventory, allowing targeting of any object in the inventory data. Second, the provider collector will publish these partial updates to an ActiveMQ Artemis queue, where the persister will watch for these partial updates and apply them. The data will thus come into the system in a more real-time fashion. In the example VMware provider, the results of WaitForUpdates can be put as a partial update into the queue.
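As an illustration of the well-defined payload idea, a collector reacting to a WaitForUpdates-style change might publish a partial update like the following. The field names, queue name, and the messaging client referenced in the comment are assumptions, not a finalized schema.

```ruby
require 'json'
require 'time'

# Illustrative partial-update payload a collector might publish after seeing a
# single VM change; field and queue names are assumptions, not a final schema.
partial_update = {
  ems_ref:    'vm-123',               # provider-native identifier
  type:       'vm',
  timestamp:  Time.now.utc.iso8601,
  attributes: { power_state: 'poweredOff', memory_mb: 8192 }
}

# Using the hypothetical messaging client from the earlier sketch:
#
#   client.publish_message(service: 'inventory', message: 'partial_update',
#                          payload: partial_update)
#
# A provider-agnostic persister subscribed to the 'inventory' queue would apply
# this change to the database via the graph refresh code, without ever holding
# the provider's full inventory in memory.
puts partial_update.to_json
```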
Metrics
The metrics team mostly focused on the backend storage mechanism for metrics, believing it to be the primary bottleneck. As such, much of the time was spent in investigating alternate backends until we realized we have higher-level problems in the application itself.
Storage backend
The current storage mechanism for metrics is the PostgreSQL database. The team researched a number of Time Series Databases (TSDBs), but each one researched seemed to have some problem that gave us pause in choosing that database as the new backend storage mechanism. Additionally, we found that our problems come more from how we read and write data into the current storage mechanism, than from the storage itself.
One of these problems is what we call the "20 second interval" problem. The original metrics implementation was written when VMware was the only provider we supported, and thus many of the decisions made were based around VMware’s "realtime" collection interval of 20 seconds. The database storage mechanism itself doesn’t care about 20 second intervals; however, the application, on both the reading and writing sides, expects the data in 20-second intervals. Even if we changed the storage backend, we would still have this problem and would have to account for it anyway. One major consequence is that provider authors are constrained, and many need to add incredibly complicated code just to manipulate their metrics into 20-second buckets. The only way to solve this is to completely remove the 20-second interval restriction on both the writer’s side and, more importantly, on the reader’s side.
Another problem is the strict schema of our metrics tables. Each row in the table stores the metric values in columns for that timestamp. However, this restricts the metrics to a predefined set of columns. New providers bringing new metrics, or even new metrics in existing providers, require changes to the schema, which makes it more difficult to bring new things to the application. The only way to solve this is to completely replace the strict schema with a more flexible mechanism. This could be implemented by choosing a new storage backend or by changing the existing storage backend to use a jsonb column.
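One way the jsonb option could look is sketched below. This is purely illustrative; the table, column, and class names are not the actual ManageIQ schema, and it assumes Rails with the PostgreSQL adapter.

```ruby
# Purely illustrative ActiveRecord sketch of a flexible metrics schema.
class CreateFlexibleMetrics < ActiveRecord::Migration[5.0]
  def change
    create_table :flexible_metrics do |t|
      t.bigint   :resource_id,   null: false
      t.string   :resource_type, null: false
      t.datetime :timestamp,     null: false
      t.jsonb    :values,        null: false, default: {}   # arbitrary counters per sample
    end
    add_index :flexible_metrics, [:resource_type, :resource_id, :timestamp]
  end
end

class FlexibleMetric < ActiveRecord::Base
  # A provider can store any counters it has, at any interval it collects:
  #   FlexibleMetric.create!(resource_type: 'Vm', resource_id: 1,
  #                          timestamp: Time.now.utc,
  #                          values: { 'cpu_usage_rate_average' => 12.5,
  #                                    'net_usage_kbps'         => 340 })
end
```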
Scheduling of collection
As far as scalability goes, the existing database, while it has some problems that need to be addressed, doesn’t really contribute to the main scalability problem. The bigger problem is in how we schedule metrics collections, and how those collection requests are communicated to the collector workers via the MiqQueue.
In the current architecture, collector workers are not provider-specific, but are instead "shared" workers that do both collection and persistence, leading to multiple problems.
The proposed solution is to change how metrics collection scheduling works by following the collector / persister split pattern. Collectors (which can be written in their own language) are responsible for collecting the metrics from the provider and writing them into an ActiveMQ Artemis queue. Collectors can collect in whatever way they deem most efficient. More importantly, much like in the inventory changes, it will be up to the collector to determine what data to collect and when. This eliminates the platform-side scheduling bottlenecks entirely, and allows the provider author to decide how best to determine which metrics to collect and how frequently. Persisters will then read from that ActiveMQ Artemis queue and persist the metrics to the storage backend. Additionally, this could theoretically allow for the implementation of a long-requested RFE where the provider may want to write "directly" into the ManageIQ database. It could be implemented as a "collector", but one that runs in the provider itself, writing to a well-defined API endpoint, which, in turn, would write to the same ActiveMQ Artemis queue.
ActiveMetrics gem
Much like the abstraction layer written over ActiveMQ Artemis, there will also be an abstraction layer written over the storage backend, which we have called ActiveMetrics. In order to get our code away from the 20-second intervals and the details of how the records are stored, ActiveMetrics will provide an abstraction for both reading and writing, allowing us to transition to alternate backends.
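Since ActiveMetrics is only an abstraction layer, its public interface might look roughly like the following. This is a hypothetical sketch, not the actual gem's API; the module, method, and backend names are assumptions.

```ruby
# Hypothetical sketch of what an ActiveMetrics-style abstraction could expose.
module ActiveMetrics
  # Writers don't care about intervals or storage layout: they just record
  # samples with a timestamp and a hash of counter values.
  def self.write(resource:, timestamp:, values:)
    backend.write(resource: resource, timestamp: timestamp, values: values)
  end

  # Readers ask for a time range and get samples back at whatever granularity
  # the backend stored, with any rollups handled behind the abstraction.
  def self.read(resource:, counters:, start_time:, end_time:)
    backend.read(resource: resource, counters: counters,
                 start_time: start_time, end_time: end_time)
  end

  def self.backend
    @backend ||= PostgresBackend.new   # placeholder; could later be a TSDB adapter
  end
end
```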
Automate
A number of challenges have been found with the existing automate infrastructure, which led us to investigate what changes we could make if we could leverage OpenShift.
Launching methods and DRb
In the current application, automate methods are written in Ruby and are launched in a separate child process. This separate process is important because it sandboxes the customer-written methods (to an extent), and prevents them from modifying the ManageIQ application directly, even accidentally. In order to communicate with the child process we use DRb. DRb (Distributed Ruby) is a Ruby-based inter-process communication layer that uses serialization of Ruby objects. While very useful, it is extremely tricky to manage properly and has led to a number of major escalations over the years, not to mention the memory bloat and performance problems it has introduced. Over time we’ve come to learn that the usage of DRb is a major source of problems within our system.
When we evaluate an Automate model, the resolution’s results are stored in a "workspace" in the process’ memory. As we walk through the steps of the resolution we continually update that workspace. When we launch an automate method written in Ruby, we establish a DRb connection between that child process and the parent process, so that the child process can modify the workspace stored in the parent and can also access any of our objects from the database. However, by leveraging DRb, not only do we have the aforementioned problems, we also cannot support any method language other than Ruby (including the new Ansible playbook methods).
To solve this, in Gaprindashvili, we are creating an alternate way to access the workspace via the API. Before launching an automate method, the parent process will export the in-memory workspace into the database, where it can be accessed via the API. Then, the child process can modify that workspace over the API, and it can also access any information it needs from the API. When the method is done, the parent process can look at the changes saved to the database and update its in-memory model accordingly. This further isolates the automate methods, and creates a uniform, cross-language way to communicate with the system.
With the API-based communication in place, we can then leverage OpenShift by having each automate method run as a container. When defining a new automate method, the author would choose the image for that method. A new automate worker would be responsible for talking to the OpenShift API, and launching these automate methods as deployments, on demand. By isolating with containers, we not only provide even stronger sandboxing, but can also allow "Bring Your Own Image" environments for those methods. One problem that automate method authors run into is that they are forced to use not only Ruby, but also whatever Ruby gems are available on the appliance, and modifying the appliance environment is tricky and potentially dangerous. With "Bring Your Own Image", that problem goes away entirely, as the customer can put whatever they want into the image, writing in whatever language they want, and do not have to manipulate the ManageIQ environment.
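A containerized automate method could then talk to the workspace purely over REST. The sketch below only illustrates the flow; the endpoint paths, environment variables, token mechanism, and payload shape are hypothetical assumptions, not the actual API.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical containerized automate method: fetch the exported workspace,
# change a value, and write it back over the API.
api_base = ENV.fetch('MIQ_API_URL', 'https://manageiq/api')
token    = ENV['MIQ_AUTH_TOKEN']
ws_id    = ENV['MIQ_WORKSPACE_ID']

def request(method_class, uri, token, body = nil)
  req = method_class.new(uri, 'Content-Type' => 'application/json',
                              'X-Auth-Token' => token)
  req.body = body.to_json if body
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == 'https'
  JSON.parse(http.request(req).body)
end

# Read the workspace the parent process exported before launching this method.
workspace = request(Net::HTTP::Get,
                    URI("#{api_base}/automate_workspaces/#{ws_id}"), token)

# Do the method's work, then push the modified values back.
workspace['values']['approved'] = true
request(Net::HTTP::Put, URI("#{api_base}/automate_workspaces/#{ws_id}"), token,
        { 'values' => workspace['values'] })
```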
Additionally, there has been a long-requested RFE to have specific methods run in specific Zones due to the affinity reasons described above. This would now be much easier to implement by having each method define what zone it needs to run in, and with Zones defined as OpenShift Labels, the Automate worker would then launch the automate method’s container with the appropriate selector.
[1] https://docs.openshift.com/container-platform/3.6/dev_guide/getting_traffic_into_cluster.html
[2] https://docs.openshift.com/container-platform/3.6/dev_guide/getting_traffic_into_cluster.html#using-ingress-IP-self-service
[3] https://github.com/ManageIQ/manageiq-pods#backup-and-restore-of-the-miq-database