Hi all devs! Between the lines I read two truths.
The first is that you have chosen the right way to split up what used to be a monolithic codebase; such a split will increase flexibility and manageability, which in turn allows you to intelligently identify bottlenecks and correct architectural problems.
The second is that Red Hat is continuing its marketing campaign to promote OpenShift and is using every possible method to do so. This worries me, because as an infrastructure operator I am afraid to imagine what changes I will need to make in the infrastructure I support to get a working product (CFME).
Containers allow developers to simplify the process of deploying code, but only for those who do not know how to package code correctly. For Red Hat this problem was solved long ago (by RPMs), and I would suggest not playing with this toy, discarding the excess, and concentrating on the code inside the workers rather than outside of them, or rather on making the current code in CFME finally work.
For such a key provider as RHOSP, I would suggest creating a distributed test site (~1000 tenants, networks, and VMs), because right now the code suffers from a devstack/packstack syndrome: it is tested on a small developer's laptop, and all the related problems follow from that. I cannot say that the OpenStack provider is in working order; the current code is not suitable for serious service.
For distributed deployment, Red Hat now has a good tool: Ansible. I would suggest simply moving away from the big binary appliance, splitting the cfme package into small RPMs (the same granularity as your pods), and describing the infrastructure (the one you specified above for OpenShift) as deployment code in an Ansible playbook. This would solve the same problems in a simpler way that is more acceptable to the end user, and at the same time it would push you to implement the intended refactoring of the architecture without using children's toys (such as containers).
@ITD27M01 Interesting feedback; I can see your concern about using Kubernetes/OpenShift. However, just to add my two cents: OpenShift and Kubernetes provide a lot of infrastructure that would otherwise require significant code in ManageIQ. Further, doing things the Kubernetes/OpenShift way solves many issues with LB/HA and gives us an easier means of introducing new or better technology when needed.
You are correct that it is more complex, but the nature of the problems is more complex too, and that complexity most likely can't be avoided.
@ohadlevy Thank you for your reply.
I'm not against new technologies if they improve the quality of the code. But so far I see that the current code has a lot of childish errors/bugs and typos. I do not know anything about the development infrastructure built for the ManageIQ project, but I can assume it is missing, because every new ManageIQ release brings me a lot of new bugs. I want to propose redirecting the engineers' attention toward improving the quality of the code, not by using new progressive tools, but by paying more attention to the development process.
In themselves, progressive tools make sense when there is nothing else and they are the only weighted choice. But right now the choice of containers looks more like a cargo cult, because the code is bad and does not work, and it will not work after the infrastructure is transferred into containers.
Overview
The ManageIQ team underwent a rearchitecture investigation during the summer of 2017.
The primary reason for beginning this investigation was an increasing number of customer escalations. In looking at these customer escalations we found that many of them were consistent in that they were problems of scalability and performance. The problems were almost always of the form "Inventory collections are delayed or hit timeout errors"; "Metrics collections can’t keep up"; "Too many queue messages bog down the system". Of greatest concern was that if these escalations keep increasing at the current pace, the situation will only get worse as we become distributed with other platforms.
The secondary reason for us doing the investigation was a need to embrace OpenShift/Kubernetes as a platform. In particular, we can manage OpenShift/Kubernetes, but we can’t run on it, which is confusing to users as they have a PaaS, but still need to deploy a virtual appliance.
In order to tackle these challenges, we set out to investigate what major changes would be required for our product to run on OpenShift/Kubernetes, and once there, what features of it we could leverage to tackle the scalability and performance problems. Additionally, since Kubernetes gives us the ability to play with numerous technologies, we would take this opportunity to try out new technologies and see what they can do for us, particularly in replacing the home-grown things we’ve built over the last 10 years.
Teams
The approach was to take a number of developers, break them down into teams, and those teams could deep-dive into their area of expertise, with daily standups and demos to keep all parties in sync.
Breakdown
Prior to beginning their efforts, the transition to running on OpenShift was broken down into 4 phases.
The teams investigated Phases 1-3 with varying degrees of attention. For ManageIQ Gaprindashvili, there are some important deliverables, so in working on Phase 2, they paid special attention to how the changes could be implemented in Phase 1 for reuse in Phase 2. They also thought ahead, in their designs for Phase 2, to how a "Bring Your Own Image" and microservice world might look in Phase 3.
OpenShift / Kubernetes
The team developed a proof of concept that ran various parts of our application as pods on OpenShift. While the ultimate goal is to run on Kubernetes, some OpenShift specific features were leveraged in the PoC. However upon deeper analysis, we believe we won't need those OpenShift specific features and can run fully on Kubernetes. Running on Kubernetes is preferable as it allows us the opportunity to promote ManageIQ to a much wider audience.
Orchestration
There will be a primary pod known as the ManageIQ Orchestrator. The purpose of this pod is similar to our current evmserverd process, and initially it will be the same code as the evmserverd process. The evmserverd process is aware of the state of workers, knows to spin the number of workers up or down, watches heartbeats for liveness and kills workers as appropriate, and can also watch for CPU and memory thresholds. These are all abilities of OpenShift, so marrying evmserverd with the OpenShift API will allow us to leverage OpenShift and let it do what it does best.
Our new ManageIQ Orchestrator communicates with the OpenShift API, dynamically deploying worker pods, and scaling them up or down based on user changes in the ManageIQ UI. Eventually, it would be preferable to autoscale the workers based on some metrics such as number of requests or queue depth, thereby removing that burden from the administrator, but for now we will leverage the existing code to manually set worker counts.
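As a rough illustration of this idea, here is a minimal Ruby sketch of how an orchestrator process might scale a worker deployment when the administrator changes the worker count. The use of the kubeclient gem, the in-cluster credentials, and the deployment and namespace names are all assumptions for the sake of the example, not the actual implementation.

```ruby
require 'kubeclient'

# Connect to the cluster's apps API group, where Deployments live. Inside a
# pod, the service account token and CA are mounted at well-known paths.
auth = { bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token' }
ssl  = { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }

apps_client = Kubeclient::Client.new(
  'https://kubernetes.default.svc/apis/apps', 'v1',
  auth_options: auth, ssl_options: ssl
)

# Hypothetical example: the administrator raised the generic worker count to 4
# in the ManageIQ UI, so the orchestrator patches the deployment's replica count.
def scale_worker(client, deployment_name, replicas, namespace: 'manageiq')
  client.patch_deployment(deployment_name, { spec: { replicas: replicas } }, namespace)
end

scale_worker(apps_client, 'generic-worker', 4)
```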
The Orchestrator will also be responsible for launching dependent services. Dependent services are components of the architecture that are shared by all components, such as the PostgreSQL database pod, the memcached pod, and others. These dependent service pods will have an OpenShift Service in front of them so they can be internally routed to.
Workers
There are 3 categories of workers: service workers, shared workers, and provider specific workers. These workers will run as separate OpenShift Deployments that are dynamically requested by the Orchestrator.
Service workers are workers that need to be routed to and thus need to be load balanced behind an OpenShift Service. These include the UI worker, API worker, and Websocket worker. Each one is a separate deployment that can be scaled independently. The user’s path to these workers starts at the external OpenShift route, which will accept incoming connections on port 443, handling the SSL negotiation. This traffic will then pass through the external auth container for external authentication, which is described in more detail below. Then, based on the incoming URL, the auth container will route to the appropriate Service for UI, API, or Websockets, and the Service will handle load balancing across the workers of that type. Additionally, the Orchestrator will deal with role-enabled service workers, such as the EmbeddedAnsible worker. This type of worker is only deployed if the corresponding role is also enabled.
Shared workers are workers that do not need to be routed to and thus don’t need a load balancing Service. They are the core workers of the ManageIQ platform. There are 2 types of shared workers: regular and provider-enabled. The regular workers include Generic, Priority, Reporting, and Schedule workers. These will work nearly the same as they do now, and can be scaled up/down as needed. Provider-enabled shared workers, known as "persisters", will only be deployed if a provider has been configured. These "persisters" will be described in more detail in the next part.
Provider specific workers come into play when a provider has been configured; when that happens the Orchestrator will start a number of "collector" workers, "helper" workers, and shared "persister" workers.
First, the Orchestrator will start a number of provider specific "collector" workers, handling inventory collection, metrics collection (if available), and event collection (if available). These "collector" workers will speak to the provider directly, collecting its information and placing it in a well-defined format in the new messaging system. Although the workers will start their lives as mostly the same Ruby code as they are now, they are decoupled from the ManageIQ application, and ultimately can be written in whatever language is best for that provider, and run in whatever image environment they need. This is what we are calling "Bring Your Own Image", described in more detail later.
Additionally, if needed, the Orchestrator will start a number of provider specific "helpers". These include things like the existing VimBrokerWorker for VMware, or perhaps a future native-operations microservice.
Finally, the Orchestrator will start a number of shared workers called "persisters". These persisters are responsible for watching the queues/topics for incoming data and persisting that data to the database. Since the data from the queue will be in a well-defined format, these persisters can be provider agnostic, and so they will be shared across all providers. The inventory persister can leverage a new stream-based, partial update refresh strategy that can update the database in a more real-time fashion. The old refresh strategies will still be available for providers that can’t take advantage of this new strategy.
Authentication
The external authentication mechanisms will be extracted into a dedicated image that will act as a middleman between the external facing route and our internal worker Services. One important goal of the external auth image is that it will be application independent allowing for its reuse with any other container-based product running on OpenShift.
Configuration of the external auth container will be done with two OpenShift ConfigMaps. One ConfigMap is for the external auth configuration itself, including support for IPA, Active Directory, LDAP, and SAML / Keycloak. This ConfigMap will be generated by a separate "helper" container that can be run by the customer directly with an interactive script or interface, helping them set up their configuration. The generated ConfigMap can then be fed into the external auth pod.
The second ConfigMap is the application specific config map that will allow the developers to inject their own httpd conf file detailing the RewriteRules and RewriteConds to where the traffic should be routed within the project. For example, in ManageIQ, we will inject our application’s rules to route to the UI, API, and Websockets Services based on the URL.
The external auth image will also expose a small microservice to facilitate DBUS queries. This allows the application to make queries back to the pod for detailed group information of a particular user, should that information be necessary.
The external auth image will not be concerned with SSL traffic and certificates as that will be handled by the OpenShift Route as mentioned previously.
Note that since the external auth image will be running systemd internally (a requirement for SSSD), it does require the anyuid privilege in OpenShift. On versions of OpenShift that do not have the oci-systemd-hook enabled, such as MiniShift, then an additional sysadmin privilege will be needed.
MiqServers and Zones
In the old appliance model, each appliance was 1-to-1 mapped to an MiqServer, and a Zone was defined as a set of MiqServers. Zones would be used for grouping the MiqServers for various purposes. Combined with the ability to map Providers into a Zone, these purposes can be summarized as follows:
In the new OpenShift model, we can think of all of our nodes as one giant expanse of compute, and thus like one giant MiqServer. The MiqServer concept is therefore no longer necessary, and if MiqServer goes away, the concept of grouping MiqServers into Zones goes away with it. By removing the "grouping" aspect of Zones, we can focus on the deeper underlying use of Zones, which is affinity to resources. OpenShift handles affinities using Labels and Selectors, and we can leverage these to implement Zones.
To achieve the same affinities as previously desired, the user can label their OpenShift nodes with "zone_<zone name>=true", and configure those nodes to have network connectivity to the external resource. These same zone names would be created in ManageIQ, and the providers would be mapped to the zones as is done now. The Orchestrator, aware that the provider is zone-restricted, would apply that zone selector when dynamically launching the workers, and those workers would be scheduled to run only on those nodes.
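To make the label-and-selector idea concrete, here is a hedged Ruby sketch of how the Orchestrator might attach a zone selector when it builds a worker Deployment definition. The label key format, resource names, and image name are illustrative only, not the finalized design.

```ruby
require 'kubeclient'

# Illustrative only: build a Deployment definition for a provider-specific
# collector and pin it to nodes labeled for the provider's zone.
def collector_deployment(provider_name, zone_name, image)
  Kubeclient::Resource.new(
    metadata: { name: "#{provider_name}-collector", namespace: 'manageiq' },
    spec: {
      replicas: 1,
      selector: { matchLabels: { app: "#{provider_name}-collector" } },
      template: {
        metadata: { labels: { app: "#{provider_name}-collector" } },
        spec: {
          # Only schedule onto nodes the operator labeled, e.g. with
          # "zone_dmz=true" for a zone named "dmz".
          nodeSelector: { "zone_#{zone_name}" => 'true' },
          containers: [{ name: 'collector', image: image }]
        }
      }
    }
  )
end

# apps_client.create_deployment(
#   collector_deployment('vmware', 'dmz', 'manageiq/vmware-collector:latest'))
```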
One potential downside to this approach is that shared workers, such as generic workers, cannot be run with specific selectors, as their replicas are automatically handled by OpenShift and they do not have an identity to which we could apply the selector. This means that we cannot have specific generic work items routed to a zone. However, in analyzing all of the callers of the MiqQueue, it was found that all of the usages of zone fell into one of three categories, all of which can be handled with some code changes.
Additionally, there is one huge advantage to implementing Zones in this fashion. In the current model, since a Zone doubles as a "grouping" of MiqServers, there is no way to have an MiqServer in two different zones. In the new model, the same OpenShift node can be labeled with multiple zone labels. This means that depending on the network topology, this could reduce a significant number of appliances/nodes that would otherwise be necessary by allowing the customer to "double up" where the networks are shared.
Settings
An interesting side effect of removing the MiqServer concept (and by extension, the "grouping" aspect of Zones), is that Zone-level and MiqServer-level settings are no longer needed. (Note that this section is conceptual, and has not been vetted as part of the rearchitecture efforts)
We will still need the layering aspect to handle productization and environment specific settings. These are driven through files and would be part of the base container image. Changes to the configuration are currently stored in the database in the settings_changes table, and this can continue in the short term; however, the original purpose of that table was for MiqServer-level changes, and with those removed, we will not need the settings_changes table in its present form, or perhaps at all.
One possible replacement for settings_changes is to just store the changes in a single file, and persist it somewhere. One option is to do this through a settings microservice, so that it can be the single source of truth for the "current" settings. Alternately, we could use a ConfigMap, but that might force redeployments and would be difficult to change via our UI.
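As a purely conceptual sketch of the layering described above (the file paths, keys, and merge behavior here are assumptions, not the vetted design), the base settings shipped in the image could be deep-merged with a single persisted changes document:

```ruby
require 'yaml'

# Conceptual sketch: layer the settings shipped in the image with a single
# persisted "changes" document (from a file, a settings microservice, or a
# ConfigMap mount). Paths and keys are illustrative.
BASE_SETTINGS_FILES = %w[config/settings.yml config/settings/production.yml].freeze
CHANGES_FILE        = '/persistent/settings_changes.yml'.freeze

def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_val, new_val|
    old_val.is_a?(Hash) && new_val.is_a?(Hash) ? deep_merge(old_val, new_val) : new_val
  end
end

def effective_settings
  base = BASE_SETTINGS_FILES.select { |f| File.exist?(f) }
                            .map { |f| YAML.safe_load(File.read(f)) || {} }
                            .reduce({}) { |acc, layer| deep_merge(acc, layer) }
  changes = File.exist?(CHANGES_FILE) ? (YAML.safe_load(File.read(CHANGES_FILE)) || {}) : {}
  deep_merge(base, changes)
end

# effective_settings.dig("workers", "generic", "count")
```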
Replication
Multi-Region replication is currently implemented using the pglogical extension within PostgreSQL. pglogical uses logical replication over the same 5432 port that regular database traffic uses. Thus, the only requirement to make replication work is to be able to expose that port outside of the OpenShift cluster. There are a number of ways to expose services externally [1], and the user can choose which one works best for them.
By default, we will use the Ingress IP Self-Service [2] method. This works via a configuration on the OpenShift Service for the PostgreSQL pod. When set, this tells OpenShift to open a random port in the 30k-60k range on all nodes in the cluster, and map that port to the internal Service’s port. When configuring replication in ManageIQ, all that needs to be done on the global region is to use the hostname of any node in the cluster, and that random port number.
Memcached and Sessions
In the appliance model we ship with memcached as an efficient memory-based session storage for the UI. However, since memcached runs on each appliance, sessions must be made sticky to a particular appliance, otherwise if traffic is routed to a different UI worker on a different appliance, the session will be lost and the user logged out. To complicate the matter further, if the customer puts the multiple UI appliances behind an external load balancer, there is no way for us to make the sessions sticky, and so they are forced to switch the session storage to database-backed storage.
In the new OpenShift model, we can ensure that there is one shared memcached instance for the entire region. This completely eliminates the need for sticky sessions, and simultaneously eliminates the need to maintain the code for database-backed session storage, greatly simplifying the entire problem.
Additionally, leveraging OpenShift allows us to experiment with new technologies. For example, the UI team has demonstrated that when running on OpenShift we can replace the backend storage for ActiveJob connections with Redis, which has full upstream Rails support, instead of using the current PostgreSQL backend, which has very little upstream Rails support. Then, once we have Redis as a dependent service, that could be used as a replacement for memcached.
Dual Deployment
Even with the ability to run natively in OpenShift, we still have customers that will be using VMware, RHV, etc., and want to deploy with the appliance model. From a development perspective, it is a costly overhead to maintain running on multiple platforms. Dual Deployment is the term for changing the way the appliance works such that each appliance is an OpenShift node inside. This way, we eliminate the overhead of ManageIQ running on multiple platforms, and can focus on running only on OpenShift. The customer, however, should not need to be aware that they are using OpenShift under the covers, nor should they need to become familiar with the various OpenShift terms and administration procedures.
Leveraging the openshift-ansible installer, we will install OpenShift inside a virtual image with a RHEL Atomic base. The image can be pre-populated with all of the images for the workers, or alternatively, we could have a smaller virtual appliance by not including the images, and they could be fetched on first use.
The first instantiation of the appliance would be the OpenShift master node. Subsequent virtual appliances would join the cluster as a regular node. In order to make this simpler for customers, we will need an interface similar to the existing appliance_console, but where the user can choose to create the first master node, or join the cluster. Additionally, the new console will need a way to allow setting zone labels on the node. One possible implementation could be to run Cockpit on the appliance, and implement this interface as a Cockpit module. (Note that the details in this paragraph are conceptual and have not been vetted as part of the rearchitecture efforts)
Service Providers
An important use case to consider is how service providers can provide a seamless experience to their customers without resorting to multiple regions. (Note that this section is conceptual, and has not been vetted as part of the rearchitecture efforts)
Presently, there are a number of constructs within ManageIQ that aid in supporting this, including tenants, appliance-specific branding, and appliance-specific external authorization. One example implementation in the current model is to set up a UI Zone for each tenant, apply appliance-specific branding and external auth configuration to each appliance in that Zone, create an external LoadBalancer in front of those UI appliances, and then set up the DNS entry for that sub-customer to point to that LoadBalancer. This requires a lot of extra appliances and external resources, and is a pain to maintain.
In the new OpenShift model, the MiqServers will no longer exist, but that is where appliance-specific branding is stored today. Instead, we will need a different way to store branding, and we can leverage the tenancy structure, which already has support for branding. With the addition of a mapping between incoming hostname and a tenant, we can provide a tenant-specific login screen, and once a user logs in, we know their tenant and can show the appropriate application-level branding. For example, if company-a.example.com is mapped to TenantA and company-b.example.com is mapped to TenantB, we would show the appropriate branding based on the hostname of the incoming request and/or the user’s tenant association.
If separate SSL certificates are required for different hostnames, that can be implemented by having separate OpenShift Routes. The DNS entries would then route to the correct Route, and the routes would map to the same external authentication pod.
If separate external authentication is needed, then we can spin up a separate external auth deployment for each tenant, but map the outgoing traffic to the same internal UI, API, and Websocket Services.
Leveraging OpenShift will allow us to have a much simpler Service Provider experience, potentially eliminating a lot of extra appliances and workers that would otherwise be duplicated.
Log Collection
For diagnostic purposes and bug triaging, log collection is of utmost importance. However, in the OpenShift model, where pods are coming and going, it would be very difficult to collect log files on demand, not to mention that log files for terminated pods are not retained after some time. This is where the Red Hat Common Logging initiative will help us.
Common Logging is implemented using the EFK stack (ElasticSearch, fluentd, Kibana). Fluentd is configured to ingest anything written to STDOUT by the various pods and write that information into ElasticSearch. If the output written to STDOUT is in JSON with specific keys, then that output can be parsed and put into ElasticSearch in a much more discoverable way. Kibana can then be used to look at the data stored in ElasticSearch so a user can visualize and search.
For the purposes of offline log inspection, however, we would need these logs exported. This can be done using a tool called elasticdump, which will export the Elastic database. That export can be transferred to us, where we can import it and inspect it with Kibana.
For the Phase 1 implementation, we have already changed our logs to write to STDOUT in a well-defined JSON format, so Common Logging can already be used if it is present. For Phase 2+ we would need to ensure that either OpenShift comes with Common Logging enabled, or we would need to launch those services ourselves.
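For illustration, a minimal Ruby logger that emits one JSON document per line to STDOUT might look like the following. The field names are placeholders, not the exact keys used by the Phase 1 work.

```ruby
require 'json'
require 'logger'
require 'time'

# Minimal structured logger: one JSON object per line on STDOUT, which fluentd
# can pick up and index into ElasticSearch. Field names are illustrative.
logger = Logger.new($stdout)
logger.progname  = 'generic-worker'
logger.formatter = proc do |severity, time, progname, msg|
  {
    '@timestamp' => time.utc.iso8601(3),
    'level'      => severity,
    'service'    => progname,
    'message'    => msg
  }.to_json + "\n"
end

logger.info('Dequeued 25 messages')
# {"@timestamp":"2017-09-01T12:00:00.000Z","level":"INFO","service":"generic-worker","message":"Dequeued 25 messages"}
```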
Database Backup and Restore
In the appliance model, database backup and restore are part of the responsibilities of the appliance, but in OpenShift we will not have access to the underlying storage of the PostgreSQL database from the workers nor from the new appliance_console. Instead of the application being responsible for backup and restore, we will transfer this responsibility to the customer via one of two methods. If the customer is using an underlying storage system that supports storage snapshots, then they can use those. Alternately, a dedicated container, run as an OpenShift Job, will connect to the database pod and perform a full binary backup of the entire database cluster, based on pg_basebackup. For the Phase 1 implementation, this has already been documented [3].
In the Dual Deployment appliance model, we will need to automate this via the new "appliance_console". (Note that this paragraph is conceptual, and has not been vetted as part of the rearchitecture efforts)
Upgrades
The ManageIQ Orchestrator is responsible for starting and stopping all of the workers and services, so an upgrade is a matter of just updating the Orchestrator. Once the Orchestrator is updated and redeployed it could bring down all running instances and relaunch them. Database migrations are always run by the Orchestrator on startup. (Note that this section is conceptual, and has not been vetted as part of the rearchitecture efforts)
Performance
The performance team focused on what could be done with the existing workers to lower the memory usage, and thus allow us to streamline our workers better. They did not focus on what kind of savings we would get just switching to the OpenShift model alone, but we are expecting some savings that will come from:
Bundler Groups
Bundler groups are a way of categorizing the various rubygems that comprise the application, and are defined in the Gemfile. When gems are categorized by feature, we can specify which groups to enable or disable when launching a worker, significantly reducing the amount of code loaded, the memory used, and the overall boot time of a worker. For the workers with the most savings, we found we could drop the worker baseline memory usage from ~144MB to ~100MB, or ~30.5%. Additionally, less memory means less time spent in GC in Ruby, making the Ruby processes more efficient.
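As a simplified, hypothetical example of this technique (the group and gem names below do not reflect the actual ManageIQ Gemfile), gems can be grouped by feature and only the needed groups loaded at worker boot:

```ruby
# Gemfile (hypothetical grouping; names are illustrative)
source 'https://rubygems.org'

gem 'rails'                       # always loaded (default group)

group :ui, optional: true do
  gem 'sassc-rails'
end

group :metrics, optional: true do
  gem 'some-metrics-client'       # placeholder gem name
end

# Worker boot (e.g. in the worker's entrypoint): load only the default group
# plus the feature groups this worker actually needs.
#
#   require 'bundler'
#   Bundler.require(:default, :metrics)
#
# A metrics collector started this way never loads the UI gems, reducing
# resident memory and boot time.
```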
Additional Potential Savings
Queue
The Queue team focused on our use of the MiqQueue from two perspectives. One was examining how we actually use the MiqQueue, and the other was investigating a number of messaging systems to see which one fulfilled as many of our use cases as possible. ActiveMQ Artemis was ultimately chosen as the new successor to the MiqQueue.
manageiq-messaging gem
The team built an abstraction layer gem over the ActiveMQ Artemis client libraries, named manageiq-messaging. In doing so they created a simple API, which will simplify the transition away from the MiqQueue.
The gem also creates abstractions for 3 different use cases: background jobs, queue messages, and topics (aka pub/sub).
Background Jobs are very similar to the current MiqQueue style, where the producer defines the exact class, method, and args. A worker subscribing to background jobs, such as the generic worker, does exactly what it’s told to do, so the logic for "how" to do the work is on the producer side.
Queue messages are slightly different than Background Jobs in that they represent a request for action, but without specifying how to accomplish the request. The "how" to do the work is implemented on the subscriber side instead.
For example, with background jobs, if someone clicks the start button on a VM in the UI, you would put the specific provider’s class name, and the "start" method, on the queue. An operations worker might pick up that work and run it exactly as described. However, with queue messages, we instead just put a "start" message with the provider id into a general queue or into a provider specific queue. The operations workers could then watch that queue filtering out messages it doesn’t care about and only handling the messages it does care about. The logic for "how" to handle the message is in the operation worker. This distinction is important as it can allow the platform (the producer side) to be more provider agnostic, leaving the details to provider-specific workers (the subscriber side).
The third use case is topics (aka pub/sub). Topics are similar to an event stream where a producer just emits events into the topic, and multiple subscribers can act upon some or all of the messages they see in that topic. Topics are very useful as the channel between provider-specific collectors and provider-agnostic persisters, as described previously and in more detail later.
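To illustrate the three use cases conceptually, the sketch below shows how a client wrapping the broker might be used. This is a hypothetical abstraction, not the actual manageiq-messaging API; the client class, method names, service names, and payload shapes are all assumptions.

```ruby
# Hypothetical client abstraction; method names are illustrative and do not
# document the real manageiq-messaging gem.
client = MessagingClient.connect(host: 'artemis', port: 61616)

# 1. Background job: the producer states exactly what to run ("how" lives here).
client.publish_background_job(service: 'generic',
                              class_name: 'MiqReport',
                              method_name: 'generate',
                              args: [42])

# 2. Queue message: a request for action; the subscriber decides "how".
client.publish_message(service: 'ems_operations',
                       message: 'start_vm',
                       payload: { ems_id: 1, vm_ref: 'vm-123' })

client.subscribe_messages(service: 'ems_operations') do |msg|
  # Filter out messages this worker doesn't handle, act on the rest.
  puts "starting #{msg[:payload][:vm_ref]}" if msg[:message] == 'start_vm'
end

# 3. Topic (pub/sub): every subscriber sees the stream and picks what it needs.
client.publish_topic(service: 'provider_events',
                     event: 'vm_powered_off',
                     payload: { ems_id: 1, vm_ref: 'vm-123' })

client.subscribe_topic(service: 'provider_events') do |event|
  puts "persisting #{event[:event]}"
end
```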
Ongoing Challenges
Providers
In general, the greatest performance bottleneck for nearly all aspects of providers is the usage of the MiqQueue. So, the focus of the providers team was primarily finding ways to change how collectors work so that they wouldn’t use the MiqQueue anymore, or if a messaging system was still needed, leverage features from that new system instead. Additionally, some aspects of the code organization lead to process bloat, which has been addressed in the new designs.
Collector / Persister split
A common pattern for the rearchitecture of providers is what we are calling the "Collector / Persister split". This refers to the separation of native-side collectors and platform-side persisters. In the current application, the collector/persister split only exists for events, with the MiqQueue as the intermediary. For inventory and metrics, collection and persistence happen in the same process. Being in the same process tends to bloat that process, because it must carry the code and data for both sides of the equation.
Bring Your Own Image
An important goal we are striving for is "Bring Your Own Image". Letting provider authors write code in the language they find best allows them to focus on the task at hand and to use the most optimized languages and technology. Typically, the provider’s client library that is best kept up-to-date is the one written in the native language, so allowing developers to write in that language allows them to stay as up-to-date as possible. Additionally, as ManageIQ grows, we want it to be the de-facto platform for management, and one important way to accomplish this is to ensure that we are aligned with the upstream communities of the providers we manage. "Bring Your Own Image" helps that effort by keeping the code consistent with the code of the provider, thus allowing any direct contributor to the upstream provider to be a contributor to the management plugin. For example, authors of a Go-native provider don’t have to learn Ruby to contribute to ManageIQ, but can write their provider plugin in Go, using their Go-native client library.
Events
Since events mostly follow the collector / persister split already, the focus was on eliminating the MiqQueue. The MiqQueue becomes a bottleneck because it has a difficult time handling a large stream of events. A heavily used provider, or a flood of events (aka an "event storm"), can bog down the MiqQueue itself, which in turn bogs down the entire application.
The proposal is to replace the MiqQueue directly with ActiveMQ Artemis topics. Collectors (which can be written in their own language) will capture the events and write the data directly to the ActiveMQ Artemis topic. From there, multiple subscribers can read from that topic, handling the events as appropriate. One subscriber will be the platform-side EventHandler (renamed to EventPersister), which will watch the topic for events it cares about and write them to the database for reporting purposes in timelines. It will ignore events it doesn’t care about, such as those that won’t ever appear on the timeline. A second platform-level subscriber, named the AutomateEventHandler, will watch the topic for events that the user has written automate handlers for, and react to only those events. Providers themselves may choose to write provider-level subscribers, such as an inventory collector driven off of events instead of polling. By subscribing to the events, they can use the data stored in the events to do a more efficient inventory collection.
Inventory
For inventory, some of the providers follow a collector/persister split at the source level, but the code still runs in the same process. This causes massive memory bloat, because typically the process collects a lot of data, keeps it all in memory, and can’t free that memory until the persister has completed. The persister needs to make a "copy" of that data for the purposes of writing to the database, and thus you end up with duplicate copies of the data in memory. From a scalability perspective this is a huge problem for very large providers, in particular the public images of a cloud or container provider.
The proposal to solve these problems is to enforce the collector / persister split. Collectors (which can be written in their own language) are responsible for collecting the data from the provider and placing it into an ActiveMQ Artemis queue in a well-defined format. On the platform side, persister processes will watch the queues and write that data to the database. This keeps the processes separate, keeping memory levels stable.
Another problem on the inventory side is the usage of the MiqQueue to communicate requests for updates to the refresher process. The inventory code, in order to prevent duplicate requests in the queue, and to "rollup" requests for the same provider, modifies existing queue items to "add" provider requests to them. As described earlier, the MiqQueue feature of being able to modify entries is a major cause of problems in the queue, and is also not available in a new messaging system.
To avoid the queue manipulations, collectors will be responsible for knowing what to collect and, more importantly, when to collect it. Every provider has drastically different mechanisms for knowing when to collect: some rely on events, some have callback mechanisms, and others have nothing except constant polling. For example, in the VMware provider, we have the WaitForUpdates method, which allows a callback with the exact inventory changes that could pretty much be written directly to the database, but we can’t use it. Instead, we use events, and part of event handling is to put a targeted or full refresh request on the queue. This makes it impossible to leverage the super-efficient mechanism already provided by VMware. Leveraging the WaitForUpdates method directly would also significantly reduce collector memory, because only a very small number of changes would be in memory at any given time.
A third problem on the inventory side is that most providers can only handle a full refresh, meaning that the entirety of the inventory must be queried for, compared to the database, and the changes written. For the initial refresh, and refreshes where you want to "fix" the data such as on a reboot, this is acceptable, but for regular ongoing updates it is not. We do have the concept of "targeted refresh", but it is complicated for a provider author and only supports a small set of objects, namely Vms + Hosts on Infra and Cloud providers only.
Going back to our VMware example, even if we used WaitForUpdates directly, we couldn’t update the database because the only refresh strategies we have are full refresh or targeted refresh. So, the proposal to solve this is two-fold. First, a new graph refresh strategy is to be implemented (development for this had already started prior to the rearchitecture). The graph refresh is a more advanced refresh strategy which can understand partial updates to the inventory, allowing targeting of any object in the inventory data. Second, the provider collector will publish these partial updates to an ActiveMQ Artemis queue, where the persister will watch for these partial updates and apply them. The data will thus come into the system in a more real-time fashion. In the example VMware provider, the results of WaitForUpdates can be put as a partial update into the queue.
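As an illustration of the well-defined payload idea, a collector reacting to a WaitForUpdates-style change might publish a partial update like the following. The field names, queue name, and the messaging client referenced in the comment are assumptions, not a finalized schema.

```ruby
require 'json'
require 'time'

# Illustrative partial-update payload a collector might publish after seeing a
# single VM change; field and queue names are assumptions, not a final schema.
partial_update = {
  ems_ref:    'vm-123',               # provider-native identifier
  type:       'vm',
  timestamp:  Time.now.utc.iso8601,
  attributes: { power_state: 'poweredOff', memory_mb: 8192 }
}

# Using the hypothetical messaging client from the earlier sketch:
#
#   client.publish_message(service: 'inventory', message: 'partial_update',
#                          payload: partial_update)
#
# A provider-agnostic persister subscribed to the 'inventory' queue would apply
# this change to the database via the graph refresh code, without ever holding
# the provider's full inventory in memory.
puts partial_update.to_json
```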
Metrics
The metrics team mostly focused on the backend storage mechanism for metrics, believing it to be the primary bottleneck. As such, much of the time was spent in investigating alternate backends until we realized we have higher-level problems in the application itself.
Storage backend
The current storage mechanism for metrics is the PostgreSQL database. The team researched a number of Time Series Databases (TSDBs), but each one researched seemed to have some problem that gave us pause in choosing that database as the new backend storage mechanism. Additionally, we found that our problems come more from how we read and write data into the current storage mechanism, than from the storage itself.
One of these problems is what we call the "20 second interval" problem. The original metrics implementation was written when VMware was the only provider we supported, and thus many of the decisions made were based around VMware’s "realtime" collection interval of 20 seconds. The database storage mechanism itself doesn’t care about 20 second intervals; however, the application, on both the reading and writing sides, expects the data in 20-second intervals. Even if we changed the storage backend, we would still have this problem and would have to account for it anyway. One major consequence is that provider authors are constrained, and many need to add incredibly complicated code just to manipulate their metrics into 20-second buckets. The only way to solve this is to completely remove the 20-second interval restriction on both the writer’s side and, more importantly, on the reader’s side.
Another problem is the strict schema of our metrics tables. Each row in the table stores the metric values in columns for that timestamp. However, this restricts the metrics to a predefined set of columns. New providers bringing new metrics, or even new metrics in existing providers, require changes to the schema, which makes it more difficult to bring new things to the application. The only way to solve this is to completely replace the strict schema with a more flexible mechanism. This could be implemented by choosing a new storage backend or by changing the existing storage backend to use a jsonb column.
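One way the jsonb option could look is sketched below. This is purely illustrative; the table, column, and class names are not the actual ManageIQ schema, and it assumes Rails with the PostgreSQL adapter.

```ruby
# Purely illustrative ActiveRecord sketch of a flexible metrics schema.
class CreateFlexibleMetrics < ActiveRecord::Migration[5.0]
  def change
    create_table :flexible_metrics do |t|
      t.bigint   :resource_id,   null: false
      t.string   :resource_type, null: false
      t.datetime :timestamp,     null: false
      t.jsonb    :values,        null: false, default: {}   # arbitrary counters per sample
    end
    add_index :flexible_metrics, [:resource_type, :resource_id, :timestamp]
  end
end

class FlexibleMetric < ActiveRecord::Base
  # A provider can store any counters it has, at any interval it collects:
  #   FlexibleMetric.create!(resource_type: 'Vm', resource_id: 1,
  #                          timestamp: Time.now.utc,
  #                          values: { 'cpu_usage_rate_average' => 12.5,
  #                                    'net_usage_kbps'         => 340 })
end
```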
Scheduling of collection
As far as scalability goes, the existing database, while it has some problems that need to be addressed, doesn’t really contribute to the main scalability problem. The bigger problem is in how we schedule metrics collections, and how those collection requests are communicated to the collector workers via the MiqQueue.
In the current architecture, collector workers are not provider-specific, but are instead "shared" workers that do both collection and persistence, leading to multiple problems.
The proposed solution is to change how metrics collection scheduling works by following the collector / persister split pattern. Collectors (which can be written in their own language) are responsible for collecting the metrics from the provider and writing them into an ActiveMQ Artemis queue. Collectors can collect in whatever way they deem most efficient. More importantly, much like in the inventory changes, it will be up to the collector to determine what data to collect and when. This eliminates the platform-side scheduling bottlenecks entirely, and allows the provider author to decide how best to determine which metrics to collect and how frequently. Persisters will then read from that ActiveMQ Artemis queue and persist the metrics to the storage backend. Additionally, this could theoretically allow for the implementation of a long-requested RFE where the provider may want to write "directly" into the ManageIQ database. It could be implemented as a "collector", but one that runs in the provider itself, writing to a well-defined API endpoint, which, in turn, would write to the same ActiveMQ Artemis queue.
ActiveMetrics gem
Much like the abstraction layer written over ActiveMQ Artemis, there will also be an abstraction layer written over the storage backend, which we have called ActiveMetrics. In order to get our code away from the 20-second intervals and the details of how the records are stored, ActiveMetrics will provide an abstraction for both reading and writing, allowing us to transition to alternate backends.
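Since ActiveMetrics is only an abstraction layer, its public interface might look roughly like the following. This is a hypothetical sketch, not the actual gem's API; the module, method, and backend names are assumptions.

```ruby
# Hypothetical sketch of what an ActiveMetrics-style abstraction could expose.
module ActiveMetrics
  # Writers don't care about intervals or storage layout: they just record
  # samples with a timestamp and a hash of counter values.
  def self.write(resource:, timestamp:, values:)
    backend.write(resource: resource, timestamp: timestamp, values: values)
  end

  # Readers ask for a time range and get samples back at whatever granularity
  # the backend stored, with any rollups handled behind the abstraction.
  def self.read(resource:, counters:, start_time:, end_time:)
    backend.read(resource: resource, counters: counters,
                 start_time: start_time, end_time: end_time)
  end

  def self.backend
    @backend ||= PostgresBackend.new   # placeholder; could later be a TSDB adapter
  end
end
```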
Automate
A number of challenges have been found with the existing automate infrastructure, which led us to investigate what changes we could make if we could leverage OpenShift.
Launching methods and DRb
In the current application, automate methods are written in Ruby and are launched in a separate child process. This separate process is important because it sandboxes the customer-written methods (to an extent), and prevents them from modifying the ManageIQ application directly, even accidentally. In order to communicate with the child process we use DRb. DRb (Distributed Ruby) is a Ruby-based inter-process communication layer that uses serialization of Ruby objects. While very useful, it is extremely tricky to manage properly and has led to a number of major escalations over the years, not to mention the memory bloat and performance problems it has introduced. Over time we’ve come to learn that the usage of DRb is a major source of problems within our system.
When we evaluate an Automate model, the resolution’s results are stored in a "workspace" in the process’ memory. As we walk through the steps of the resolution we continually update that workspace. When we launch an automate method written in Ruby, we establish a DRb connection between that child process and the parent process, so that the child process can modify the workspace stored in the parent and can also access any of our objects from the database. However, by leveraging DRb, not only do we have the aforementioned problems, we also cannot support any method language other than Ruby (including the new Ansible playbook methods).
To solve this, in Gaprindashvili, we are creating an alternate way to access the workspace via the API. Before launching an automate method, the parent process will export the in-memory workspace into the database, where it can be accessed via the API. Then, the child process can modify that workspace over the API, and it can also access any information it needs from the API. When the method is done, the parent process can look at the changes saved to the database and update its in-memory model accordingly. This further isolates the automate methods, and creates a uniform, cross-language way to communicate with the system.
With the API-based communication in place, we can then leverage OpenShift by having each automate method run as a container. When defining a new automate method, the author would choose the image for that method. A new automate worker would be responsible for talking to the OpenShift API, and launching these automate methods as deployments, on demand. By isolating with containers, we not only provide even stronger sandboxing, but can also allow "Bring Your Own Image" environments for those methods. One problem that automate method authors run into is that they are forced to use not only Ruby, but also whatever Ruby gems are available on the appliance, and modifying the appliance environment is tricky and potentially dangerous. With "Bring Your Own Image", that problem goes away entirely, as the customer can put whatever they want into the image, writing in whatever language they want, and do not have to manipulate the ManageIQ environment.
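A containerized automate method could then talk to the workspace purely over REST. The sketch below only illustrates the flow; the endpoint paths, environment variables, token mechanism, and payload shape are hypothetical assumptions, not the actual API.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical containerized automate method: fetch the exported workspace,
# change a value, and write it back over the API.
api_base = ENV.fetch('MIQ_API_URL', 'https://manageiq/api')
token    = ENV['MIQ_AUTH_TOKEN']
ws_id    = ENV['MIQ_WORKSPACE_ID']

def request(method_class, uri, token, body = nil)
  req = method_class.new(uri, 'Content-Type' => 'application/json',
                              'X-Auth-Token' => token)
  req.body = body.to_json if body
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == 'https'
  JSON.parse(http.request(req).body)
end

# Read the workspace the parent process exported before launching this method.
workspace = request(Net::HTTP::Get,
                    URI("#{api_base}/automate_workspaces/#{ws_id}"), token)

# Do the method's work, then push the modified values back.
workspace['values']['approved'] = true
request(Net::HTTP::Put, URI("#{api_base}/automate_workspaces/#{ws_id}"), token,
        { 'values' => workspace['values'] })
```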
Additionally, there has been a long-requested RFE to have specific methods run in specific Zones due to the affinity reasons described above. This would now be much easier to implement by having each method define what zone it needs to run in, and with Zones defined as OpenShift Labels, the Automate worker would then launch the automate method’s container with the appropriate selector.
[1] https://docs.openshift.com/container-platform/3.6/dev_guide/getting_traffic_into_cluster.html
[2] https://docs.openshift.com/container-platform/3.6/dev_guide/getting_traffic_into_cluster.html#using-ingress-IP-self-service
[3] https://github.com/ManageIQ/manageiq-pods#backup-and-restore-of-the-miq-database