RamenDR / ramen


Measure and graph velero backup and restore system performance impact varying API server object count, types, and sizes #973

Open · hatfieldbrian opened this issue 2 years ago

hatfieldbrian commented 2 years ago
pmuench1717 commented 2 years ago

RamenDR/ShioRamen#11

tjanssen3 commented 2 years ago

Since this task has been open for a while, I wanted to summarize some findings thus far across a few comments. I've deployed Velero via the OADP Operator to a vSphere cluster called cluster7.local, which runs OpenShift 4.7.

A WIP data collection spreadsheet can be found here; working files and scripts for portability are here.

tjanssen3 commented 2 years ago

The overall goal of this task is to determine how object counts and sizes affect Velero backup and restore performance. Since object size has a known upper bound (the size of etcd in the cluster), it was investigated first.

**Breaking cluster7 and cluster8 with too many ConfigMaps**

Object size was tested by creating large objects and placing them into etcd. ConfigMaps were chosen because they can contain arbitrary structures and don't need to be scheduled like Pods. Originally, ConfigMaps about 250k in size were used. Using kubectl apply -f adds annotations that double the stored size of the file to 500k, and there is also a limit of 262144 bytes on annotations. Using kubectl create -f does not add annotations, so the file size is preserved (250k here). Since the default maximum request size in etcd is 1.5MB, that became the new target. Attempting to create a 1.5MB ConfigMap where all the data was in a single field produced an error: The ConfigMap "game-demo-0" is invalid: []: Too long: must have at most 1048576 bytes. I believe this error comes from all the data residing in a single field, but since ~1MB is an easy number to do math with, I didn't investigate further and started creating ConfigMaps of 1MB size with kubectl create -f.

The script for creating ConfigMaps in large numbers is here. The API server became unresponsive after adding ~680MB of ConfigMaps: any operation that required the kube API server would time out. Interestingly, the nodes remained available to ssh into via the core user, so they weren't taken down completely.
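For reference, here is a minimal sketch of the kind of generation script described above (the real script is linked above); the count, namespace, and payload size are illustrative:

```bash
#!/usr/bin/env bash
# Minimal sketch: create COUNT ConfigMaps of roughly 1MB each with
# `kubectl create -f`, which (unlike `kubectl apply -f`) does not add the
# last-applied-configuration annotation that doubles the stored size.
set -euo pipefail

COUNT="${1:-10}"                  # number of ConfigMaps to create (illustrative default)
NAMESPACE="${2:-velero-test-cm}"  # target namespace (illustrative)
# ~700k of random bytes base64-encodes to ~930k, safely under the 1048576-byte limit hit above.
PAYLOAD="$(head -c 700000 /dev/urandom | base64 -w0)"

kubectl create namespace "${NAMESPACE}" 2>/dev/null || true

for i in $(seq 1 "${COUNT}"); do
  cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: game-demo-${i}
  namespace: ${NAMESPACE}
data:
  payload: "${PAYLOAD}"
EOF
done
```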

Since etcd reported ~7.4GB of space available in total and we had only added ~680MB of ConfigMaps, this looked like a bug, and we contacted Red Hat support to help us debug it (issue here). Summary: to support performance guarantees, swap space is disabled (validated with free -h); when the system began to run out of memory (etcd copies are kept on the master nodes), various pods began to crash or become unresponsive, without any indication as to why. The solution: add 16GiB of additional memory to each master node, bringing them to 32GiB (we had used the minimum requirements before). Cluster7 and cluster8 (not discussed yet, but the same problem existed on both clusters) became usable again.
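For anyone reproducing this, the checks above amount to standard OpenShift commands; pod and node names below are illustrative and will differ per cluster:

```bash
# etcd database size vs. quota:
oc -n openshift-etcd get pods -l app=etcd
oc -n openshift-etcd exec etcd-master-0 -c etcdctl -- etcdctl endpoint status --write-out=table

# Confirm swap is off and check remaining memory on a master node:
oc debug node/master-0 -- chroot /host free -h
```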

tjanssen3 commented 2 years ago

**Some resource backups take longer than others**

After taking backups of various resources, a pattern emerged: there are, effectively, two classes of resources, one that backs up slowly and another that backs up quickly. The slow group consists of deployments, replicaSets, and pods. The fast group is everything else we tried (configMaps that are 4k in size, secrets, CRDs, CRs, PVs, PVCs, clusterRoles, clusterRoleBindings, services, serviceAccounts).
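For context, per-resource backups were timed roughly like this (the backup name is illustrative, the namespace matches the test workloads, and the actual numbers live in the linked spreadsheet); it assumes the velero CLI is available against the cluster:

```bash
# Back up a single resource type and note how long it takes. Start/completion
# timestamps also appear in the backup status.
velero backup create backup-deployments-only \
  --include-namespaces velero-test-busybox-1k \
  --include-resources deployments.apps \
  --snapshot-volumes=false \
  --wait

velero backup describe backup-deployments-only --details
```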

A cursory investigation of the Velero code base suggested that all resources are backed up the same way and are eventually copied to JSON with DeepCopy. The theory became that because the "slow" resources had more fields, they were slower to back up. This was disproved by matching the number of fields in the average deployment we used (67) with a dummy CR with the same number of fields. Backing up the same number of CRs took significantly less time (~40 seconds versus 3 minutes and 40 seconds), so a new theory was needed.
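A hedged sketch of that field-count experiment: a throwaway CRD with an open schema plus CRs carrying ~67 spec fields. The group, names, and field layout are made up here for illustration; the actual dummy CR definition is in the working files linked earlier.

```bash
#!/usr/bin/env bash
# Create a throwaway CRD that accepts arbitrary fields, then create CRs that
# each carry ~67 spec fields, to compare backup time against Deployments with
# a similar field count.
set -euo pipefail

cat <<'EOF' | kubectl create -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: dummies.perftest.example.com
spec:
  group: perftest.example.com
  scope: Namespaced
  names:
    plural: dummies
    singular: dummy
    kind: Dummy
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
EOF

COUNT="${1:-100}"
NAMESPACE="${2:-velero-test-dummy}"
kubectl create namespace "${NAMESPACE}" 2>/dev/null || true

for i in $(seq 1 "${COUNT}"); do
  {
    echo "apiVersion: perftest.example.com/v1"
    echo "kind: Dummy"
    echo "metadata:"
    echo "  name: dummy-${i}"
    echo "  namespace: ${NAMESPACE}"
    echo "spec:"
    for f in $(seq 1 67); do
      echo "  field${f}: value-${f}"
    done
  } | kubectl create -f -
done
```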

When backup objects are created, they come with a log file. Investigating backup log files showed that additional steps are performed on each object when it comes to Deployments and Pods. Entries like the following can be seen:

time="2022-04-14T19:37:56Z" level=info msg="Backed up 1 items out of an estimated total of 3353 (estimate will change throughout the backup)" backup=openshift-adp/backup-busybox-3353-deployments logSource="pkg/backup/backup.go:398" name=busybox-1 namespace=velero-test-busybox-1k progress= resource=deployments.apps
time="2022-04-14T19:37:56Z" level=info msg="Processing item" backup=openshift-adp/backup-busybox-3353-deployments logSource="pkg/backup/backup.go:358" name=busybox-10 namespace=velero-test-busybox-1k progress= resource=deployments.apps
time="2022-04-14T19:37:56Z" level=info msg="Backing up item" backup=openshift-adp/backup-busybox-3353-deployments logSource="pkg/backup/item_backupper.go:121" name=busybox-10 namespace=velero-test-busybox-1k resource=deployments.apps
time="2022-04-14T19:37:56Z" level=info msg="Executing custom action" backup=openshift-adp/backup-busybox-3353-deployments logSource="pkg/backup/item_backupper.go:327" name=busybox-10 namespace=velero-test-busybox-1k resource=deployments.apps
time="2022-04-14T19:37:56Z" level=info msg="[common-backup] Entering common backup plugin" backup=openshift-adp/backup-busybox-3353-deployments cmd=/plugins/velero-plugins logSource="/remote-source/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/common/backup.go:52" pluginName=velero-plugins
time="2022-04-14T19:37:56Z" level=info msg="[GetRegistryInfo] value from imagestream" backup=openshift-adp/backup-busybox-3353-deployments cmd=/plugins/velero-plugins logSource="/remote-source/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/common/shared.go:37" pluginName=velero-plugins

The "Executing custom action", "common backup" and "imagestream backup" lines do not appear in the "fast" backup types; they're ommitted entirely (verified on a Secrets-only backup). These look like they come from plugins, based on the description. The OADP instance uses three by default: aws, kubevirt, and openshift.

Current experiments remove each of the plugins individually to see whether the backup times are affected. Note that updating the OADP operator in place had no effect. Next attempt: recreate the OADP operator configuration with plugins removed. This space will be updated with results when they are available.
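For reference, the plugin list lives in the DataProtectionApplication spec. A trimmed sketch of the baseline configuration follows; the DPA name is illustrative and the backup storage configuration is omitted for brevity. The list under defaultPlugins is what the experiments vary.

```bash
cat <<'EOF' | oc apply -f -
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero-sample
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
      - aws
      - kubevirt
      - openshift
EOF
```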

tjanssen3 commented 2 years ago

It appears as though the openshift plugin is adding backup time to Deployments.

For future reference: if a user removes a plugin from the OADP operator configuration on an existing installation, the plugin will not actually be removed; the Velero/DataProtectionApplication instance has to be recreated. However, since that instance eventually becomes a pod, and I already had 3000+ pods on cluster7, the new pod couldn't be scheduled until I removed the excess Busybox pods. After recreating my Velero instance and recreating the Busybox pods so they were in the same state as the previous backups, I tested two additional plugin configurations. Recreating thousands of Busybox instances multiple times was tedious, but the information coming out of the experiments looks promising.
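A sketch of the recreation cycle, assuming the trimmed DPA from earlier; the DPA name and manifest file name are illustrative, with the manifest being the same DPA with defaultPlugins reduced to [aws, kubevirt]:

```bash
oc -n openshift-adp delete dataprotectionapplication velero-sample
oc -n openshift-adp create -f dpa-without-openshift-plugin.yaml

# Check which plugin init containers the recreated velero deployment loads:
oc -n openshift-adp get deployment velero \
  -o jsonpath='{.spec.template.spec.initContainers[*].name}'
```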

The initial Velero plugins were aws, kubevirt, and openshift, so all backup times up to this point include those three plugins; they were in the OKD documentation that was used to configure Velero in the first place. Removing the kubevirt plugin had no effect on backup time. Removing the openshift plugin changed backup times on Deployments from about 11 minutes and 45 seconds to 29 seconds for the same contents.

When comparing the logs of backups taken with and without the openshift plugin, the "extra" lines for Deployments were omitted in the latter. So it appears that the extra computation performed by the openshift plugin in those "extra" lines is the cause of the dramatic slowdown, at least for Deployments. ReplicaSets exhibited a similar speedup: 11 minutes 41 seconds down to 27 seconds without the openshift plugin. Pod backup time was also reduced, but not by as much: 11 minutes 41 seconds down to 52 seconds without the openshift plugin.
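The log comparison amounts to something like the following (the first backup name comes from the logs above, the second is illustrative):

```bash
# Count the per-item plugin invocations in each backup's log to confirm the
# extra steps are gone.
velero backup logs backup-busybox-3353-deployments        | grep -c 'Executing custom action'
velero backup logs backup-deployments-no-openshift-plugin | grep -c 'Executing custom action'

# Elapsed time is visible in the backup status:
velero backup describe backup-deployments-no-openshift-plugin | grep -E 'Started:|Completed:'
```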

It appears we've discovered a great way of speeding up backups of the slowest resources. The next thing I wanted to check was the effect on "fast" resources, so I leveraged the existing large (1.1MB) and small (4k) configmaps already present on cluster7. Small configmaps went from 28 seconds on the original installation to 9 seconds without the openshift and kubevirt plugins. Large configmaps went up slightly, from 2 minutes and 21 seconds on the original installation to 2 minutes and 29 seconds without openshift and kubevirt.

The question I have now is: what is the openshift plugin doing that causes this dramatic slowdown, and why does it need to be done to each object individually? Is there any benefit to using it? Further investigation is required.

tjanssen3 commented 2 years ago

Looks like there's an optimization possible for the OADP Openshift plugin, so we filed an issue. Details are here: "The common backup plugin is attempting to read the registry route from a configmap on every run. This is leading to significant performance problems at scale".

For now, I've disabled the Openshift plugin and am getting backup and restore data for 1k instances of the following resources: deployments, statefulsets, pods, small configmaps, large configmaps, services, secrets, service accounts, roles, rolebindings, pvcs, cluster roles, cluster role bindings, pvs, and CRs.
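A sketch of how that collection loop can be scripted; the namespace and resource list here are illustrative, and restores follow the same pattern with `velero restore create --from-backup <name>`:

```bash
#!/usr/bin/env bash
# Back up each resource type separately and record the wall-clock duration.
set -euo pipefail

NAMESPACE="velero-test"
RESOURCES="deployments.apps statefulsets.apps pods configmaps services secrets \
serviceaccounts persistentvolumeclaims roles.rbac.authorization.k8s.io \
rolebindings.rbac.authorization.k8s.io"

for r in ${RESOURCES}; do
  name="backup-${r//./-}-$(date +%s)"
  start=$(date +%s)
  velero backup create "${name}" \
    --include-namespaces "${NAMESPACE}" \
    --include-resources "${r}" \
    --snapshot-volumes=false \
    --wait
  end=$(date +%s)
  echo "${r}: $((end - start))s" | tee -a backup-times.txt
done
```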

WIP data will be kept in the existing spreadsheet; this is for a presentation next week. At this point, all objects exist on cluster7, and I'm going down the list of remaining backups and restores to collect the data we need.

Once the presentation is done, we can probably call this task complete and move on to more specific cases where we're missing data, if desired.

tjanssen3 commented 2 years ago

All the data I have was presented during the Red Hat deep dive and Fusion workshop the week of 4/25. Based on those meetings, we got some feedback on how to expand the testing to larger workloads. The OADP team has also opened a PR (https://github.com/openshift/openshift-velero-plugin/pull/133), which I've tested, to address the performance scaling issue we reported here (https://issues.redhat.com/browse/OADP-459). The new updates bring Deployment backup times in line with the "fast" resources (details in the PR). This change hasn't been merged yet, but is slated for release in OADP 1.0.3.

I think we can close out this issue and create another one for additional issues relating to further study or scaling performance, assuming @pmuench1717 @hatfieldbrian are both ok with this.