apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Keep Minikube running forever on the Jenkins machine #394

Open mccheah opened 7 years ago

mccheah commented 7 years ago

It's unclear how the build system copes with the fact that every integration test run starts and stops the Minikube VM. For example, if two tests run concurrently, it is theoretically possible that one test could start the Minikube VM and the other could stop it before the first test has finished using it. Modifying global state this way seems relatively risky and flake-prone, although it remains to be seen whether this is the cause of any specific build flakes in this fork's integration tests.

We can remove the risk by:

A more robust approach would be to use a remote Kubernetes cluster as opposed to Minikube so that we can scale the cluster given the demands of the builds that are running, but the stopgap described above would be a smaller change to the existing environment.
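To make the start/stop race concrete, here is a minimal sketch of one way to guard a shared VM with a reference count, so that it is only stopped when the last user is done. This is not the stopgap proposed in this issue and not how the Jenkins job works; `SharedVm` and its `start`/`stop` callbacks are hypothetical names, and the guard only protects threads inside a single process (separate Jenkins jobs would need something like a lock file instead).

```python
import threading


class SharedVm:
    """Reference-counted guard around a shared VM (hypothetical helper)."""

    def __init__(self, start, stop):
        # `start` and `stop` are caller-supplied callables, for example
        # wrappers that shell out to `minikube start` / `minikube stop`.
        self._start = start
        self._stop = stop
        self._lock = threading.Lock()
        self._users = 0

    def __enter__(self):
        with self._lock:
            if self._users == 0:
                self._start()  # first user boots the VM
            self._users += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._users -= 1
            if self._users == 0:
                self._stop()  # last user shuts it down
        return False  # never swallow exceptions from the test body
```

Each test would wrap its body in `with vm:`; a second test entering while the first is still running neither restarts the VM nor stops it on exit while the first still holds it.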

ifilonenko commented 7 years ago

+1. For large images, it would make sense to build or pull them once and leave them in place, e.g. the large Hadoop images for the Kerberos testing environment. Thoughts? @varunkatta @ssuchter

varunkatta commented 7 years ago

The current Jenkins setup runs exactly one integration test at a time. If an integration test is already running, any new test will be queued. So, there is no issue of multiple integration tests stomping on each other.

Starting and stopping Minikube itself is fast and has low overhead. Given the current volume and frequency of PRs, the turnaround time for an integration test result after a PR submission is quite OK. Is there something urgent we need to address here?

mccheah commented 7 years ago

I wasn't aware of the queueing, but I think forcing integration tests to queue is itself a bit bothersome because it slows down verification in general. Also, for our work on Kerberos integration testing we will be repeatedly deploying the Hadoop Docker image, which is static and should be cached on the Minikube instance between runs. That means we shouldn't be deleting the Minikube VM between runs, though perhaps we can change from delete to stop without much effort.

ash211 commented 7 years ago

I haven't noticed the minikube start/stop being the expensive part of the integration tests; it seems to be more the running of the integration tests themselves.

I wouldn't want to invest in speeding up part of the whole integration test Jenkins job (the minikube startup) without confirming that it's a heavy contributor to overall job timing.
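One way to get that confirmation would be to time each phase of the job and compare shares. The following is only a rough sketch: the phase names, the `timed` helper, and `share_of_total` are hypothetical and not part of the existing Jenkins job.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, results, clock=time.monotonic):
    """Add the wall-clock duration of the enclosed block to results[phase]."""
    start = clock()
    try:
        yield
    finally:
        results[phase] = results.get(phase, 0.0) + (clock() - start)


def share_of_total(results):
    """Return each phase's fraction of the total measured time."""
    total = sum(results.values())
    return {phase: (secs / total if total else 0.0)
            for phase, secs in results.items()}
```

Wrapping the Minikube startup and the test run in separate `with timed(...)` blocks and printing `share_of_total(results)` at the end of the job would show whether startup is a meaningful fraction of the total.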

mccheah commented 7 years ago

This can perhaps be deferred until after we start using Hadoop in the integration tests, then, but we don't want to pull the Hadoop image every time - that's 1.3GB or so of data we would prefer to keep cached.
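As a sketch of the caching idea, a test setup step could pull the image only when the Docker daemon does not already have it. This assumes the VM (and therefore its image cache) survives between runs and that the `docker` CLI is pointed at the Minikube VM's daemon, for example via the environment printed by `minikube docker-env`. `ensure_image` is a hypothetical helper and the image name in the usage note is a placeholder.

```python
import subprocess


def ensure_image(image, run=subprocess.run):
    """Pull `image` only if it is not cached locally.

    Returns True if a pull was performed, False if the image was cached.
    `run` is injectable so the logic can be exercised without Docker.
    """
    # `docker image inspect` exits non-zero when the image is not present.
    inspect = run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if inspect.returncode == 0:
        return False
    run(["docker", "pull", image], check=True)
    return True
```

For example, `ensure_image("example/hadoop:test")` would skip the download on every run after the first, as long as the VM is stopped rather than deleted between runs.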