apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Integration testing on master #568

Open foxish opened 6 years ago

foxish commented 6 years ago

Elaborating on https://github.com/apache-spark-on-k8s/spark/pull/545#issuecomment-348267882: we need to separate out our integration testing so that it can run against any arbitrary branch, including master.

cc @apache-spark-on-k8s/contributors @ifilonenko @erikerlandson @ssuchter @kimoonkim

erikerlandson commented 6 years ago

are you thinking of factoring the IT into a separate repo?

foxish commented 6 years ago

I think that might be one way: have the integration tests built separately, as a project by themselves, and then invoke them against master. It might not be the most expedient approach - open to other suggestions as well.
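
Roughly, the invocation could look something like this sketch (the repo layout, distro filename, and Maven property here are all hypothetical, just to illustrate the shape of it):

```python
#!/usr/bin/env python
# Hypothetical driver script: build Spark from an arbitrary branch, then run
# the separately-built integration-test project against the result. The repo
# layout, distro filename, and Maven property are assumptions for illustration.
import subprocess

def run(cmd, cwd=None):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd, cwd=cwd)

# 1. Check out and package the Spark branch under test (e.g. apache master).
run(["git", "clone", "--branch", "master",
     "https://github.com/apache/spark.git", "spark"])
run(["./dev/make-distribution.sh", "--tgz", "-Pkubernetes"], cwd="spark")

# 2. Build the integration tests as their own project and point them at the
#    distribution produced above (the property name is made up).
run(["git", "clone",
     "https://github.com/apache-spark-on-k8s/spark-integration.git"])
run(["mvn", "clean", "test", "-Dspark.distro.tgz=../spark/spark-dist.tgz"],
    cwd="spark-integration")
```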

erikerlandson commented 6 years ago

One advantage of that would be the convenience of making separate branches, with different IT subsets targeted at the various upstreaming stages of functionality.

erikerlandson commented 6 years ago

Would we then have to re-integrate them into the upstream so it has the correct IT?

foxish commented 6 years ago

It might be desirable for upstream also to have the same decoupled integration testing, similar to YARN/Mesos.

erikerlandson commented 6 years ago

+1, that makes it a two-fer

erikerlandson commented 6 years ago

maybe even a 3-fer: it keeps more code out of the main upstream PRs

erikerlandson commented 6 years ago

cc @felixcheung

tnachen commented 6 years ago

Having it outside of the main repo makes it easier and faster to iterate, which is why we also did it for Mesos.

erikerlandson commented 6 years ago

I added a tracking issue for jenkins-infra: https://github.com/ucbrise/jenkins-infra/issues/86

felixcheung commented 6 years ago

when you say decoupled testing, same as YARN/mesos, do you mean the test and source will be completely outside of the Apache Spark git repo?

(because as far as I can see, there is no open source YARN integration tests for Spark, for example)

ssuchter commented 6 years ago

But I’d argue that not having OSS YARN integration tests is a bad thing; having some would have been good. In the current state, it’s hard for contributors who aren’t at a company with its own private test suite to be confident in their changes, especially significant ones. I think we can have some for K8s, and that’s a good thing.

As mentioned in the thread, it doesn’t have to be in the same repo. The disadvantage, though, is that the tests become harder to keep in sync: if we change Spark in a way that enables or requires an integration test change, that change becomes harder for every user of the integration tests.

Personally, I think the advantages of a separate repo outweigh the disadvantage I listed above. I think we should go for a separate repo.

Sean


ifilonenko commented 6 years ago

I agree with the separate repo. However, this will require a separate group to maintain and monitor that repo as well.

felixcheung commented 6 years ago

If it's about fast iteration, I'd say go for it. At some point, though, I agree with @ssuchter: keeping it open source (or even under the ASF), on the same branch, and in sync is very important.

foxish commented 6 years ago

I agree - we shouldn't look at it as a separate thing long term. Separating it out is expedient for now, to get some confidence in upstream by running tests against it. Eventually it should live alongside the rest of the code upstream, but I do think it needs some hardening before that, and we can't hit that for 2.3.

foxish commented 6 years ago

Created https://github.com/apache-spark-on-k8s/spark-integration.

mccheah commented 6 years ago

We probably want to use the testing version provided by https://github.com/apache-spark-on-k8s/spark/pull/521. There's one more change required to make this pass on Amplab Jenkins. I'll patch this branch and bring it up to speed with our mainline, then we should think about how we're going to move the code over.

Automation is something to consider as well. We probably want to run these tests in the separate repository on every commit to master that affects Kubernetes. We could run the suite on every commit to master in general as well.
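
As a rough illustration of the "only when a commit affects Kubernetes" gating, something like this could sit in front of the CI trigger (the path list and the script itself are assumptions, not existing tooling):

```python
# Sketch of the "commit affects Kubernetes" gate: exit 0 (run the suite) only
# when a commit touches k8s-related paths. The path list is an assumption.
import subprocess
import sys

K8S_PATHS = ("resource-managers/kubernetes/",)

def touches_kubernetes(commit):
    """Return True if `commit` changed any Kubernetes-related file."""
    out = subprocess.check_output(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit])
    return any(f.startswith(K8S_PATHS) for f in out.decode().splitlines())

if __name__ == "__main__":
    commit = sys.argv[1] if len(sys.argv) > 1 else "HEAD"
    sys.exit(0 if touches_kubernetes(commit) else 1)
```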

foxish commented 6 years ago

I think we can break the integration testing out further after #521 is merged.

The immediate next steps I see are:

For now, it would still depend on minikube. In the future, we can do even better.

This would enable running on other k8s clusters - for example, stress testing on GKE if we want to do that - and would also let other users test the release on their own clusters using our integration tests.
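
For instance, the test runner could take the target cluster as a parameter instead of assuming minikube; a minimal sketch (flag names are hypothetical, not the actual spark-integration options):

```python
# Sketch of cluster parameterization: default to minikube, but allow pointing
# the suite at any cluster (e.g. GKE). Flag names are hypothetical.
import argparse
import subprocess

parser = argparse.ArgumentParser(description="Run k8s integration tests")
parser.add_argument("--cluster-backend", choices=["minikube", "external"],
                    default="minikube")
parser.add_argument("--k8s-master",
                    help="API server URL when --cluster-backend=external")
args = parser.parse_args()

if args.cluster_backend == "minikube":
    # Resolve the API server from the local minikube VM (default port 8443).
    ip = subprocess.check_output(["minikube", "ip"]).decode().strip()
    k8s_master = "https://%s:8443" % ip
else:
    k8s_master = args.k8s_master  # e.g. a GKE or on-prem endpoint

print("Targeting k8s://" + k8s_master)
```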

cc @ifilonenko @mccheah

mccheah commented 6 years ago

Are we still planning to run integration tests in the Jenkins PR builder on master? Or was the plan to run the tests in our own automated system?

foxish commented 6 years ago

Not sure if we'll get that done in time. With this split, even if we don't get the integration tests checked in for Spark 2.3, or the Spark PRB changes, we'd still have the option to run them ourselves.

mccheah commented 6 years ago

@foxish I was thinking about this a little more and am concerned that this setup isn't too intuitive. (edit: the proposed setup being https://github.com/apache-spark-on-k8s/spark/issues/568#issuecomment-349492072)

What these semantics are basically saying is that the main repository depends on the tests. This communicates that the main repository is checking the correctness of the tests, not the other way around.

To illustrate this point, consider the situation where a developer writes a new integration test for a feature they just merged into master. The workflow:

  1. Developer writes unit tests and main feature in apache/master. They don't update the integration test version because integration tests aren't published yet. The change is merged.
  2. Developer writes and publishes integration tests that are broken for the feature.
  3. Developer updates main repository's integration test dependency to try the new tests.
  4. Developer finds the tests are wrong, so they have to publish a new integration test version.

This sequence of events makes it clear that the main repository is determining whether or not the integration tests are correct. But the more intuitive mental model is that the integration tests are validating the correctness of a given Spark artifact.

Consider also the situation when the main repository is changed such that the assumptions made by the given integration test version are invalid. In this workflow, the developer is blocked from merging into master before the new integration test version is published. But in between the time that the new integration tests are published and the main change merges, that new integration test version is only applicable for a version of the main repository that doesn't exist yet.

So I think we want the integration tests to pull down and test the repository under test, not the other way around.
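
To make that direction concrete, here's a rough Python sketch of a harness in which the Spark distribution is the *input* under test; the env var names, image, jar path, and conf key are illustrative assumptions, not actual spark-integration code:

```python
# Rough harness sketch: the Spark distribution is the artifact being
# validated. Submit SparkPi to the cluster and fail loudly on error.
import os
import subprocess

def test_spark_pi(spark_home, k8s_master, image):
    """Submit SparkPi against the cluster; check_call raises on failure."""
    subprocess.check_call([
        os.path.join(spark_home, "bin", "spark-submit"),
        "--master", "k8s://" + k8s_master,
        "--deploy-mode", "cluster",
        "--class", "org.apache.spark.examples.SparkPi",
        "--conf", "spark.kubernetes.container.image=" + image,
        "local:///opt/spark/examples/jars/spark-examples.jar",
    ])

if __name__ == "__main__":
    test_spark_pi(os.environ["SPARK_HOME"],   # the artifact being validated
                  os.environ["K8S_MASTER"],
                  os.environ.get("SPARK_IMAGE", "spark:latest"))
```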

ssuchter commented 6 years ago

So I think we want the integration tests to pull down and test the repository under test, not the other way around.

I definitely agree. This direction of dependency (the repo under test is unaware of what will test it; the repo containing the tests is aware of what it will be testing) lets you more easily make coordinated changes to the main repo and the integration tests.

Sean


foxish commented 6 years ago

So I think we want the integration tests to pull down and test the repository under test, not the other way around.

I agree with that, @mccheah. I wasn't proposing the opposite, but I guess it got confusing since I also went into a bit of the implementation in my comment. I don't think we want to "publish" the integration test jar as a separate entity. My point was that Jenkins could take care of building the latest integration tests (from wherever they live, which for now may be https://github.com/apache-spark-on-k8s/spark-integration) and the distro (from a particular PR), and then have the integration tests run against the distro. The point about jars was just to illustrate the separation and isn't really a necessity.

I don't mind having the integration tests actually pull down the repo and build it for now if that saves time. I think it's the same whether Jenkins does it or the test logic itself does.

mccheah commented 6 years ago

Ah, sorry for the misunderstanding there. I think we want the integration tests to target a git hash, on either a remote repository to clone or a local checkout for development iteration. But maybe we should also support testing against Spark distribution tarballs, because we can save build time in the integration-test CI by using a pre-published Spark distribution instead of building it ourselves.
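
Concretely, the test entry point could accept those three source modes along these lines (option names are illustrative, not the real spark-integration CLI):

```python
# Sketch of the three source modes: a git ref to clone and build, a local
# checkout, or a pre-built distribution tarball. Option names are made up.
import argparse
import subprocess
import tarfile

parser = argparse.ArgumentParser()
src = parser.add_mutually_exclusive_group(required=True)
src.add_argument("--spark-ref", help="git ref in apache/spark to clone and build")
src.add_argument("--spark-checkout", help="path to a local Spark checkout")
src.add_argument("--spark-tgz", help="pre-built Spark distribution tarball")
args = parser.parse_args()

if args.spark_tgz:
    # Fastest path for CI: unpack a published distribution, no build needed.
    with tarfile.open(args.spark_tgz) as t:
        t.extractall("spark-dist")
else:
    checkout = args.spark_checkout
    if args.spark_ref:
        subprocess.check_call(
            ["git", "clone", "https://github.com/apache/spark.git", "spark"])
        subprocess.check_call(["git", "checkout", args.spark_ref], cwd="spark")
        checkout = "spark"
    subprocess.check_call(
        ["./dev/make-distribution.sh", "--tgz", "-Pkubernetes"], cwd=checkout)
```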