FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

Multi-Spark-Version-Test #30

Open · FRosner opened this issue 9 years ago

FRosner commented 9 years ago

Description

MSVT (Multi-Spark-Version-Test) ftw! We need to understand what it takes, try whether it works, and decide whether it is worth the effort.

References

JoshRosen commented 8 years ago

Just curious: what pain points have you encountered that require you to publish different artifacts for different Spark versions? For Databricks' own spark-redshift library, I've configured our tests to build against a single version of Spark and then run the tests against multiple runtime versions of Spark, simulating the experience that users would have if you published a single non-Spark-version-specific artifact to Maven. Of course, the spark-redshift approach might be difficult if your library relies on internal private APIs that change across Spark releases.
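As a rough illustration of that setup, here is a minimal build.sbt sketch; the property name, the version numbers, and the force() trick are illustrative assumptions, not the actual spark-redshift build:

// Compile against one fixed Spark version, but let CI override the version
// the tests run against, e.g. with `sbt -Dspark.test.version=1.5.2 test`.
val sparkCompileVersion = "1.3.0"
val sparkTestVersion    = sys.props.getOrElse("spark.test.version", sparkCompileVersion)

libraryDependencies ++= Seq(
  // The published artifact only ever sees the oldest supported API ...
  "org.apache.spark" %% "spark-sql" % sparkCompileVersion % "provided",
  // ... while the test classpath gets whichever runtime version was requested.
  ("org.apache.spark" %% "spark-sql" % sparkTestVersion % "test").force()
)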

FRosner commented 8 years ago

@JoshRosen thanks for the comment. I was thinking about the same thing, i.e. developing against one version and testing against others. The spark-packages site publishes API compatibility indicators for different Spark versions when you look at a release of a package. This might help here as well.

The only way to truly support a multi-version build is to maintain multiple versions of the code (probably through branches or something), isn't it? But I am not sure whether it is worth the effort.

The reason why I considered doing it was that most of the SQL API is still marked as experimental or alpha (especially in 1.2.x and 1.3.x). On the common enterprise Hadoop distributions you are often stuck with an older version, while in a testing environment you might already have a newer one available. But looking at the recent changes in Spark I see that the API is maturing.

So given that this project is rather simple and mostly uses well-documented APIs from Spark SQL, a multi-version build might not be necessary. However, another project of mine, https://github.com/FRosner/spawncamping-dds, uses some very special internal APIs (like ScalaReflections stuff), so it will surely break sooner or later.

This ticket was mainly about understanding what it takes and starting a discussion. So thanks for your response! What do you think?

JoshRosen commented 8 years ago

I wouldn't read too much into spark-packages' API compatibility indicators; the methodology used to generate them isn't very precise: it would flag your project as incompatible with Spark 1.3.0, for instance, if you built against 1.4.0 but only used APIs that were already present in 1.3.0. Personally, I'd be in favor of just removing those compatibility indicators until we have a more precise checker that takes libraries' actual API usage into account (I have a side project to develop such a checker).

If I had to maintain a multi-version build, what I'd do is to hide the version-specific code behind some sort of facade / interface, then create multiple source trees in the same Git branch which contain version-specific implementations of that interface (the non-version-specific code, which is the bulk of the library, would still only be present once in the source tree). I'd then use some SBT magic to automatically reconfigure the compatibility layer's source directory based on the Spark version. In my opinion, that approach will be infinitely easier to maintain than one which requires separate Git branches.
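A minimal sketch of that SBT wiring, assuming one extra source tree per supported Spark line (the directory names and the version switch are made up for illustration):

// Pick the compatibility layer's source tree based on the Spark version
// we are building against.
val sparkBuildVersion = sys.props.getOrElse("spark.version", "1.4.1")

unmanagedSourceDirectories in Compile += {
  val base = (sourceDirectory in Compile).value  // usually src/main
  // src/main/scala-spark-1.3 and src/main/scala-spark-1.4 each hold an
  // implementation of the same small compatibility interface.
  if (sparkBuildVersion.startsWith("1.3.")) base / "scala-spark-1.3"
  else base / "scala-spark-1.4"
}

The shared code only ever compiles against the interface, so only the small version-specific directory changes between Spark versions.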

For this particular library, I'd probably go with an approach where you compile against the minimum-supported Spark version and run tests against both that version and against newer versions. I probably wouldn't aim to support any versions prior to Spark 1.3, since maintaining a non-DataFrame version of this project seems like a huge pain.

FRosner commented 8 years ago

Agreed, thanks for your opinion. I am busy with some other projects at the moment but will hopefully be able to work on publishing DDQ as a Spark package in one or two weeks. As long as there is no imminent need for multi-version support, I will just support the version that I am using.

FRosner commented 8 years ago

// Read the Spark version to build against from the spark.version system
// property, falling back to the default sparkVersion setting.
specifiedSparkVersion := sys.props.getOrElse("spark.version", sparkVersion.value)
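With that setting, the build can presumably be pointed at a particular Spark version via a system property, e.g. sbt -Dspark.version=1.3.1 test, while a plain sbt test falls back to whatever sparkVersion defaults to (presumably the sbt-spark-package plugin's setting).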

FRosner commented 8 years ago

The build is probably breaking because of this: https://docs.travis-ci.com/user/speeding-up-the-build/