datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0

Reduce DC build time and complexity by getting rid of Scala and Spark #1890

Open kaspersorensen opened 2 years ago

kaspersorensen commented 2 years ago

Hi all,

I am picking up DC development for a bit, after a long hiatus, and coming back to this project makes me realize how long and complex our build is. I would like to make the DC build (and thereby the overall developer experience) much nicer by simplifying it. Right now I am spending a lot of time just getting it to compile on a fresh installation, and the main culprit is something I've noticed before: Scala, and to some extent also the Spark module. So I would suggest simplifying the developer experience by:

  1. Removing our dependency on Scala.
  2. Removing the Spark module, or converting it into a separate GitHub project/extension.
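For step 1, the mechanical part would mostly be a pom.xml exercise. A minimal sketch of what would get deleted, assuming the affected modules use the common scala-maven-plugin setup (the actual coordinates and versions in the DC poms may differ):

```xml
<!-- Sketch only: the Scala runtime dependency to remove... -->
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
</dependency>

<!-- ...and the compiler plugin that wires .scala sources into the build -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
</plugin>
```

The .scala sources themselves would of course also need to be deleted or ported to Java.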
kaspersorensen commented 2 years ago

I have prepared a branch to illustrate what would be removed...

https://github.com/datacleaner/DataCleaner/compare/remove-scala?expand=1

LosD commented 2 years ago

The branch looks more or less good to me, although it's been so long since I even looked at Java that I'm not sure I'd take my word for it. 😊

Buuuut it's mostly code removal, so as long as it builds and all the runners still work, I guess we're good. I can't test right now, but I'll try to see if I can get some time for it later today or tomorrow.

However, I don't see deletions of the .scala files themselves? But okay, it's just an example, so I guess it doesn't matter for now.

LosD commented 2 years ago

Regarding pulling it into its own extension: I guess it would take quite a bit of refactoring to allow runners in extensions, especially ones that need to change the system in such a major way? If I remember correctly (which is not a given), we/you tried something like that originally, but ended up with the current design because such a fundamental change to running just got too hard without quite a bit of coupling. But maybe the Scala parts themselves could be kept in the extension, while the base of the runner stayed here? Admittedly that DOES sound like a bit of a strange design, but if we think the Spark runner still has value to users, it would be a shame to lose it.

kaspersorensen commented 2 years ago

I just realized that the branch is in no way ready to go :) There are a bunch of components that are simply no longer included; I guess we just didn't have integration tests for those, but they would disappear from the product if we merged that branch. I think they're not too hard to reproduce, though, so that's definitely the next step if we want to complete this issue.

Regarding the Spark runner: I agree it's probably not going to be easy to make it a proper extension. I was more thinking that we could make it a separate distribution, a bit like datacleaner-docker or whatever. A distribution that would include its own Main class and would only be built to work with Spark.
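As a sketch of that separate-distribution idea (all artifact and class names below are invented for illustration, not actual DC modules), it could be a stand-alone Maven module that shades a Spark-only entry point into its own jar:

```xml
<!-- Hypothetical pom.xml for a Spark-only distribution module -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.datacleaner</groupId>
  <artifactId>datacleaner-spark-distribution</artifactId>
  <version>1.0-SNAPSHOT</version>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <!-- hypothetical Spark-specific entry point -->
                  <mainClass>org.datacleaner.spark.Main</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```

That way the main build never touches Spark, and the Spark artifact can be versioned and released on its own cadence.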

I mean, the other thing is that Spark has moved on massively since this was built. I think everything will break and have to be partially rewritten if we just upgrade Spark to the latest version. But I think it's time it got upgraded somehow.

LosD commented 2 years ago

Ah yeah, I remember most of the Spark components being reasonably simple.

How about I finally get back to contributing (and re-jiggle my Java experience a bit) by taking at least some of them on? But it might be a few days before I get started.

kaspersorensen commented 2 years ago

Yeah, give it a shot! I'm ready to cheer you on! For my part, I'm then going to look into some of the more simple-n-stupid Scala-to-Java conversions in the non-Spark areas of the code, like the HTML rendering module and more.
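For a sense of what such a simple-n-stupid conversion can look like, here is a sketch (the `Metric` type and the surrounding class are invented for the example, not real DC classes): a typical Scala groupBy/sum one-liner and its Java streams equivalent.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ScalaToJavaSketch {

    // Scala: case class Metric(name: String, value: Int)
    record Metric(String name, int value) {}

    // Scala: metrics.groupBy(_.name).mapValues(_.map(_.value).sum)
    static Map<String, Integer> totalsByName(List<Metric> metrics) {
        return metrics.stream()
                .collect(Collectors.groupingBy(Metric::name,
                        Collectors.summingInt(Metric::value)));
    }

    public static void main(String[] args) {
        List<Metric> metrics = List.of(
                new Metric("rows", 10),
                new Metric("rows", 5),
                new Metric("nulls", 2));
        System.out.println(totalsByName(metrics));
    }
}
```

Collections-heavy Scala like this usually maps straight onto `java.util.stream`; the trickier conversions tend to be the ones leaning on traits and implicits.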