Closed emmyoop closed 1 year ago
As one of the original authors, I'd love to help with moving to a newer (and dbt-labs maintained) Docker image.
Update: I spent a good amount of time trying to set up Thrift with an Apache Spark Docker image. I've gotten pretty close, and I have a Thrift server set up.
But it seems like the really hard part is the networking across the different nodes. I haven't been able to get that working, which leads me to believe the "T-shirt size" of this initiative is going to be closer to a hard-complexity item. It's really just a matter of learning enough about how these items are scoped to pick the right path forward.
Incidentally, I think the time it would take to stabilize the existing Docker image is just not worth it compared to digging in deeply: as I write the new Docker image, I am seeking to clarify specific design decisions and make it so that anyone can learn what is going on in the runtime. To me, that is a far more impactful piece of work than just trimming the edges around the existing image.
I'll be talking more about this in our sync tomorrow.
High Level Task
Can we use an alternative to godatadriven/pyspark:3.1?
Acceptance Criteria
godatadriven/pyspark:3.1 is replaced with something maintained by Apache Spark.
Details
Can we use official apache/spark containers instead?
Can we try swapping them out with the current CircleCI setup? Maybe it just works 😬
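If we do try the swap, one cheap way to tell "it just works" apart from a networking problem is to check from the test container whether the Thrift port is reachable before running anything else. Below is a minimal sketch, not part of the current setup. The helper name `wait_for_port` and the `spark-thrift` host name in the usage comment are hypothetical, and port 10000 is assumed because it is the usual Thrift server default.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves the port is open and reachable;
            # it does not prove the Thrift server is ready to accept queries.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers connection refused, unreachable host, and DNS failures.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical usage in a CI step (host name is an assumption):
#   wait_for_port("spark-thrift", 10000, timeout=120)
```

A `False` result here would point at container networking (service name, exposed port) rather than at dbt or the adapter, which is the part that has been hard to pin down so far.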
Additional Notes
More details available in #386.
It's important to note we may want to just continue with our current pattern if this is a big lift.