apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

`spark.comet.memory.overhead.min` not respected when submitting jobs with Comet with Spark on Kubernetes #605

Open lmouhib opened 3 weeks ago

lmouhib commented 3 weeks ago

Describe the bug

Currently, when submitting a job on Kubernetes, the total memory of the driver or executor pod is the sum of the memory defined in the Spark configuration and the overhead (`spark.{executor|driver}.memoryOverhead` + `spark.{executor|driver}.memory`). This sum does not include the additional overhead defined by `spark.comet.memory.overhead.factor`.
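For illustration, a minimal sketch of how the pod memory request is derived today, assuming Spark's documented defaults for `spark.executor.memoryOverhead` (a 0.1 factor with a 384 MiB floor); the object and method names here are mine, not Spark's:

```scala
// Illustrative sketch only; not Spark's actual Kubernetes resource-builder code.
object PodMemorySketch {
  val MinOverheadMiB = 384L // Spark's documented floor for memoryOverhead

  // Pod memory = executor memory + Spark overhead. Note that the Comet
  // overhead (spark.comet.memory.overhead.factor / .min) is absent.
  def podMemoryMiB(executorMemoryMiB: Long,
                   explicitOverheadMiB: Option[Long],
                   overheadFactor: Double = 0.1): Long = {
    val overhead = explicitOverheadMiB.getOrElse(
      math.max((overheadFactor * executorMemoryMiB).toLong, MinOverheadMiB))
    executorMemoryMiB + overhead
  }

  def main(args: Array[String]): Unit = {
    // A 4 GiB executor with default overhead yields a 4096 + 409 = 4505 MiB pod.
    println(podMemoryMiB(4096, None))
  }
}
```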

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

comphead commented 3 weeks ago

That's true, thanks for raising this. Spark's Kubernetes pods respect the Spark memory params but currently know nothing about the Comet params. Once a pod's memory is allocated, it cannot be changed. Since Comet has no control over the Spark params, I'm inclined to fold `spark.comet.memory.overhead.factor` into Spark's `memoryOverhead`, but I'm open to other proposals.
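A rough sketch of that direction, assuming the Comet key names quoted in this issue; the helper and the Comet default values below are illustrative, not confirmed:

```scala
// Illustrative only: fold the Comet overhead into
// spark.executor.memoryOverhead before the executor pods are requested.
import org.apache.spark.SparkConf

object FoldCometOverhead {
  def apply(conf: SparkConf): SparkConf = {
    val execMemMiB = conf.getSizeAsMb("spark.executor.memory", "1g")
    // Spark's default overhead: max(0.1 * executor memory, 384 MiB).
    val sparkOverheadMiB = conf.getSizeAsMb("spark.executor.memoryOverhead",
      s"${math.max((0.1 * execMemMiB).toLong, 384L)}m")
    // The Comet defaults below are assumptions for this sketch.
    val cometFactor = conf.getDouble("spark.comet.memory.overhead.factor", 0.2)
    val cometMinMiB = conf.getSizeAsMb("spark.comet.memory.overhead.min", "384m")
    val cometOverheadMiB =
      math.max((cometFactor * execMemMiB).toLong, cometMinMiB)
    conf.set("spark.executor.memoryOverhead",
      s"${sparkOverheadMiB + cometOverheadMiB}m")
  }
}
```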

lmouhib commented 3 weeks ago

I am inclined to include it in the Spark overhead; however, I don't understand why it isn't taken into consideration here. That line of code should produce the new value of the memory overhead, which is then passed to Spark and the resource manager to build the pod template used to create the pod.

comphead commented 3 weeks ago

I think what might happen is that the executor pod has already started by the time Comet tries to tweak the memory, and once the pod is up, the allocated memory size cannot be changed. What Comet does is correct, but I have a feeling it needs to happen earlier.

viirya commented 3 weeks ago

If CometDriverPlugin is loaded, it should overwrite the executor overhead with the existing executor overhead + the Comet overhead.
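For context, a driver plugin can rewrite the conf in its `init` hook while the SparkContext is starting up, before executors are requested. A hedged sketch of the behavior described above (the class name and the Comet factor are illustrative, not Comet's actual implementation):

```scala
// Illustrative sketch, not CometDriverPlugin's actual code. The class is
// declared inside org.apache.spark so it can reach the SparkContext's
// mutable internal conf (SparkContext.getConf only returns a copy), which
// is presumably why the real plugin lives at org.apache.spark.CometPlugin.
package org.apache.spark

import java.util.{Collections, Map => JMap}
import org.apache.spark.api.plugin.{DriverPlugin, PluginContext}

class OverheadBumpingDriverPlugin extends DriverPlugin {
  override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
    val conf = sc.conf // private[spark]: visible because of our package
    val execMemMiB = conf.getSizeAsMb("spark.executor.memory", "1g")
    val currentOverheadMiB = conf.getSizeAsMb("spark.executor.memoryOverhead",
      s"${math.max((0.1 * execMemMiB).toLong, 384L)}m")
    // The Comet factor here is an assumed value for the sketch.
    val cometOverheadMiB = (0.2 * execMemMiB).toLong
    // Existing executor overhead + Comet overhead, as described above.
    conf.set("spark.executor.memoryOverhead",
      s"${currentOverheadMiB + cometOverheadMiB}m")
    Collections.emptyMap[String, String]()
  }
}
```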

lmouhib commented 3 weeks ago

We need to document CometDriverPlugin; it is not mentioned in the configurations or on any page of the documentation.

I did set `spark.plugins` to `org.apache.spark.CometPlugin`, and still did not observe any change in the executor container memory configuration.
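For reference, roughly how I enabled it, plus a read-back of the effective overhead (the session-builder form is just for illustration):

```scala
// Enable the plugin quoted above, then read back the effective overhead.
import org.apache.spark.sql.SparkSession

object PluginCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("comet-overhead-check")
      .config("spark.plugins", "org.apache.spark.CometPlugin")
      .getOrCreate()

    // Inspect what the driver's conf ended up with.
    val effective = spark.sparkContext.getConf
      .getOption("spark.executor.memoryOverhead")
    println(s"effective spark.executor.memoryOverhead = $effective")

    spark.stop()
  }
}
```

Even if the conf read-back shows a bumped value, the memory limit in the executor pod spec is what actually matters, so inspecting the pod itself is the authoritative check.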

comphead commented 3 weeks ago

I'll run some tests on this today

lmouhib commented 2 weeks ago

How is the documentation for the configurations generated? The only mention of the plugin is in the code base itself. I can open a PR to add the conf if the documentation is written by hand rather than auto-generated.

comphead commented 1 week ago

Depends on https://github.com/apache/datafusion-comet/issues/643

comphead commented 2 days ago

Depends on #689