duyanghao opened this issue 7 years ago
PySpark has more memory overhead than Spark in Java or Scala because Python objects are stored off the JVM's heap. One probably has to increase the executor and driver memory overhead so that the pods have enough RAM for these objects.
@duyanghao If memoryOverhead is not set properly, the JVM will eat up all the memory and not leave enough for PySpark to run. This is solved by increasing the driver and executor memory overhead. I would recommend looking at this talk, which elaborates on the reasons PySpark hits OOM issues. I believe this is more a matter of Spark Core tuning, regardless of the resource manager.
@duyanghao status on this issue?
@ifilonenko @mccheah I tried increasing the overhead as you suggested, but it still fails as below:
submit with memoryOverhead:
--conf spark.driver.memory=1024m \
--conf spark.driver.cores=2 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=1024m \
--conf spark.executor.cores=2 \
--conf spark.kubernetes.driver.memoryOverhead=10240m \
--conf spark.kubernetes.executor.memoryOverhead=10240m \
...
driver and executor pod resources:
Limits:
cpu: 1
memory: 11Gi
Requests:
cpu: 1
memory: 1Gi
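(For reference, the pod limit appears to be spark.driver.memory plus the configured overhead: 1024m + 10240m ≈ 11Gi, while the 1Gi request appears to correspond to spark.driver.memory alone.)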
The driver docker container aborts with OOM (137) as below (dmesg_all):
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 137
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:103)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
[6928402.630093] python invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=841
...
[61086.851530] Task in /docker/73792fab2937c40a01556757fcd1381a871b97d4c9402e7c3320802a91d1ef04 killed as a result of limit of /docker/73792fab2937c40a01556757fcd1381a871b97d4c9402e7c3320802a91d1ef04
[61086.851531] memory: usage 11534336kB, limit 11534336kB, failcnt 361261
...
[61086.851599] Memory cgroup out of memory: Kill process 29052 (python) score 1920 or sacrifice child
[61086.851600] Killed process 28675 (python) total-vm:78213124kB, anon-rss:11494652kB, file-rss:2768kB
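(Reading the dmesg output: the cgroup limit of 11534336 kB is exactly the 11Gi pod limit, and the killed python process reports anon-rss of 11494652 kB, so the Python process' resident memory alone is essentially at that limit.)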
Not sure I follow correctly, but in your first example didn't you set it to 20 GB, instead of 11 GB as in your last try?
@tnachen Here is the 20 GB test:
memoryOverhead:
--conf spark.driver.memory=1024m \
--conf spark.driver.cores=2 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=1024m \
--conf spark.executor.cores=2 \
--conf spark.kubernetes.driver.memoryOverhead=20480m \
--conf spark.kubernetes.executor.memoryOverhead=20480m \
...
Limits:
cpu: 1
memory: 21Gi
Requests:
cpu: 1
memory: 1Gi
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 137
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:103)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
[166152.253682] python invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=985
...
[166152.253782] Task in /docker/de2e07bd8155692babd2ea6fb2e37b62ce2bcaeace2d463f803430c16eb80b40 killed as a result of limit of /docker/de2e07bd8155692babd2ea6fb2e37b62ce2bcaeace2d463f803430c16eb80b40
[166152.253785] memory: usage 22020096kB, limit 22020096kB, failcnt 121663
...
[166152.253887] Memory cgroup out of memory: Kill process 17989 (python) score 1910 or sacrifice child
[166152.253912] Killed process 17612 (python) total-vm:78213132kB, anon-rss:21991236kB, file-rss:2768kB
@mccheah @ifilonenko @tnachen Any suggestions for this problem?
So one thing I wish we had done in Spark on YARN is bumping up the overhead automatically when we are running Python code (it still requires tuning for some cases, but reasonable defaults ftw). What do you think @ifilonenko? (I have a goal of being able to take that part out of my standard Python Spark talks in a year, but that might be unwarranted optimism :p :)).
@holdenk That definitely makes sense considering how much memory the JVM eats up. What would be a reasonable "bumped-up" overhead amount? Should it be calculated from cluster configurations, from user-passed configs, or just be a hard-coded value that we set?
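To make the discussion concrete, here is a minimal sketch of what such a default could look like. The 0.10 factor and 384 MiB floor mirror the YARN-style defaults; the 0.40 non-JVM factor, the object name, and the method signature are purely illustrative assumptions, not code from this repo:

// Sketch: derive a default memoryOverhead, using a larger factor for PySpark apps.
object MemoryOverheadDefaults {
  val MinOverheadMiB = 384          // YARN-style floor
  val JvmOverheadFactor = 0.10      // YARN-style default for JVM apps
  val NonJvmOverheadFactor = 0.40   // illustrative bumped-up default for Python

  def defaultOverheadMiB(containerMemoryMiB: Int, isPython: Boolean): Int = {
    val factor = if (isPython) NonJvmOverheadFactor else JvmOverheadFactor
    math.max((factor * containerMemoryMiB).toInt, MinOverheadMiB)
  }
}

// Example: a 1024 MiB PySpark executor would get max(0.40 * 1024, 384) = 409 MiB
// of overhead unless the user sets the overhead explicitly.

Presumably a user-supplied spark.kubernetes.{driver,executor}.memoryOverhead would still override the computed value; the open question is only what the default should be when nothing is set.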
@holdenk @ifilonenko what do you guys think about the above Python OOM problem?
pi.py: when I run Spark Pi using pyspark with par 100000, the driver aborts as below. And the driver docker container aborts with OOM (137) as below (dmesg_all). Addition: both driver and executor are allocated 2 cores and 20G memory, as below.

SparkPi.scala: but I can successfully run Spark Pi using the scala jar with the same par 100000, and the resources allocated are much less than those of pyspark, as below.

I am not familiar with pyspark, but will pyspark incur a substantial performance (especially memory) overhead?