apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Use paths to read small local files instead of URIs #477

Closed mccheah closed 7 years ago

foxish commented 7 years ago

@mccheah, can you add a bit more of a description to the PR here? Not sure what the previous issue was and what this solves.

ash211 commented 7 years ago

When we tried to use this small files feature in our app, we saw this exception in the submission client:

Exception in thread "main" java.io.FileNotFoundException: file:/path/to/logback.xml (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.spark_project.guava.io.Files$FileByteSource.openStream(Files.java:124)
    at org.spark_project.guava.io.Files$FileByteSource.openStream(Files.java:114)
    at org.spark_project.guava.io.ByteSource.read(ByteSource.java:220)
    at org.spark_project.guava.io.Files$FileByteSource.read(Files.java:141)
    at org.spark_project.guava.io.Files.toByteArray(Files.java:355)
    at org.apache.spark.deploy.kubernetes.submit.submitsteps.MountSmallLocalFilesStep$$anonfun$3.apply(MountSmallLocalFilesStep.scala:46)
    at org.apache.spark.deploy.kubernetes.submit.submitsteps.MountSmallLocalFilesStep$$anonfun$3.apply(MountSmallLocalFilesStep.scala:45)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.deploy.kubernetes.submit.submitsteps.MountSmallLocalFilesStep.configureDriver(MountSmallLocalFilesStep.scala:45)
    at org.apache.spark.deploy.kubernetes.submit.Client$$anonfun$run$1.apply(Client.scala:93)
    at org.apache.spark.deploy.kubernetes.submit.Client$$anonfun$run$1.apply(Client.scala:92)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.kubernetes.submit.Client.run(Client.scala:92)
    at org.apache.spark.deploy.kubernetes.submit.Client$$anonfun$run$5.apply(Client.scala:189)
    at org.apache.spark.deploy.kubernetes.submit.Client$$anonfun$run$5.apply(Client.scala:182)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2566)
    at org.apache.spark.deploy.kubernetes.submit.Client$.run(Client.scala:182)
    at org.apache.spark.deploy.kubernetes.submit.Client$.main(Client.scala:202)
    at org.apache.spark.deploy.kubernetes.submit.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:772)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
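
The trace shows Files.toByteArray (Spark's shaded Guava) receiving the URI string file:/path/to/logback.xml and treating it as a literal path. A minimal sketch of that failure mode, using plain Guava and a hypothetical path:

    import java.io.File
    import com.google.common.io.Files  // the trace uses the shaded copy, org.spark_project.guava.io.Files

    object UriAsPathRepro {
      def main(args: Array[String]): Unit = {
        // SparkSubmit hands the submission client a URI string rather than a plain path.
        // java.io.File treats the whole string, scheme included, as a literal path,
        // so the read fails even when /path/to/logback.xml exists on disk.
        val uriString = "file:/path/to/logback.xml"  // hypothetical example
        Files.toByteArray(new File(uriString))       // throws java.io.FileNotFoundException
      }
    }
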
foxish commented 7 years ago

Thanks for elaborating, @ash211. The change looks confined to that submission step, which seems like a good approach. Perhaps we should still look at making the init container terminate more quickly? While the small-files fix is expedient, I think it might not be usable for the general case given its limitations.

mccheah commented 7 years ago

rerun integration tests please

ash211 commented 7 years ago

Qualitatively, running init containers on our newest clusters (1.7.2 based, I think) feels faster, but I don't have benchmarks to quantify it.

mccheah commented 7 years ago

This happens because SparkSubmit translates all paths into URIs before passing them along to the submission client implementation. We were just providing paths, but they arrive at the small files step with the file:// scheme.
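
A sketch of the direction of the fix, resolving the URI string back to a local filesystem path before reading; the names here are hypothetical and the actual patch may be structured differently:

    import java.io.File
    import java.net.URI
    import com.google.common.io.Files

    // Hypothetical helper illustrating the idea: strip the file: scheme that
    // SparkSubmit added and read the file through its plain path instead.
    object SmallLocalFileReader {
      def readBytes(submitted: String): Array[Byte] = {
        // Both "file:/path/to/logback.xml" and "/path/to/logback.xml" resolve to
        // the path "/path/to/logback.xml", which java.io.File can open directly.
        val localPath = new URI(submitted).getPath
        Files.toByteArray(new File(localPath))
      }
    }

Spark's own Utils.resolveURI performs a similar normalization for strings that may or may not carry a scheme, so the actual change may lean on that rather than raw java.net.URI.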