apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

documentation on resource staging server #386

Open luck02 opened 7 years ago

luck02 commented 7 years ago

The docs at https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dependency-management don't do a good job of describing what this is or how it works. I've also read / searched https://docs.google.com/document/d/1_bBzOZ8rKiOSjQg78DXOA3ZBIo_KkDJjqxVuq0yXdew/edit#heading=h.22iurepifhgt

From an end user standpoint this is problematic as I am not sure which problems this is intended to solve. If there are fuller docs I'd be happy to edit what's currently there.

This came up when I was trying to figure out the best way to load environment variables: I was trying to decide whether it was best to bake them into the images or to provide them some other way (Kubernetes secrets / config maps).

Thanks!

luck02 commented 7 years ago

On further reflection, there doesn't seem to be a way to use config maps / secrets. We'd want some way to inject env variables without baking them into the images, so as to support the same image in different environments. Is there a mechanism in the resource staging server that provides for that? If not, it could be useful to provide a more Kubernetes-specific approach, i.e. the ability to specify templates for this purpose.

luck02 commented 7 years ago

After doing some more investigation, it looks like PodPresets may be the way to go: apply a label to our Spark jobs and then provide a PodPreset with the env variables we need. Possibly problematic, as that is a v1alpha1 API, which may not be something we can run in production.
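For readers landing here later, a hedged sketch of what that PodPreset approach could look like, assuming the settings.k8s.io/v1alpha1 API is enabled on the cluster. The label, ConfigMap name, and values below are hypothetical placeholders, not names defined by this project:

```sh
# Minimal PodPreset sketch: pods carrying the matching label get these
# environment variables injected at admission time. The label, ConfigMap
# name, and values are hypothetical examples.
kubectl apply -f - <<'EOF'
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: spark-job-env
  namespace: default
spec:
  selector:
    matchLabels:
      spark-env-preset: enabled      # label the Spark driver/executor pods must carry
  env:
    - name: DEPLOY_ENV               # a directly specified variable
      value: production
  envFrom:
    - configMapRef:
        name: spark-job-config       # a ConfigMap holding the remaining variables
EOF
```

The matching label then has to be attached to the driver and executor pods through whatever label configuration the submission client exposes.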

luck02 commented 7 years ago

There's still basically no documentation on the resource staging server. I'm looking at what mechanisms I can use to get my dependencies available to Spark, hadoop-aws in this case.

I could bake that into the image, and probably will. But it looks like the resource staging server is intended for this use case, yet there's little to no documentation on it.

Thanks!

luck02 commented 7 years ago

Found this (should it be linked in the top-level docs?): https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs/submission-client.md

mccheah commented 7 years ago

@luck02 I believe custom environment variables are not supported right now, but we could easily add configuration options for that. The architecture documentation has been improved somewhat, but there might still be gaps there.

> From an end user standpoint this is problematic as I am not sure which problems this is intended to solve. If there are fuller docs I'd be happy to edit what's currently there.

A valid point, but at least in the YARN docs there isn't much description of how it works (https://spark.apache.org/docs/latest/running-on-yarn.html), and for both YARN and Mesos there isn't a description of the motivation for using the cluster manager in question. I think understanding when each cluster manager is appropriate to use is not so much dependent on Spark as on an understanding of the cluster manager itself. In other words, I think the Kubernetes, YARN, and Mesos documentation is the right place to look when considering which cluster manager to use, not Spark's documentation.

> There's still basically no documentation on the resource staging server. I'm looking at what mechanisms I can use to get my dependencies available to Spark, hadoop-aws in this case.

I believe we discuss this here.

luck02 commented 7 years ago

> A valid point, but at least in the YARN docs there isn't much description of how it works (https://spark.apache.org/docs/latest/running-on-yarn.html), and for both YARN and Mesos there isn't a description of the motivation for using the cluster manager in question. I think understanding when each cluster manager is appropriate to use is not so much dependent on Spark as on an understanding of the cluster manager itself. In other words, I think the Kubernetes, YARN, and Mesos documentation is the right place to look when considering which cluster manager to use, not Spark's documentation.

Slightly confused: I was referring to documentation on the resource staging server. Apologies. I ended up using PodPresets to provide my env variables, but I'm happy to move to any other palatable option.

> There's still basically no documentation on the resource staging server. I'm looking at what mechanisms I can use to get my dependencies available to Spark, hadoop-aws in this case.

> I believe we discuss this here.

I don't think it is described there. There appears to be no language that spells out what the resource staging server does or how it does it in those paragraphs.

From the docs on dependency management:

> Application dependencies that are being submitted from your machine need to be sent to a resource staging server that the driver and executor can then communicate with to retrieve those dependencies.

What's missing from that section is language that clearly describes what the resource staging server does to facilitate this.

What I found eventually was this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs/submission-client.md which actually does have language that was useful to me; I was able to put it together, close the last few unknowns, and get my jobs functional. Specifically, here is the description of how the resource staging server works and what it accomplishes:

> Local jars and files are compacted into a tarball which are then uploaded to the resource staging server. The submission client then knows the secret token that the driver and executors must use to download the files again. These secrets are mounted into an init-container that runs before the driver and executor processes run, and the init-container downloads the uploaded resources from the resource staging server.
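To make the quoted workflow concrete, here is a hedged sketch of a submission that relies on the resource staging server. The hostnames, paths, and class name are hypothetical placeholders, spark.kubernetes.resourceStagingServer.uri is the property the fork's user docs use to point the submission client at the staging server, and image-related configuration is omitted:

```sh
# Hedged sketch: jars/files given as plain local paths on the submitting machine
# are tarred up and uploaded to the staging server, and an init-container in the
# driver/executor pods downloads them before the Spark processes start.
# Hostnames, paths, and the class name are hypothetical placeholders.
bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://my-k8s-apiserver.example.com:6443 \
  --class com.example.MyJob \
  --jars /home/me/libs/hadoop-aws-2.7.3.jar \
  --conf spark.kubernetes.resourceStagingServer.uri=http://staging-server.example.com:10000 \
  /home/me/jobs/my-job.jar
```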

Is there a way to provide a set of environment variables via this same staging mechanism? I see the following PR addresses this: https://github.com/apache-spark-on-k8s/spark/pull/424. However, I'm not sure I want to attach 30 environment variables to the spark-submit client; in an ideal world I'd be able to provide a config map. Right now, as I mention above, I'm using a PodPreset.

If it would help I'd be happy to just submit a documentation PR that would describe what I consider the shortfalls to be.

mccheah commented 7 years ago

> What's missing from that section is language that clearly describes what the resource staging server does to facilitate this. What I found eventually was this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs/submission-client.md which actually does have language that was useful to me; I was able to put it together, close the last few unknowns, and get my jobs functional. Specifically, here is the description of how the resource staging server works and what it accomplishes:

I'm not sure the user documentation should describe exactly how the resource staging server works; that should be an implementation detail for end users. I suppose the main piece of information an administrator needs in order to decide how to deploy the server is the storage backend that holds the files, since that may require provisioning volumes, etc. But I don't think we need much detail in the user documentation, since the resource staging server should be more or less abstracted away from the user. It should be a component similar to the external shuffle service, in that one needs to know how to install it but not how it works. Feel free to submit a documentation PR to suggest otherwise, but in doing so we should probably avoid going into too much technical detail and being too specific.

> Is there a way to provide a set of environment variables via this same staging mechanism? I see the following PR addresses this: #424. However, I'm not sure I want to attach 30 environment variables to the spark-submit client; in an ideal world I'd be able to provide a config map. Right now, as I mention above, I'm using a PodPreset.

In this case I think the PodPreset is the correct mechanism to use. The submission client is mainly for users who deploy the application via spark-submit and thus need the API that spark-submit provides to be translated into the driver pod that is deployed. With that in mind, we should support environment variables in the same way that YARN and Mesos do, with configuration parameters similar to spark.yarn.appMasterEnv.[EnvironmentVariableName] (see here).
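As a point of reference, the existing precedent looks like this on the command line. The spark.yarn.appMasterEnv.* and spark.executorEnv.* properties are existing Spark configuration; a Kubernetes analogue is only sketched in a comment because it did not exist at the time of this thread, and the class name and paths are hypothetical placeholders:

```sh
# YARN precedent: environment variables are passed through configuration
# properties rather than baked into images.
#   spark.yarn.appMasterEnv.[Name]  -> environment of the application master / driver
#   spark.executorEnv.[Name]        -> environment of the executors
# A Kubernetes analogue (something like spark.kubernetes.driverEnv.[Name]) is
# only a hypothetical sketch here; it did not exist at the time of this thread.
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  --conf spark.yarn.appMasterEnv.AWS_REGION=us-west-2 \
  --conf spark.executorEnv.AWS_REGION=us-west-2 \
  /home/me/jobs/my-job.jar
```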

We discussed trying to have arbitrary pod YAML "templates" in https://github.com/apache-spark-on-k8s/spark/issues/38, but we concluded that Pod Presets are the correct approach here. Basically, spark-submit itself shouldn't be expected to handle every single feature Kubernetes supports for pods. It should support the ones that are reasonable to expect given the precedents set by the other cluster managers spark-submit supports. Anything beyond that, I would lean towards handling with Pod Presets or, in some simpler cases, adding supported configuration parameters to the submission client. But I would say that the feature set that is "reasonable" for spark-submit to support is not entirely well defined; we've introduced a lot of custom features with e.g. Kerberos support and mounting the driver's Kubernetes credentials.