apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Followup to envvars in driver/executor PR #449

Open ash211 opened 7 years ago

ash211 commented 7 years ago

Sorry I was slow to review and had post-merge comments on https://github.com/apache-spark-on-k8s/spark/pull/395#pullrequestreview-57641643, which was moved to https://github.com/apache-spark-on-k8s/spark/pull/424 and merged there (on branch-2.2-kubernetes instead of 2.1).

The points were:

Thoughts?

cc @tangzhankun

mccheah commented 7 years ago

I think if an environment variable is duplicated then Kubernetes would throw an error.

tangzhankun commented 7 years ago

@ash211 For the minor docs change, could you clarify what you'd like? For the potential conflict, I implemented a conflict check at first but removed it for the two reasons below:

  1. It seems that Spark core also doesn't do this check after the executor env is parsed from spark-submit options.
  2. Our k8s internal envs don't follow a naming pattern for now, so the detection logic is a bit ugly and fragile to me. And even if we detect a conflict, the question becomes how to handle it: report the conflict to the user and quit, or just leave a warning that the override is invalid and go ahead?

Maybe documenting a hint about this for users is a better idea?

@mccheah I did a test with duplicated custom envs: it doesn't fail, and the environment variable is unique in the container. But one interesting issue is that when there are space characters in any spark-submit conf option, the process fails (not only when setting envs). This issue has been filed in https://github.com/apache-spark-on-k8s/spark/issues/440

mccheah commented 7 years ago

> I did a test with duplicated custom envs: it doesn't fail, and the environment variable is unique in the container.

Which environment variable does it select, then? We can probably add a validation step after the pod has been fully constructed; the validation shouldn't be part of a configuration step.

tangzhankun commented 7 years ago

@mccheah It uses the last value, both when a custom variable is set twice and when an internal Spark env is overridden. Are you saying we should add validation for conflicts with internal environment keys?
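
For illustration, a minimal sketch of that last-value-wins behavior using the fabric8 Kubernetes client model (which this fork's submission code builds pods with); the container name, image, and env values here are made up:

```scala
import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder}

// Add the same env var name twice to a container spec. The API server
// accepts this, and per the test above the container does not fail:
// the last entry wins.
val container = new ContainerBuilder()
  .withName("spark-executor")          // hypothetical name
  .withImage("spark-executor:latest")  // hypothetical image
  .addToEnv(new EnvVarBuilder().withName("SPARK_USER_ENV").withValue("first").build())
  .addToEnv(new EnvVarBuilder().withName("SPARK_USER_ENV").withValue("second").build())
  .build()

// container.getEnv now holds two entries named SPARK_USER_ENV; inside the
// running container only "second" is visible.
```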

mccheah commented 7 years ago

We should probably validate this and throw an error. For now it can just check the list of environment variables and throw on any that are duplicated, with an error message like "[Driver/Executor] environment variable X was given multiple values: [values]. If you did not set this multiple times, they might have been set by spark-submit."
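
A minimal sketch of that check, run once over the fully constructed container spec rather than during configuration; the name validateEnv and the choice of IllegalArgumentException are illustrative, not the actual patch:

```scala
import io.fabric8.kubernetes.api.model.EnvVar
import scala.collection.JavaConverters._

// Group the finished container's env vars by name and fail on any name
// bound more than once, echoing the suggested error message.
def validateEnv(component: String, env: Seq[EnvVar]): Unit = {
  val duplicated = env.groupBy(_.getName).filter { case (_, vs) => vs.size > 1 }
  duplicated.foreach { case (name, vs) =>
    throw new IllegalArgumentException(
      s"$component environment variable $name was given multiple values: " +
        s"${vs.map(_.getValue).mkString(", ")}. If you did not set this " +
        "multiple times, they might have been set by spark-submit.")
  }
}

// e.g. after the driver pod has been built:
// validateEnv("Driver", driverContainer.getEnv.asScala)
```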

erikerlandson commented 7 years ago

I agree w/ @mccheah; multiple settings may be a configuration bug, and silently overwriting could make it hard to find.

tangzhankun commented 7 years ago

@mccheah @erikerlandson sorry for the late reply. I agree that at least a message about the overwrite should be emitted. I'll add a check later.