fmarten / JoSimText

A system for word sense induction and disambiguation based on the JoBimText approach

Enhance configuration mechanics #12

Open fmarten opened 7 years ago

fmarten commented 7 years ago

The objectives could be:

It seems reasonable to attack this early, first because attacking it late will leave us no time to gain real-world experience with the changes. Currently there is already a lot of boilerplate, and continuing as we are forces us to create even more to stay consistent.

fmarten commented 7 years ago

This project might be good for inspiration: https://github.com/apache/systemml

fmarten commented 7 years ago

There seem to be three interesting places showing how to create scripts that provide a good entry point into Spark.

I am not yet sure how they fit together, though.

But you can see that they favor your solution of explicitly passing the Spark configuration, such as --driver-memory. What I do not like is how they have hard-coded the defaults.

What I like is that they have a single entry point.

And I have a suggestion for how this would be possible in our situation, even with the concerns you have mentioned (having a fast starting point for researchers, with an overview of all model params and no need to write them down manually). The solution could be to extract the model params into an extra key-value file and then provide solely this key-value file for each "method". The nice part about this idea is that we can later regenerate such a file and include it in the output folder. (That is, by the way, similar to what Spark does within MLlib's model persistence.)
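To make the idea concrete, here is a minimal sketch, assuming a plain Java-properties key-value format; the object name and the file name `model-params.properties` are made up for illustration, not part of the current code:

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.Properties

object ModelParams {

  // Load all model parameters for one "method" from a key-value file.
  def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }

  // Regenerate the same key-value file inside the output folder, so that
  // every result directory documents the parameters that produced it
  // (similar in spirit to the metadata Spark MLlib writes when saving a model).
  def saveTo(props: Properties, outputDir: String): Unit = {
    new File(outputDir).mkdirs()
    val out = new FileOutputStream(new File(outputDir, "model-params.properties"))
    try props.store(out, "parameters used for this run") finally out.close()
  }
}
```

A researcher would then only edit one such file per method instead of writing all parameters on the command line.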

My main point is that a single entry point would reduce boilerplate and make it easier to resolve issues in the scripts.
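Building on the sketch above, such a single entry point could look roughly like this; the method names and the runWSI/runWSD steps are hypothetical placeholders, not the actual API of this project:

```scala
object JoSimTextMain {

  def main(args: Array[String]): Unit = args.toList match {
    // One executable for all methods; Spark settings such as --driver-memory
    // stay on the spark-submit command line, outside of this code.
    case method :: paramsFile :: outputDir :: Nil =>
      val params = ModelParams.load(paramsFile)
      method match {
        case "word-sense-induction" => runWSI(params, outputDir) // hypothetical step
        case "disambiguation"       => runWSD(params, outputDir) // hypothetical step
        case other                  => sys.error(s"Unknown method: $other")
      }
      ModelParams.saveTo(params, outputDir) // regenerate the params file next to the results
    case _ =>
      println("Usage: <method> <params-file> <output-dir>")
  }

  // Placeholders standing in for the actual Spark jobs.
  private def runWSI(params: java.util.Properties, outputDir: String): Unit = ()
  private def runWSD(params: java.util.Properties, outputDir: String): Unit = ()
}
```

spark-submit could then always target this one class, with the Spark configuration passed explicitly on the command line.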

alexanderpanchenko commented 7 years ago

> But you can see that they favor your solution of explicitly passing the Spark configuration, such as --driver-memory.

I thought about this again and am ready to say that I am very much in favor of setting the Spark configuration explicitly like this, not via environment variables.

> What I do not like is how they have hard-coded the defaults.

yeah, what we do now seems to be even more advanced

> create scripts that provide a good entry point into Spark.

my main bias is to make the scripts as simple as possible, which is not really the case in this project. i want them ideally to have no while or for loops, no functions, and as few ifs as possible, so even a kid (=researcher) can read such a bash script. in this project, the scripts are quite complex.

> The nice part about this idea is that we can later regenerate such a file and include it in the output folder. (That is, by the way, similar to what Spark does within MLlib's model persistence.)

I strongly oppose "later" thing. If it is a benefit, then we need to do it now or do not even consider it. For me actually it is not clear how you will do it. Though reflection?

Please answer: which problem are you trying to solve by changing the configuration? Please answer this question very clearly and in as much detail as possible. For now, I cannot really understand it, and this is very important.