DESIGN: Pipeline and project settings (lg3.conf, env vars, ...)

HenrikBengtsson commented 5 years ago

@ivan108 wrote on 2019-05-02:

To continue discussion about best way to setup config files:

Your suggestion with conditional assignment like CONV=${CONV:-patient_ID_conversions.tsv} indeed will give us extra flexibility to overwrite config settings by env var. But does user always need such flexibility? CONV parameter is a good example when such flexibility is not needed, because you don't change it from run to run, in fact you don't need to change it at all until next project. Same is applicable to the most of other parameters.

I also see potential confusion for the users, let's say user edited local config file with different $CONV value he wants for the next runs, but there is still env var $CONV set in the previous run, which will supersede the value in config file . So, the user would start the run with the wrong conversion file!

Suggestion 1: Some variables which change often, like $PATIENT, should never be included in config files.

Suggestion 2: All parameters pointing to files or directories in config files should always have absolute path. In that case we don't need to worry readlink'ing them

Suggestion 3: All parameters in both config files should be exported, e.g. export GATK4="${LG3_HOME}/tools/gatk-4.1.1.0/gatk" In that case we only need to source config files twice: in launching scripts _run_*and PBS scripts *.pbs, but not in numerous downstream shell scripts *.sh. Also much less variables would need to be passed to shell scripts, because they are already in the environment ...

Suggestion 4: local lg3.conf file should be renamed to something else (lg3_user.conf?) and be part of repo with instructions and some initial parameters which are likely to be changed, commented out. During pipeline installation the lg3_user.conf should be somehow copied to the user working directory. May using lg3 test setup ??

What do you think? Thx!

HenrikBengtsson commented 5 years ago

Sorry for the delay. Here are some thoughts (but please know that I haven't spent too much time on it):

There are at least two types of settings:
1. Pipeline-specific settings: These basically never changed by the end user, e.g. software installation paths. In most use cases, the built-in LG3 Pipeline settings will work.
2. Project-specific settings: These change frequently, e.g. sample names, tuning parameters, reference files, ...

In the beginning, that is, in Fall 2018 when we took another stab in modernizing the LG3 Pipeline, everything was more or less hard coded. The first step was to identify what these hardcoded settings were and use, hard-coded, internal environment variables for them. When that was done, we made it possible to override the default values for these variables. That is, they were defined as FOO=${FOO:-default-value} which would allow the user to control them by setting them outside/before calling the pipeline. That's where we were before lg3.conf was introduced.

I think it is a good idea to centralize as many of these pipeline-specific env vars into $LG3_HOME/lg3.conf. Conceptually, this is just replacing:

FOO=${FOO:-abc}
BAR=${BAR:-123}

that is at the top of several script, with a:

source "$LG3_HOME/lg3.conf"

and have that lg3.conf script contain those lines of code instead. I think this process of cleaning up by moving env var settings up to $LG3_HOME/lg3.conf should continue. However, we need to be careful of not moving everything up at once, particularly not project-specific settings. For example, when you moved CONV there, we ran into problems (#128).

So, my proposal is to only move the "obvious" env vars there for now and then worry about the others later. Also, I would not worry about confusing the end user right now where they can use either do export CONV= in their own script or set it in the project-specific ./lg3.conf file. By keeping CONV=${CONV:-default} in the scripts and not moving it up to the global $LG3_HOME/lg3.conf, the use can still set CONV.

HenrikBengtsson commented 4 years ago

FYI, I've now:

updated ${LG3_HOME}/lg3.conf (see commit 785641b72) such that it does not override env vars in case they're already set previously
all run scripts, PBS scripts, and shell scripts now calls source_lg3_conf, i.e. they all acknowledge ${LG3_HOME}/lg3.conf and project-specific ./lg3.conf.

The only thing that is unstable right now is that the user must not edit ./lg3.conf while running the pipeline, e.g. between submitting a job and the job launching. I'll create a new issue about this.

UCSF-Costello-Lab / LG3_Pipeline

DESIGN: Pipeline and project settings (lg3.conf, env vars, ...) #130