This PR adds a preview version of the de-novo pipeline, initially implemented by @alsanju, @nicolecalamari, and Steph Hao. This PR keeps the original functionality intact and makes the minimally required changes listed below.
Add a Dockerfile, which was created by taking a brute-force approach trying to make an image as the experimental image used during the initial development phase that was created without a Dockerfile.
Consistently add set -exuo pipefail to all the tasks.
Fix bugs previously ignored/silenced.
Replace all head and tail commands with their awk equivalent since head/tail fail pipefail.
Add || true to the grep calls to address cases when it fails to find a pattern in its input.
I rewrote the tabix_query method because of issues such as “Popen” yielding by char instead of a line and errors such as “device is busy.”
Replace deprecated syntax with their current latest equivalent (e.g., ${sep=" " input} to ~{sep=" " input}, replacing $ with ~).
Replace all python3.x (e.g., python 3.6 and 3.9) calls with python, hence consistently using the version of python that comes with the Docker image (i.e., python3.10), and implement the minimally necessary changes in the python scripts of the pipeline.
Update for consistency in setting up runtime attributes and, for instance, changing templates such as "~{select_first ...} GB" to select_first ... + " GB".
Refactor variables named with reserved words (e.g., variables called list or file are all refactored).
Refactor list/tuple-like returns in Python scripts that are not lists or tuples (e.g., refactor return (a) to return a).
Remove unnecessary parenthesis from if conditions and for loops (e.g., refactor if (a == 10) to if a==10).
Anti-pattern comparisons are replaced with their correct equivalent (e.g., refactor a == None with a is None).
Simplify asserts or loops when possible (e.g., change if a == True to if a).
Improvements in syntax; refactoring all variables, classes, workflows, methods, tasks, and file names to adhere to the same conventions as those used in the GATK-SV pipeline.
Match runtime struct naming with the convention used in the GATK-SV pipeline;
Improvements in style, such as consistent use of indentation and line breaks.
Remove commented-out, temporary, or unused code.
Update all the scripts and WDLs to adhere to PEP8, miniwdl, and womtool requirements.
Improvements to be implemented in the follow-up PRs:
Currently, the scripts lack any error handling, making debugging challenging. Hence, the scripts must be updated with minimal error/exception handling.
In Python scripts, many variable names shadow outer scope variables, which may lead to unwanted overrides. Such variables need to be refactored.
Python scripts contain confusing and anti-pattern use of NA; for instance, dataframe columns are set to "NA" that may not function as intended when asserting for np.nan in that column.
Python scripts contain methods that not all their execution path returns a non-void value; such methods need to be updated to provide valid returns for all their execution path.
Testing
Successfully tested on the methods' team Cromwell server: 60d3c93f-7c14-420c-9b3b-381ee4d6e789
This PR adds a preview version of the de-novo pipeline, initially implemented by @alsanju, @nicolecalamari, and Steph Hao. This PR keeps the original functionality intact and makes the minimally required changes listed below.
Dockerfile,
which was created by taking a brute-force approach trying to make an image as the experimental image used during the initial development phase that was created without aDockerfile.
set -exuo pipefail
to all the tasks.head
andtail
commands with theirawk
equivalent sincehead
/tail
failpipefail.
|| true
to thegrep
calls to address cases when it fails to find a pattern in its input.tabix_query
method because of issues such as “Popen” yielding by char instead of a line and errors such as “device is busy.”${sep=" " input}
to~{sep=" " input}
, replacing$
with~
).python3.x
(e.g., python3.6
and3.9
) calls withpython
, hence consistently using the version of python that comes with the Docker image (i.e.,python3.10
), and implement the minimally necessary changes in the python scripts of the pipeline."~{select_first ...} GB"
toselect_first ... + " GB"
.list
orfile
are all refactored).return (a)
toreturn a
).if
conditions andfor
loops (e.g., refactorif (a == 10)
toif a==10
).a == None
witha is None
).if a == True
toif a
).Improvements to be implemented in the follow-up PRs:
NA
; for instance, dataframe columns are set to"NA"
that may not function as intended when asserting fornp.nan
in that column.Testing
Successfully tested on the methods' team Cromwell server:
60d3c93f-7c14-420c-9b3b-381ee4d6e789