Rewrite process_posthoc_cpx_depth_regenotyping.sh - Githubissues

broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing

https://broadinstitute.github.io/gatk-sv/

BSD 3-Clause "New" or "Revised" License

171 stars 72 forks source link

Rewrite process_posthoc_cpx_depth_regenotyping.sh #403

Open mwalker174 opened 2 years ago

mwalker174 commented 2 years ago

This script is not scaling well in larger cohorts (>10K samples) and should be reimplemented in Python.

It is called from the ParseGenotypes tasks in GenotypeCpxCnvs.wdl (subworkflow call stack MakeCohortVcf.wdl -> GenotypeComplexVariants.wdl -> ScatterCpxGenotyping.wdl -> GenotypeCpxCnvs.wdl). Test data can be found in the cohort mode Terra workspace here.

vruano commented 1 year ago

@mwalker174 is this still a thing?

mwalker174 commented 1 year ago

@vruano Very much so

vruano commented 1 year ago

Ok, assign it to me. I cannot do it myself.

vruano commented 1 year ago

Also where can I find example of scaled up files that the script is supposed to be able to handle in practice?

mwalker174 commented 1 year ago

@vruano Have you made any progress on this? We will need this soon and @VJalili is available to take over.

vruano commented 1 year ago

Some, I will try to put my code somewhere but I guess he may want to start from scratch. I reckon you know tomorrow 20th my last day as a broadie. Sorry I didn't contact you earlier.

vruano commented 1 year ago

I cannot push a branch into the repo (no permissions) so instead I will post my code on the cloud here gs://broad-dsde-methods/from-vrr-to-gatk-sv/process_posthoc_cpx_depth_regenotyping.py. Please copy it over asap.

It is not complete I just transcribed the first main loop, and it is rather untested (and largely uncompiled) but I made an effort to make readable and I think is worth for you to take a look and perhaps use it as an start point. ___END___ marks where the python code ends and the rest of the to-be-transcribed bash code starts as I was starting working in the second main loop.

The obvious issue with the bash script, apart that is awful to read and understand, is the vast number of individual processes that are spun to do even the simplest task. The goal is to reduce the number of such processes to just a few; although is plausible to do it all in memory I think it is okay to keep generate some intermediary files.

As you may see in my code I also try to be more efficient indexing some of the input bed files that otherwise are scanned repetitively to process genotype and interval overlaps. Perhaps indexing is not required if the input file sizes and sample count are small but makes the code more robust to problem size without adding much complication at all. So far I also have done without sys-CLI calls to bedtools (avoid spinning the additional scanning bedtool execution) as the intersection criteria is rather simple to implement yourself.

VJalili commented 1 year ago

@vruano Thank you for the tips!

VJalili commented 1 year ago

@mwalker174 is there any good reference run for the performance evaluation? e.g., what was the largest cohort this script was run successfully and how long it took?

shadizaheri commented 1 year ago

Initial Python Version of Genotyping Script Description: This is an attempt to introduce the initial Python version of our genotyping script. Previously, this logic was implemented in Bash, and this PR aims to transition that logic to a more maintainable and readable Python format.

I have included the Python script here.

Please note: I have not thoroughly checked or tested this script. This submission is intended to serve as a starting point for further refinement and optimization. Feedback, suggestions, and thorough reviews are highly encouraged to ensure the quality and functionality of the code. This version of the provided script has not been tested yet.

Additional Note: I have divided the original Bash script into sections to facilitate the transition from the Bash script to Python. I've added the corresponding line numbers from the Bash script as comments within the Python script for reference and easier tracking. This should aid in understanding the structure and mapping the Python code back to its Bash counterpart.

Rooms to improve:

External Command Sanitization: Please ensure that all inputs to os.system() calls are sanitized and validated. This is crucial to prevent potential command injection vulnerabilities.
Variable Initialization: The comments in the script mention that certain variables (like GTDIR and cleaned_output_vcf) should be defined elsewhere. Ensure these variables are initialized correctly in the relevant parts of the code.
Code Repetition: The current version has repeated lines, such as prepare_sample_lists(args.FAMFILE, GTDIR) and setup_genotype_counts_header(GTDIR).
Code Optimization: To optimize the logic and methods in the script.