Open mwalker174 opened 2 years ago
@mwalker174 is this still a thing?
@vruano Very much so
Ok, assign it to me. I cannot do it myself.
Also where can I find example of scaled up files that the script is supposed to be able to handle in practice?
@vruano Have you made any progress on this? We will need this soon and @VJalili is available to take over.
Some, I will try to put my code somewhere but I guess he may want to start from scratch. I reckon you know tomorrow 20th my last day as a broadie. Sorry I didn't contact you earlier.
I cannot push a branch into the repo (no permissions) so instead I will post my code on the cloud here gs://broad-dsde-methods/from-vrr-to-gatk-sv/process_posthoc_cpx_depth_regenotyping.py. Please copy it over asap.
It is not complete I just transcribed the first main loop, and it is rather untested (and largely uncompiled) but I made an effort to make readable and I think is worth for you to take a look and perhaps use it as an start point. ___END___
marks where the python code ends and the rest of the to-be-transcribed bash code starts as I was starting working in the second main loop.
The obvious issue with the bash script, apart that is awful to read and understand, is the vast number of individual processes that are spun to do even the simplest task. The goal is to reduce the number of such processes to just a few; although is plausible to do it all in memory I think it is okay to keep generate some intermediary files.
As you may see in my code I also try to be more efficient indexing some of the input bed files that otherwise are scanned repetitively to process genotype and interval overlaps. Perhaps indexing is not required if the input file sizes and sample count are small but makes the code more robust to problem size without adding much complication at all. So far I also have done without sys-CLI calls to bedtools (avoid spinning the additional scanning bedtool execution) as the intersection criteria is rather simple to implement yourself.
@vruano Thank you for the tips!
@mwalker174 is there any good reference run for the performance evaluation? e.g., what was the largest cohort this script was run successfully and how long it took?
Initial Python Version of Genotyping Script Description: This is an attempt to introduce the initial Python version of our genotyping script. Previously, this logic was implemented in Bash, and this PR aims to transition that logic to a more maintainable and readable Python format.
I have included the Python script here.
Please note: I have not thoroughly checked or tested this script. This submission is intended to serve as a starting point for further refinement and optimization. Feedback, suggestions, and thorough reviews are highly encouraged to ensure the quality and functionality of the code. This version of the provided script has not been tested yet.
Additional Note: I have divided the original Bash script into sections to facilitate the transition from the Bash script to Python. I've added the corresponding line numbers from the Bash script as comments within the Python script for reference and easier tracking. This should aid in understanding the structure and mapping the Python code back to its Bash counterpart.
Rooms to improve:
External Command Sanitization: Please ensure that all inputs to os.system() calls are sanitized and validated. This is crucial to prevent potential command injection vulnerabilities.
Variable Initialization: The comments in the script mention that certain variables (like GTDIR and cleaned_output_vcf) should be defined elsewhere. Ensure these variables are initialized correctly in the relevant parts of the code.
Code Repetition: The current version has repeated lines, such as prepare_sample_lists(args.FAMFILE, GTDIR) and setup_genotype_counts_header(GTDIR).
Code Optimization: To optimize the logic and methods in the script.
This script is not scaling well in larger cohorts (>10K samples) and should be reimplemented in Python.
It is called from the
ParseGenotypes
tasks inGenotypeCpxCnvs.wdl
(subworkflow call stackMakeCohortVcf.wdl
->GenotypeComplexVariants.wdl
->ScatterCpxGenotyping.wdl
->GenotypeCpxCnvs.wdl
). Test data can be found in the cohort mode Terra workspace here.