broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License

Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

Closed kirtanav98 closed 9 months ago

kirtanav98 commented 10 months ago

This pull request improves the efficiency of the SplitVariants task command in TasksGenotypeBatch.wdl. Previously, the SplitVariants task took an input bed file and a number of splits, called svtk to produce multiple bed files, and then used awk to filter and split each bed file according to the required number of splits. That approach was inefficient in memory, time, and cost.

This request instead moves the task's logic into a Python script, which is called on the input vcf file with the number of lines per file and a flag indicating whether BCAs should be examined. svtk is called only once on the vcf to produce a bed file. Each line of the bed file is read only once and appended to a text file whose prefix depends on which condition the line matches: gt5kb (greater than 5kb in length), lt5kb (less than 5kb in length), or, if BCAs are to be examined, whether it is a BCA or an insertion. Once a file reaches its maximum capacity, it is closed and a new file with the same prefix and a new suffix is opened. This improves memory use and runtime because svtk is called on the vcf only once and each line of the bed file is parsed only once.

The Docker images were updated and integrated. The corresponding changes were made in TasksGenotypeBatch.wdl, where the SplitVariants task is defined, and in GenotypeBatch.wdl. All changes passed validation with womtool and cromwell.

Testing involved setting the bca parameter to True and False in combination with different values of n_per_split. Using a standard bed file with the default n_per_split=5000 and checking for BCAs, the runtime was about 1.5 times faster than the previous task using the awk commands. Runtime also increases as n_per_split increases, but not drastically.
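For illustration, the single-pass splitting described above might look roughly like the sketch below. This is not the script added in this PR: the function names `classify` and `split_bed`, the bed column order (chrom, start, end, name, svtype), the output file naming, the `ins` prefix, the restriction of the size classes to DEL/DUP records, and the choice to treat exactly 5000 bp as gt5kb are all assumptions made for the example.

```python
import os
from typing import Dict, Iterable, List, Optional, TextIO

SIZE_CUTOFF = 5000  # assumed 5 kb boundary; the handling of exactly 5000 bp is a guess


def classify(fields: List[str], include_bca: bool) -> Optional[str]:
    """Return a hypothetical output prefix for one bed record, or None to skip it."""
    start, end, svtype = int(fields[1]), int(fields[2]), fields[4]
    if svtype in ("DEL", "DUP"):  # assumption: size classes apply to DEL/DUP only
        return "gt5kb" if end - start >= SIZE_CUTOFF else "lt5kb"
    if include_bca:
        return "ins" if svtype.startswith("INS") else "bca"
    return None


def split_bed(lines: Iterable[str], n_per_split: int, out_dir: str,
              include_bca: bool = False) -> List[str]:
    """Read each bed line once and append it to the open chunk file for its prefix."""
    handles: Dict[str, TextIO] = {}   # currently open file per prefix
    counts: Dict[str, int] = {}       # lines written to the open file per prefix
    next_idx: Dict[str, int] = {}     # next chunk suffix per prefix
    written: List[str] = []
    try:
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            prefix = classify(line.rstrip("\n").split("\t"), include_bca)
            if prefix is None:
                continue
            if prefix not in handles:
                idx = next_idx.get(prefix, 0)
                path = os.path.join(out_dir, f"{prefix}.{idx:06d}.bed")
                handles[prefix] = open(path, "w")
                counts[prefix] = 0
                next_idx[prefix] = idx + 1
                written.append(path)
            handles[prefix].write(line if line.endswith("\n") else line + "\n")
            counts[prefix] += 1
            if counts[prefix] == n_per_split:
                # File is full: close it so the next matching line opens a new suffix.
                handles.pop(prefix).close()
    finally:
        for handle in handles.values():
            handle.close()
    return written
```

In a workflow, `lines` would be the bed output of the single svtk call, for example an open file handle or a subprocess pipe, so the bed is never re-read or re-filtered per category.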