broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License
160 stars 71 forks source link

Prevent sample ID mangling in WGD computation #684

Closed CuriousTim closed 1 month ago

CuriousTim commented 1 month ago

When reading a file in R via read.table, the default is to convert all column names into syntactically valid identifiers. Identifiers can only start with a letter or a dot, but a dot cannot be followed by a number. To "fix" these column names, read.table will prepend them with an X. Because the column names in the WGD scoring matrix are sample IDs, which could begin with a non-letter character, sample IDs become mangled when scoring dosage biases. This results in fragmented rows in the final QC table generated by 02-EvidenceQC. This commit changes the way the dosage bias scoring script reads the WGD scoring matrix so that sample IDs are not changed.

One assumption of this commit is that all the sample IDs are unique.

Fixes #683

mwalker174 commented 1 month ago

Thanks @CuriousTim, have you tested this script? I don't think you'd need to do a full docker rebuild, but at least testing in a local environment. Want to be sure you've tested both on problematic and non-problematic sample ID sets as well.

CuriousTim commented 1 month ago

I tested the script on a batch with only problematic sample IDs and a batch with both problematic and non-problematic sample IDs. I can do some further testing for only non-problematic sample IDs and more exotic sample IDs. I also have not tested anything downstream.

mwalker174 commented 1 month ago

Those extra tests would be a good idea to rule out any possible regressions. I don't think you need to test downstream in this case, as long as the output looks correct to you.

CuriousTim commented 1 month ago

I tested a batch with all non-problematic IDs and the script works. I also tested sample IDs containing combinations of every single printable ASCII character except space (32) and DEL (127) and the only character that breaks the script is a double quote character, though that can be addressed by passing quote = "" to the read.table() call. However, given the sample ID restrictions of the pipeline, I don't think the script needs to cover those cases.